2016-03-08 19:46:37

by Jerome Glisse

Subject: HMM (Heterogeneous Memory Management)

Last time I spoke with Linus and Andrew, the requirement for getting
HMM upstream was having real hardware working with it besides Mellanox
(as Mellanox does not use all HMM features), with both closed source
and open source drivers. Work on the open source driver is underway,
and I anticipate we will get an update from NVIDIA and other parties on
their efforts and plans shortly.

I am re-posting now because I want people to have time to look at HMM
again. The open source driver will stay behind closed doors until the
hardware is released. I can however have the upstream maintainer share
his progress here if anyone feels the need for that.

Other parties such as IBM and Mediatek are also interested in HMM. I
expect they will comment on their respective hardware when they can.

I hope that HMM can be considered for inclusion upstream soon.

This version is virtually the same as the last post (modulo
rebase differences). Tree with the patchset:

git://people.freedesktop.org/~glisse/linux hmm-v12 branch

HMM (Heterogeneous Memory Management) is a helper layer
for device drivers, its main features are :
- Shadow the CPU page table of a process into a device specific
format page table and keep both page tables synchronized.
- Handle DMA mapping of system RAM pages on behalf of the device
(for shadowed page table entries).
- Migrate private anonymous memory to private device memory
and handle CPU page faults (which trigger a migration back
to system memory so the CPU can access it).
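
To make the driver-facing side more concrete, below is a rough sketch of
how a device driver could sit on top of these features. All the my_gpu_*
names are hypothetical and for illustration only; they are not the actual
HMM API from this patchset.

/*
 * Illustrative only: a driver keeps a device specific shadow page table
 * for a process and relies on HMM to keep it coherent with the CPU one.
 */
struct my_gpu_mirror {
	struct my_gpu *gpu;	/* device owning the shadow page table */
	/* ... device specific shadow page table state ... */
};

/* Invoked (via HMM / mmu_notifier) when a CPU range becomes invalid. */
static void my_gpu_mirror_invalidate(struct my_gpu_mirror *mirror,
				     unsigned long start, unsigned long end)
{
	my_gpu_unmap_range(mirror->gpu, start, end);	/* drop shadow entries */
	my_gpu_flush_tlb(mirror->gpu);			/* flush device TLB */
}

/* Invoked on a device page fault: populate the shadow for the range.
 * HMM would walk the CPU page table, DMA map the backing pages (or
 * migrate them to device memory) and hand the result back to the
 * driver in a device independent format. */
static int my_gpu_handle_fault(struct my_gpu_mirror *mirror,
			       unsigned long addr, bool write)
{
	return my_gpu_mirror_populate(mirror, addr & PAGE_MASK,
				      PAGE_SIZE, write);
}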

Benefits of HMM :
- Avoid the current model where device drivers have to pin pages,
which blocks several kernel features (KSM, migration, ...).
- No impact on existing workloads that do not use HMM (it only
adds a couple more if() checks to common code paths).
- Intended as common infrastructure for various hardware.
- Allow userspace APIs to move away from explicit copy code
paths where the application programmer has to manually manage
memcpy to and from device memory.
- Transparent to userspace, for instance allowing a library to
use the GPU without involving the application linked against it.

Change log :

v12:
- Rebase

v11:
- Fix PROT_NONE case
- Fix missing page table walk callback
- Add support for hugetlbfs

v10:
- Minor fixes here and there.

v9:
- Added new device driver helpers.
- Added documentation.
- Improved page table code clarity (minor architectural changes
and better names).

v8:
- Removed currently unused fence code.
- Added DMA mapping on behalf of device.

v7:
- Redone and simplified page table code to match Linus' suggestion
http://article.gmane.org/gmane.linux.kernel.mm/125257

... Lost in translation ...


Why do this ?

Mirroring a process address space is mandatory with OpenCL 2.0 and
with other GPU compute APIs. OpenCL 2.0 allows different levels of
implementation and currently only the lowest 2 are supported on
Linux. To implement the highest level, where CPU and GPU accesses
can happen concurrently and are cache coherent, HMM is needed, or
something providing the same functionality, for instance through
platform hardware.

Hardware solutions such as PCIe ATS/PASID are limited to mirroring
system memory and do not provide a way to migrate memory to device
memory (which offers significantly more bandwidth, up to 10 times
that of regular system memory with a discrete GPU, and also has
lower latency than PCIe transactions).

Current CPUs with the GPU on the same die (AMD or Intel) use
ATS/PASID, and Intel additionally uses a special level of cache
(backed by a large pool of fast memory).

For the foreseeable future, discrete GPUs will remain relevant as
they can have a larger quantity of faster memory than integrated GPUs.

Thus we believe HMM will allow us to leverage discrete GPU memory
in a fashion transparent to the application, with minimal disruption
to the Linux kernel mm code. HMM can also work alongside hardware
solutions such as PCIe ATS/PASID (leaving the regular case to
ATS/PASID while HMM handles the migrated memory case).


Design :

Patches 1, 2, 3 and 4 augment the mmu_notifier API with new
information to more efficiently mirror CPU page table updates.

The first side of HMM, process address space mirroring, is
implemented in patches 5 through 14. This uses a secondary page
table, in which HMM mirrors memory actively used by the device.
HMM does not take a reference on any of the pages; it uses the
mmu_notifier API to track changes to the CPU page table and to
update the mirror page table, all while providing a simple API
to device drivers.

To implement this we use a "generic" page table and not a radix
tree, because we need to store more flags than a radix tree allows
and we need to store DMA addresses (sizeof(dma_addr_t) > sizeof(long)
on some platforms).
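
The dma_addr_t point is what rules out a plain radix tree whose slots
are a single unsigned long. A minimal sketch of what a mirror page table
entry has to hold (illustrative only, the names are hypothetical and this
is not the actual HMM entry format):

/* The entry must fit a DMA address plus HMM flags; dma_addr_t can be
 * 64-bit even when unsigned long is 32-bit (e.g. a 32-bit kernel with
 * 64-bit capable DMA). */
typedef u64 my_hmm_pte_t;

#define MY_HMM_PTE_VALID	(1ULL << 0)
#define MY_HMM_PTE_WRITE	(1ULL << 1)
#define MY_HMM_PTE_DEVICE	(1ULL << 2)	/* backed by device memory */
#define MY_HMM_PTE_FLAGS	(MY_HMM_PTE_VALID | MY_HMM_PTE_WRITE | \
				 MY_HMM_PTE_DEVICE)

static inline my_hmm_pte_t my_hmm_pte_from_dma(dma_addr_t dma, bool write)
{
	/* DMA addresses are page aligned, so low bits are free for flags. */
	return (my_hmm_pte_t)dma | MY_HMM_PTE_VALID |
	       (write ? MY_HMM_PTE_WRITE : 0);
}

static inline dma_addr_t my_hmm_pte_dma_addr(my_hmm_pte_t pte)
{
	return (dma_addr_t)(pte & ~MY_HMM_PTE_FLAGS);
}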


(1) Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/
v8 https://lwn.net/Articles/645515/
v9 https://lwn.net/Articles/651553/
v10 https://lwn.net/Articles/654430/
v11 https://lkml.org/lkml/2015/10/21/739

Cheers,
Jérôme

To: "Andrew Morton" <[email protected]>,
To: <[email protected]>,
To: linux-mm <[email protected]>,
Cc: "Linus Torvalds" <[email protected]>,
Cc: "Mel Gorman" <[email protected]>,
Cc: "H. Peter Anvin" <[email protected]>,
Cc: "Peter Zijlstra" <[email protected]>,
Cc: "Linda Wang" <[email protected]>,
Cc: "Kevin E Martin" <[email protected]>,
Cc: "Andrea Arcangeli" <[email protected]>,
Cc: "Johannes Weiner" <[email protected]>,
Cc: "Larry Woodman" <[email protected]>,
Cc: "Rik van Riel" <[email protected]>,
Cc: "Dave Airlie" <[email protected]>,
Cc: "Jeff Law" <[email protected]>,
Cc: "Brendan Conoboy" <[email protected]>,
Cc: "Joe Donohue" <[email protected]>,
Cc: "Christophe Harle" <[email protected]>,
Cc: "Duncan Poole" <[email protected]>,
Cc: "Sherry Cheung" <[email protected]>,
Cc: "Subhash Gutti" <[email protected]>,
Cc: "John Hubbard" <[email protected]>,
Cc: "Mark Hairgrove" <[email protected]>,
Cc: "Lucien Dunning" <[email protected]>,
Cc: "Cameron Buschardt" <[email protected]>,
Cc: "Arvind Gopalakrishnan" <[email protected]>,
Cc: "Haggai Eran" <[email protected]>,
Cc: "Or Gerlitz" <[email protected]>,
Cc: "Sagi Grimberg" <[email protected]>
Cc: "Shachar Raindel" <[email protected]>,
Cc: "Liran Liss" <[email protected]>,
Cc: "Roland Dreier" <[email protected]>,
Cc: "Sander, Ben" <[email protected]>,
Cc: "Stoner, Greg" <[email protected]>,
Cc: "Bridgman, John" <[email protected]>,
Cc: "Mantor, Michael" <[email protected]>,
Cc: "Blinzer, Paul" <[email protected]>,
Cc: "Morichetti, Laurent" <[email protected]>,
Cc: "Deucher, Alexander" <[email protected]>,
Cc: "Leonid Shamis" <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>


2016-03-08 19:46:48

by Jerome Glisse

Subject: [PATCH v12 01/29] mmu_notifier: add event information to address invalidation v9

The event information will be useful for new users of the mmu_notifier
API. The event argument differentiates between a vma disappearing, a
page being write protected or simply a page being unmapped. This allows
new users to take different paths for different events: for instance, on
unmap the resources used to track a vma are still valid and should stay
around, while if the event says that a vma is being destroyed, any
resources used to track this vma can be freed.
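
As a sketch of what a listener can do with the extra argument (the
my_drv_* helpers are hypothetical; only the callback signature matches
this patch):

static void my_drv_invalidate_range_start(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end,
					  enum mmu_event event)
{
	struct my_drv_mirror *mirror =
		container_of(mn, struct my_drv_mirror, mn);

	/* In every case the secondary page table entries must go. */
	my_drv_invalidate(mirror, start, end);

	/* On MMU_MUNMAP the address range itself disappears, so the
	 * per-range tracking state can be freed too. For events like
	 * MMU_MIGRATE or MMU_WRITE_BACK the range stays valid and the
	 * tracking state should be kept. */
	if (event == MMU_MUNMAP)
		my_drv_free_range_tracking(mirror, start, end);
}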

Changed since v1:
- renamed action into event (updated commit message too).
- simplified the event names and clarified their usage
also documenting what expectations the listener can have with
respect to each event.

Changed since v2:
- Avoid crazy name.
- Do not move code that does not need to move.

Changed since v3:
- Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
- Rebase (no other changes).

Changed since v5:
- Typo fix.
- Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
fact that the address range is still valid, just the pages backing it
are no longer.

Changed since v6:
- try_to_unmap_one() only invalidates when doing migration.
- Differentiate fork from other case.

Changed since v7:
- Renamed the huge page split event to MMU_HUGE_PAGE_SPLIT.
- Renamed MMU_ISDIRTY to MMU_CLEAR_SOFT_DIRTY.
- Renamed MMU_WRITE_PROTECT to MMU_KSM_WRITE_PROTECT.
- English syntax fixes.

Changed since v8:
- Added freeze/unfreeze for new huge page splitting.

Signed-off-by: Jérôme Glisse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 3 +-
drivers/gpu/drm/i915/i915_gem_userptr.c | 3 +-
drivers/gpu/drm/radeon/radeon_mn.c | 3 +-
drivers/infiniband/core/umem_odp.c | 9 ++-
drivers/iommu/amd_iommu_v2.c | 3 +-
drivers/misc/sgi-gru/grutlbpurge.c | 9 ++-
drivers/xen/gntdev.c | 9 ++-
fs/proc/task_mmu.c | 6 +-
include/linux/mmu_notifier.h | 137 +++++++++++++++++++++++++++-----
kernel/events/uprobes.c | 10 ++-
mm/huge_memory.c | 39 ++++++---
mm/hugetlb.c | 23 +++---
mm/ksm.c | 18 +++--
mm/madvise.c | 4 +-
mm/memory.c | 27 ++++---
mm/migrate.c | 9 ++-
mm/mmu_notifier.c | 28 ++++---
mm/mprotect.c | 6 +-
mm/mremap.c | 6 +-
mm/rmap.c | 4 +-
virt/kvm/kvm_main.c | 12 ++-
21 files changed, 265 insertions(+), 103 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index b1969f2..7ca805c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -121,7 +121,8 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
struct interval_tree_node *it;
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 19fb0bdd..6767026 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -117,7 +117,8 @@ static unsigned long cancel_userptr(struct i915_mmu_object *mo)
static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct i915_mmu_notifier *mn =
container_of(_mn, struct i915_mmu_notifier, mn);
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index eef006c..3a9615b 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -121,7 +121,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
struct interval_tree_node *it;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 40becdb..6ed69fa 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,

static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);

@@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);

@@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 7caf2fa..2b4be22 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -392,7 +392,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,

static void mn_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
__mn_flush_page(mn, address);
}
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e936d43..1c220ae 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
*/
static void gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,

static void gru_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm, unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
}

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 1be5dd0..91c6804 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,7 +467,9 @@ static void unmap_if_in_range(struct grant_map *map,

static void mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;
@@ -484,9 +486,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,

static void mn_invl_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+ mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
}

static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e6ee732..68942ee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1037,11 +1037,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
downgrade_write(&mm->mmap_sem);
break;
}
- mmu_notifier_invalidate_range_start(mm, 0, -1);
+ mmu_notifier_invalidate_range_start(mm, 0, -1,
+ MMU_CLEAR_SOFT_DIRTY);
}
walk_page_range(0, ~0UL, &clear_refs_walk);
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_end(mm, 0, -1);
+ mmu_notifier_invalidate_range_end(mm, 0, -1,
+ MMU_CLEAR_SOFT_DIRTY);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a1a210d..906aad0 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,72 @@
struct mmu_notifier;
struct mmu_notifier_ops;

+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ * - MMU_FORK a process is forking. This will lead to vmas getting
+ * write-protected, in order to set up COW
+ *
+ * - MMU_HUGE_PAGE_SPLIT the pages don't move, nor does their content change,
+ * but the page table structure is updated (levels added or removed).
+ *
+ * - MMU_HUGE_FREEZE/MMU_HUGE_UNFREEZE huge page splitting is a multi-step,
+ * first pmd is freeze and then split before being unfreeze.
+ *
+ * - MMU_CLEAR_SOFT_DIRTY need to write protect so write properly update the
+ * soft dirty bit of page table entry.
+ *
+ * - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ * access must stop after invalidate_range_start callback returns.
+ * Furthermore, no read access should be allowed either, as a new page can
+ * be remapped with write access before the invalidate_range_end callback
+ * happens and thus any read access to old page might read stale data. There
+ * are several sources for this event, including:
+ *
+ * - A page moving to swap (various reasons, including page reclaim),
+ * - An mremap syscall,
+ * - migration for NUMA reasons,
+ * - balancing the memory pool,
+ * - write fault on COW page,
+ * - and more that are not listed here.
+ *
+ * - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ * the new access protection. All memory access are still valid until the
+ * invalidate_range_end callback.
+ *
+ * - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ * page are unlocked.
+ *
+ * - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ * process destruction). However, access is still allowed, up until the
+ * invalidate_range_free_pages callback. This also implies that secondary
+ * page table can be trimmed, because the address range is no longer valid.
+ *
+ * - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ * must stop after invalidate_range_start callback returns. Read access are
+ * still allowed.
+ *
+ * - MMU_KSM_WRITE_PROTECT: memory is being write protected for KSM.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+ MMU_FORK = 0,
+ MMU_HUGE_PAGE_SPLIT,
+ MMU_HUGE_FREEZE,
+ MMU_HUGE_UNFREEZE,
+ MMU_CLEAR_SOFT_DIRTY,
+ MMU_MIGRATE,
+ MMU_MPROT,
+ MMU_MUNLOCK,
+ MMU_MUNMAP,
+ MMU_WRITE_BACK,
+ MMU_KSM_WRITE_PROTECT,
+};
+
#ifdef CONFIG_MMU_NOTIFIER

/*
@@ -92,7 +158,8 @@ struct mmu_notifier_ops {
void (*change_pte)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte);
+ pte_t pte,
+ enum mmu_event event);

/*
* Before this is invoked any secondary MMU is still ok to
@@ -103,7 +170,8 @@ struct mmu_notifier_ops {
*/
void (*invalidate_page)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);

/*
* invalidate_range_start() and invalidate_range_end() must be
@@ -150,10 +218,14 @@ struct mmu_notifier_ops {
*/
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);

/*
* invalidate_range() is either called between
@@ -219,13 +291,20 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte);
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);

@@ -262,31 +341,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}

static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_change_pte(mm, address, pte);
+ __mmu_notifier_change_pte(mm, address, pte, event);
}

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address);
+ __mmu_notifier_invalidate_page(mm, address, event);
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end);
+ __mmu_notifier_invalidate_range_start(mm, start, end, event);
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end);
+ __mmu_notifier_invalidate_range_end(mm, start, end, event);
}

static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -403,13 +489,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
* old page would remain mapped readonly in the secondary MMUs after the new
* page is already writable by some CPU through the primary MMU.
*/
-#define set_pte_at_notify(__mm, __address, __ptep, __pte) \
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event) \
({ \
struct mm_struct *___mm = __mm; \
unsigned long ___address = __address; \
pte_t ___pte = __pte; \
\
- mmu_notifier_change_pte(___mm, ___address, ___pte); \
+ mmu_notifier_change_pte(___mm, ___address, ___pte, __event); \
set_pte_at(___mm, ___address, __ptep, ___pte); \
})

@@ -437,22 +523,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}

static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0167679..4f84dc1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -169,7 +169,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
/* For try_to_free_swap() and munlock_vma_page() below */
lock_page(page);

- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
err = -EAGAIN;
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -187,7 +188,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);

page_remove_rmap(page, false);
if (!page_mapped(page))
@@ -201,7 +204,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = 0;
unlock:
mem_cgroup_cancel_charge(kpage, memcg, false);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
unlock_page(page);
return err;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 09de368..75633bb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1207,7 +1207,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1241,7 +1242,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
page_remove_rmap(page, true);
spin_unlock(ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1251,7 +1253,8 @@ out:

out_free_pages:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1356,7 +1359,8 @@ alloc:

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

spin_lock(ptl);
if (page)
@@ -1388,7 +1392,8 @@ alloc:
}
spin_unlock(ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
out:
return ret;
out_unlock:
@@ -2459,7 +2464,8 @@ static void collapse_huge_page(struct mm_struct *mm,

mmun_start = address;
mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
@@ -2469,7 +2475,8 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
_pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -3027,7 +3034,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page = NULL;
unsigned long haddr = address & HPAGE_PMD_MASK;

- mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE,
+ MMU_HUGE_PAGE_SPLIT);
ptl = pmd_lock(mm, pmd);
if (pmd_trans_huge(*pmd)) {
page = pmd_page(*pmd);
@@ -3040,7 +3048,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
__split_huge_pmd_locked(vma, pmd, haddr, false);
out:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE,
+ MMU_HUGE_PAGE_SPLIT);
if (page) {
lock_page(page);
munlock_vma_page(page);
@@ -3205,10 +3214,12 @@ static void freeze_page(struct anon_vma *anon_vma, struct page *page)
unsigned long address = __vma_address(page, avc->vma);

mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
+ address, address + HPAGE_PMD_SIZE,
+ MMU_HUGE_FREEZE);
freeze_page_vma(avc->vma, page, address);
mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
+ address, address + HPAGE_PMD_SIZE,
+ MMU_HUGE_FREEZE);
}
}

@@ -3286,10 +3297,12 @@ static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
unsigned long address = __vma_address(page, avc->vma);

mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
+ address, address + HPAGE_PMD_SIZE,
+ MMU_HUGE_UNFREEZE);
unfreeze_page_vma(avc->vma, page, address);
mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
+ address, address + HPAGE_PMD_SIZE,
+ MMU_HUGE_UNFREEZE);
}
}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 01f2b48..1721d87 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3059,7 +3059,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
- mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(src, mmun_start,
+ mmun_end, MMU_MIGRATE);

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
@@ -3114,7 +3115,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

if (cow)
- mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src, mmun_start,
+ mmun_end, MMU_MIGRATE);

return ret;
}
@@ -3140,7 +3142,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
BUG_ON(end & ~huge_page_mask(h));

tlb_start_vma(tlb, vma);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
address = start;
again:
for (; address < end; address += sz) {
@@ -3215,7 +3218,8 @@ unlock:
if (address < end && !ref_page)
goto again;
}
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
tlb_end_vma(tlb, vma);
}

@@ -3402,8 +3406,8 @@ retry_avoidcopy:

mmun_start = address & huge_page_mask(h);
mmun_end = mmun_start + huge_page_size(h);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
@@ -3424,7 +3428,8 @@ retry_avoidcopy:
new_page = old_page;
}
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out_release_all:
page_cache_release(new_page);
out_release_old:
@@ -3907,7 +3912,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3957,7 +3962,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
i_mmap_unlock_write(vma->vm_file->f_mapping);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);

return pages << h->order;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index 823d78b..5b2f07a 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1011,7 +1011,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,

mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_KSM_WRITE_PROTECT);

ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -1043,7 +1044,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
if (pte_dirty(entry))
set_page_dirty(page);
entry = pte_mkclean(pte_wrprotect(entry));
- set_pte_at_notify(mm, addr, ptep, entry);
+ set_pte_at_notify(mm, addr, ptep, entry, MMU_KSM_WRITE_PROTECT);
}
*orig_pte = *ptep;
err = 0;
@@ -1051,7 +1052,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
out_unlock:
pte_unmap_unlock(ptep, ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_KSM_WRITE_PROTECT);
out:
return err;
}
@@ -1087,7 +1089,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,

mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte_same(*ptep, orig_pte)) {
@@ -1100,7 +1103,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);

page_remove_rmap(page, false);
if (!page_mapped(page))
@@ -1110,7 +1115,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out:
return err;
}
diff --git a/mm/madvise.c b/mm/madvise.c
index a50ac188..f2fdfcd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -433,9 +433,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);

- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
madvise_free_page_range(&tlb, vma, start, end);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
tlb_finish_mmu(&tlb, start, end);

return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 1ef093a..502a8a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1037,7 +1037,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
mmun_end = end;
if (is_cow)
mmu_notifier_invalidate_range_start(src_mm, mmun_start,
- mmun_end);
+ mmun_end, MMU_FORK);

ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
@@ -1054,7 +1054,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
} while (dst_pgd++, src_pgd++, addr = next, addr != end);

if (is_cow)
- mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src_mm, mmun_start,
+ mmun_end, MMU_FORK);
return ret;
}

@@ -1314,10 +1315,12 @@ void unmap_vmas(struct mmu_gather *tlb,
{
struct mm_struct *mm = vma->vm_mm;

- mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_start(mm, start_addr,
+ end_addr, MMU_MUNMAP);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
- mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_end(mm, start_addr,
+ end_addr, MMU_MUNMAP);
}

/**
@@ -1339,10 +1342,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
unmap_single_vma(&tlb, vma, start, end, details);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
tlb_finish_mmu(&tlb, start, end);
}

@@ -1365,9 +1368,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
lru_add_drain();
tlb_gather_mmu(&tlb, mm, address, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, address, end);
+ mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
unmap_single_vma(&tlb, vma, address, end, details);
- mmu_notifier_invalidate_range_end(mm, address, end);
+ mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
tlb_finish_mmu(&tlb, address, end);
}

@@ -2091,7 +2094,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,

__SetPageUptodate(new_page);

- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/*
* Re-check the pte - we dropped the lock
@@ -2125,7 +2129,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
+ set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
update_mmu_cache(vma, address, page_table);
if (old_page) {
/*
@@ -2164,7 +2168,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
page_cache_release(new_page);

pte_unmap_unlock(page_table, ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 577c94b..09ba4bb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1791,12 +1791,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
fail_putback:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1848,7 +1850,8 @@ fail_putback:
set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);

spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5fbdd36..b806bdb 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -159,8 +159,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
return young;
}

-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
- pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -168,13 +170,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->change_pte)
- mn->ops->change_pte(mn, mm, address, pte);
+ mn->ops->change_pte(mn, mm, address, pte, event);
}
srcu_read_unlock(&srcu, id);
}

void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -182,13 +185,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address);
+ mn->ops->invalidate_page(mn, mm, address, event);
}
srcu_read_unlock(&srcu, id);
}

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
+
{
struct mmu_notifier *mn;
int id;
@@ -196,14 +202,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start, end);
+ mn->ops->invalidate_range_start(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -221,7 +230,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
if (mn->ops->invalidate_range)
mn->ops->invalidate_range(mn, mm, start, end);
if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start, end);
+ mn->ops->invalidate_range_end(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6ff5dfa..8f4a8c9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -156,7 +156,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
/* invoke the mmu notifier if the pmd is populated */
if (!mni_start) {
mni_start = addr;
- mmu_notifier_invalidate_range_start(mm, mni_start, end);
+ mmu_notifier_invalidate_range_start(mm, mni_start,
+ end, MMU_MPROT);
}

if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
@@ -186,7 +187,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
} while (pmd++, addr = next, addr != end);

if (mni_start)
- mmu_notifier_invalidate_range_end(mm, mni_start, end);
+ mmu_notifier_invalidate_range_end(mm, mni_start, end,
+ MMU_MPROT);

if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 3fa0a467..9544022 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -175,7 +175,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,

mmun_start = old_addr;
mmun_end = old_end;
- mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
@@ -227,7 +228,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);

- mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

return len + old_addr - old_end; /* how much done */
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 02f0bfc..a24d0b2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,7 +1054,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);

if (ret) {
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
(*cleaned)++;
}
out:
@@ -1552,7 +1552,7 @@ discard:
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
out:
return ret;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 61bc4b9..3889354 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -272,7 +272,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)

static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush, idx;
@@ -314,7 +315,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte)
+ pte_t pte,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int idx;
@@ -330,7 +332,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
@@ -356,7 +359,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);

--
2.4.3

2016-03-08 19:47:01

by Jerome Glisse

Subject: [PATCH v12 02/29] mmu_notifier: keep track of active invalidation ranges v5

The invalidate_range_start() and invalidate_range_end() callbacks can
be considered as forming an "atomic" section from the CPU page
table update point of view. Between these two functions the CPU page
table content is unreliable for the address range being
invalidated.

This patch uses a structure, defined at every place doing range
invalidation. This structure is added to a list for the duration
of the update, i.e. added by invalidate_range_start() and removed
by invalidate_range_end().

Helpers allow querying whether a range is valid and waiting for it
if necessary.

For proper synchronization, users must block any new range
invalidation from inside their invalidate_range_start() callback.
Otherwise there is no guarantee that a new range invalidation will
not be added after the call to the helper function that queries for
existing ranges.
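
A sketch of the intended usage from a device driver fault path, assuming
mmu_notifier_range_wait_active() waits for in-flight invalidations
overlapping the range to finish; the my_drv_* names are hypothetical:

static int my_drv_fill_mirror(struct my_drv_mirror *mirror,
			      struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	int ret;

again:
	/* The driver's invalidate_range_start() callback takes this same
	 * lock, so holding it blocks any new range invalidation. */
	mutex_lock(&mirror->lock);
	if (!mmu_notifier_range_inactive(mm, start, end)) {
		mutex_unlock(&mirror->lock);
		/* Wait for invalidations already in flight over the range. */
		mmu_notifier_range_wait_active(mm, start, end);
		goto again;
	}
	/* The CPU page table is now reliable for [start, end) and stays
	 * so until we drop the lock. */
	ret = my_drv_populate_mirror(mirror, mm, start, end);
	mutex_unlock(&mirror->lock);
	return ret;
}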

Changed since v1:
- Fix a possible deadlock in mmu_notifier_range_wait_active()

Changed since v2:
- Add the range to invalid range list before calling ->range_start().
- Del the range from invalid range list after calling ->range_end().
- Remove useless list initialization.

Changed since v3:
- Improved commit message.
- Added comment to explain how the helper functions are supposed to be used.
- English syntax fixes.

Changed since v4:
- Syntax fixes.
- Rename from range_*_valid to range_*active|inactive.

Signed-off-by: Jérôme Glisse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Haggai Eran <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 13 ++--
drivers/gpu/drm/i915/i915_gem_userptr.c | 16 ++--
drivers/gpu/drm/radeon/radeon_mn.c | 16 ++--
drivers/infiniband/core/umem_odp.c | 20 ++---
drivers/misc/sgi-gru/grutlbpurge.c | 15 ++--
drivers/xen/gntdev.c | 15 ++--
fs/proc/task_mmu.c | 11 ++-
include/linux/mmu_notifier.h | 55 +++++++-------
kernel/events/uprobes.c | 13 ++--
mm/huge_memory.c | 79 +++++++++-----------
mm/hugetlb.c | 55 +++++++-------
mm/ksm.c | 28 +++----
mm/madvise.c | 21 +++---
mm/memory.c | 72 ++++++++++--------
mm/migrate.c | 34 ++++-----
mm/mmu_notifier.c | 128 +++++++++++++++++++++++++++++---
mm/mprotect.c | 18 +++--
mm/mremap.c | 14 ++--
virt/kvm/kvm_main.c | 14 ++--
19 files changed, 369 insertions(+), 268 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 7ca805c..7c9eb1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -119,27 +119,24 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
* unmap them by move them into system domain again.
*/
static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
struct interval_tree_node *it;
-
/* notification is exclusive, but interval is inclusive */
- end -= 1;
+ unsigned long end = range->end - 1;

mutex_lock(&rmn->lock);

- it = interval_tree_iter_first(&rmn->objects, start, end);
+ it = interval_tree_iter_first(&rmn->objects, range->start, end);
while (it) {
struct amdgpu_mn_node *node;
struct amdgpu_bo *bo;
long r;

node = container_of(it, struct amdgpu_mn_node, it);
- it = interval_tree_iter_next(it, start, end);
+ it = interval_tree_iter_next(it, range->start, end);

list_for_each_entry(bo, &node->bos, mn_list) {

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 6767026..5824abe 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -115,22 +115,20 @@ static unsigned long cancel_userptr(struct i915_mmu_object *mo)
}

static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct i915_mmu_notifier *mn =
container_of(_mn, struct i915_mmu_notifier, mn);
struct i915_mmu_object *mo;
-
/* interval ranges are inclusive, but invalidate range is exclusive */
- end--;
+ unsigned long end = range->end - 1;

spin_lock(&mn->lock);
if (mn->has_linear) {
list_for_each_entry(mo, &mn->linear, link) {
- if (mo->it.last < start || mo->it.start > end)
+ if (mo->it.last < range->start ||
+ mo->it.start > end)
continue;

cancel_userptr(mo);
@@ -138,8 +136,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
} else {
struct interval_tree_node *it;

- it = interval_tree_iter_first(&mn->objects, start, end);
+ it = interval_tree_iter_first(&mn->objects, range->start, end);
while (it) {
+ unsigned long start;
+
mo = container_of(it, struct i915_mmu_object, it);
start = cancel_userptr(mo);
it = interval_tree_iter_next(it, start, end);
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index 3a9615b..5276f01 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -112,34 +112,30 @@ static void radeon_mn_release(struct mmu_notifier *mn,
*
* @mn: our notifier
* @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: Address range information.
*
* We block for all BOs between start and end to be idle and
* unmap them by move them into system domain again.
*/
static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
struct interval_tree_node *it;
-
/* notification is exclusive, but interval is inclusive */
- end -= 1;
+ unsigned long end = range->end - 1;

mutex_lock(&rmn->lock);

- it = interval_tree_iter_first(&rmn->objects, start, end);
+ it = interval_tree_iter_first(&rmn->objects, range->start, end);
while (it) {
struct radeon_mn_node *node;
struct radeon_bo *bo;
long r;

node = container_of(it, struct radeon_mn_node, it);
- it = interval_tree_iter_next(it, start, end);
+ it = interval_tree_iter_next(it, range->start, end);

list_for_each_entry(bo, &node->bos, mn_list) {

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ed69fa..58d9a00 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -191,10 +191,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
}

static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);

@@ -203,8 +201,8 @@ static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,

ib_ucontext_notifier_start_account(context);
down_read(&context->umem_rwsem);
- rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
- end,
+ rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+ range->end,
invalidate_range_start_trampoline, NULL);
up_read(&context->umem_rwsem);
}
@@ -217,10 +215,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
}

static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);

@@ -228,8 +224,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
return;

down_read(&context->umem_rwsem);
- rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
- end,
+ rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+ range->end,
invalidate_range_end_trampoline, NULL);
up_read(&context->umem_rwsem);
ib_ucontext_notifier_end_account(context);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 1c220ae..40cf589 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
*/
static void gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
STAT(mmu_invalidate_range);
atomic_inc(&gms->ms_range_active);
gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
- start, end, atomic_read(&gms->ms_range_active));
- gru_flush_tlb_range(gms, start, end - start);
+ range->start, range->end, atomic_read(&gms->ms_range_active));
+ gru_flush_tlb_range(gms, range->start, range->end - range->start);
}

static void gru_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm, unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
(void)atomic_dec_and_test(&gms->ms_range_active);

wake_up_all(&gms->ms_wait_queue);
- gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+ gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+ range->start, range->end);
}

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 91c6804..0ca3720 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,19 +467,17 @@ static void unmap_if_in_range(struct grant_map *map,

static void mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;

mutex_lock(&priv->lock);
list_for_each_entry(map, &priv->maps, next) {
- unmap_if_in_range(map, start, end);
+ unmap_if_in_range(map, range->start, range->end);
}
list_for_each_entry(map, &priv->freeable_maps, next) {
- unmap_if_in_range(map, start, end);
+ unmap_if_in_range(map, range->start, range->end);
}
mutex_unlock(&priv->lock);
}
@@ -489,7 +487,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
unsigned long address,
enum mmu_event event)
{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+ struct mmu_notifier_range range;
+
+ range.start = address;
+ range.end = address + PAGE_SIZE;
+ range.event = event;
+ mn_invl_range_start(mn, mm, &range);
}

static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 68942ee..062cc53 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1011,6 +1011,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
.mm = mm,
.private = &cp,
};
+ struct mmu_notifier_range range = {
+ .start = 0,
+ .end = ~0UL,
+ .event = MMU_CLEAR_SOFT_DIRTY,
+ };

if (type == CLEAR_REFS_MM_HIWATER_RSS) {
/*
@@ -1037,13 +1042,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
downgrade_write(&mm->mmap_sem);
break;
}
- mmu_notifier_invalidate_range_start(mm, 0, -1,
- MMU_CLEAR_SOFT_DIRTY);
+ mmu_notifier_invalidate_range_start(mm, &range);
}
walk_page_range(0, ~0UL, &clear_refs_walk);
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_end(mm, 0, -1,
- MMU_CLEAR_SOFT_DIRTY);
+ mmu_notifier_invalidate_range_end(mm, &range);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 906aad0..c4ba044 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -75,6 +75,13 @@ enum mmu_event {
MMU_KSM_WRITE_PROTECT,
};

+struct mmu_notifier_range {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ enum mmu_event event;
+};
+
#ifdef CONFIG_MMU_NOTIFIER

/*
@@ -88,6 +95,12 @@ struct mmu_notifier_mm {
struct hlist_head list;
/* to serialize the list modifications and hlist_unhashed */
spinlock_t lock;
+ /* List of all active range invalidations. */
+ struct list_head ranges;
+ /* Number of active range invalidations. */
+ int nranges;
+ /* For threads waiting on range invalidations. */
+ wait_queue_head_t wait_queue;
};

struct mmu_notifier_ops {
@@ -218,14 +231,10 @@ struct mmu_notifier_ops {
*/
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ const struct mmu_notifier_range *range);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ const struct mmu_notifier_range *range);

/*
* invalidate_range() is either called between
@@ -298,15 +307,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address,
enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ struct mmu_notifier_range *range);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ struct mmu_notifier_range *range);
extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_inactive(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
+extern void mmu_notifier_range_wait_active(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);

static inline void mmu_notifier_release(struct mm_struct *mm)
{
@@ -358,21 +369,17 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end, event);
+ __mmu_notifier_invalidate_range_start(mm, range);
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end, event);
+ __mmu_notifier_invalidate_range_end(mm, range);
}

static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -536,16 +543,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4f84dc1..36dca4b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -156,9 +156,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
spinlock_t *ptl;
pte_t *ptep;
int err;
- /* For mmu_notifiers */
- const unsigned long mmun_start = addr;
- const unsigned long mmun_end = addr + PAGE_SIZE;
+ struct mmu_notifier_range range;
struct mem_cgroup *memcg;

err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg,
@@ -169,8 +167,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
/* For try_to_free_swap() and munlock_vma_page() below */
lock_page(page);

- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
err = -EAGAIN;
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -204,8 +204,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = 0;
unlock:
mem_cgroup_cancel_charge(kpage, memcg, false);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
unlock_page(page);
return err;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 75633bb..9f5de26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1165,8 +1165,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
pmd_t _pmd;
int ret = 0, i;
struct page **pages;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
GFP_KERNEL);
@@ -1205,10 +1204,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
cond_resched();
}

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1242,8 +1241,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
page_remove_rmap(page, true);
spin_unlock(ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1253,8 +1251,7 @@ out:

out_free_pages:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1273,9 +1270,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
unsigned long haddr;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
gfp_t huge_gfp; /* for allocation and charge */
+ struct mmu_notifier_range range;

ptl = pmd_lockptr(mm, pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1357,10 +1353,10 @@ alloc:
copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

spin_lock(ptl);
if (page)
@@ -1392,8 +1388,7 @@ alloc:
}
spin_unlock(ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return ret;
out_unlock:
@@ -2401,8 +2396,7 @@ static void collapse_huge_page(struct mm_struct *mm,
int isolated = 0, result = 0;
unsigned long hstart, hend;
struct mem_cgroup *memcg;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
gfp_t gfp;

VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2462,10 +2456,10 @@ static void collapse_huge_page(struct mm_struct *mm,
pte = pte_offset_map(pmd, address);
pte_ptl = pte_lockptr(mm, pmd);

- mmun_start = address;
- mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = address;
+ range.end = address + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
@@ -2475,8 +2469,7 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
_pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -3033,9 +3026,12 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct mmu_notifier_range range;

- mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE,
- MMU_HUGE_PAGE_SPLIT);
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_HUGE_PAGE_SPLIT;
+ mmu_notifier_invalidate_range_start(mm, &range);
ptl = pmd_lock(mm, pmd);
if (pmd_trans_huge(*pmd)) {
page = pmd_page(*pmd);
@@ -3048,8 +3044,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
__split_huge_pmd_locked(vma, pmd, haddr, false);
out:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE,
- MMU_HUGE_PAGE_SPLIT);
+ mmu_notifier_invalidate_range_end(mm, &range);
if (page) {
lock_page(page);
munlock_vma_page(page);
@@ -3212,14 +3207,14 @@ static void freeze_page(struct anon_vma *anon_vma, struct page *page)
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
pgoff + HPAGE_PMD_NR - 1) {
unsigned long address = __vma_address(page, avc->vma);
+ struct mmu_notifier_range range;

- mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE,
- MMU_HUGE_FREEZE);
+ range.start = address;
+ range.end = address + HPAGE_PMD_SIZE;
+ range.event = MMU_HUGE_FREEZE;
+ mmu_notifier_invalidate_range_start(avc->vma->vm_mm, &range);
freeze_page_vma(avc->vma, page, address);
- mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE,
- MMU_HUGE_FREEZE);
+ mmu_notifier_invalidate_range_end(avc->vma->vm_mm, &range);
}
}

@@ -3295,14 +3290,14 @@ static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
pgoff, pgoff + HPAGE_PMD_NR - 1) {
unsigned long address = __vma_address(page, avc->vma);
+ struct mmu_notifier_range range;

- mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE,
- MMU_HUGE_UNFREEZE);
+ range.start = address;
+ range.end = address + HPAGE_PMD_SIZE;
+ range.event = MMU_HUGE_UNFREEZE;
+ mmu_notifier_invalidate_range_start(avc->vma->vm_mm, &range);
unfreeze_page_vma(avc->vma, page, address);
- mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE,
- MMU_HUGE_UNFREEZE);
+ mmu_notifier_invalidate_range_end(avc->vma->vm_mm, &range);
}
}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1721d87..cd42f75 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3050,17 +3050,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
int cow;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
int ret = 0;

cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

- mmun_start = vma->vm_start;
- mmun_end = vma->vm_end;
+ range.start = vma->vm_start;
+ range.end = vma->vm_end;
+ range.event = MMU_MIGRATE;
if (cow)
- mmu_notifier_invalidate_range_start(src, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(src, &range);

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
@@ -3100,8 +3099,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
if (cow) {
huge_ptep_set_wrprotect(src, addr, src_pte);
- mmu_notifier_invalidate_range(src, mmun_start,
- mmun_end);
+ mmu_notifier_invalidate_range(src, range.start,
+ range.end);
}
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
@@ -3115,8 +3114,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

if (cow)
- mmu_notifier_invalidate_range_end(src, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(src, &range);

return ret;
}
@@ -3134,16 +3132,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- const unsigned long mmun_start = start; /* For mmu_notifiers */
- const unsigned long mmun_end = end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));

+ range.start = start;
+ range.end = end;
+ range.event = MMU_MIGRATE;
tlb_start_vma(tlb, vma);
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
address = start;
again:
for (; address < end; address += sz) {
@@ -3218,8 +3217,7 @@ unlock:
if (address < end && !ref_page)
goto again;
}
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
tlb_end_vma(tlb, vma);
}

@@ -3324,8 +3322,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
int ret = 0, outside_reserve = 0;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

old_page = pte_page(pte);

@@ -3404,10 +3401,11 @@ retry_avoidcopy:
__SetPageUptodate(new_page);
set_page_huge_active(new_page);

- mmun_start = address & huge_page_mask(h);
- mmun_end = mmun_start + huge_page_size(h);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = address & huge_page_mask(h);
+ range.end = range.start + huge_page_size(h);
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
+
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
@@ -3419,7 +3417,7 @@ retry_avoidcopy:

/* Break COW */
huge_ptep_clear_flush(vma, address, ptep);
- mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range(mm, range.start, range.end);
set_huge_pte_at(mm, address, ptep,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
@@ -3428,8 +3426,7 @@ retry_avoidcopy:
new_page = old_page;
}
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out_release_all:
page_cache_release(new_page);
out_release_old:
@@ -3908,11 +3905,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t pte;
struct hstate *h = hstate_vma(vma);
unsigned long pages = 0;
+ struct mmu_notifier_range range;

BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+ range.start = start;
+ range.end = end;
+ range.event = MMU_MPROT;
+ mmu_notifier_invalidate_range_start(mm, &range);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3962,7 +3963,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
i_mmap_unlock_write(vma->vm_file->f_mapping);
- mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+ mmu_notifier_invalidate_range_end(mm, &range);

return pages << h->order;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index 5b2f07a..75194f5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -994,14 +994,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
static int write_protect_page(struct vm_area_struct *vma, struct page *page,
pte_t *orig_pte)
{
+ struct mmu_notifier_range range;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr;
pte_t *ptep;
spinlock_t *ptl;
int swapped;
int err = -EFAULT;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
@@ -1009,10 +1008,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,

BUG_ON(PageTransCompound(page));

- mmun_start = addr;
- mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_KSM_WRITE_PROTECT);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_KSM_WRITE_PROTECT;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -1052,8 +1051,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
out_unlock:
pte_unmap_unlock(ptep, ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_KSM_WRITE_PROTECT);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return err;
}
@@ -1076,8 +1074,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
spinlock_t *ptl;
unsigned long addr;
int err = -EFAULT;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
@@ -1087,10 +1084,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
if (!pmd)
goto out;

- mmun_start = addr;
- mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte_same(*ptep, orig_pte)) {
@@ -1115,8 +1112,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return err;
}
diff --git a/mm/madvise.c b/mm/madvise.c
index f2fdfcd..6584b70 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -411,8 +411,8 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
static int madvise_free_single_vma(struct vm_area_struct *vma,
unsigned long start_addr, unsigned long end_addr)
{
- unsigned long start, end;
struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range;
struct mmu_gather tlb;

if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -422,21 +422,22 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
if (!vma_is_anonymous(vma))
return -EINVAL;

- start = max(vma->vm_start, start_addr);
- if (start >= vma->vm_end)
+ range.start = max(vma->vm_start, start_addr);
+ if (range.start >= vma->vm_end)
return -EINVAL;
- end = min(vma->vm_end, end_addr);
- if (end <= vma->vm_start)
+ range.end = min(vma->vm_end, end_addr);
+ if (range.end <= vma->vm_start)
return -EINVAL;

lru_add_drain();
- tlb_gather_mmu(&tlb, mm, start, end);
+ tlb_gather_mmu(&tlb, mm, range.start, range.end);
update_hiwater_rss(mm);

- mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
- madvise_free_page_range(&tlb, vma, start, end);
- mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
- tlb_finish_mmu(&tlb, start, end);
+ range.event = MMU_MUNMAP;
+ mmu_notifier_invalidate_range_start(mm, &range);
+ madvise_free_page_range(&tlb, vma, range.start, range.end);
+ mmu_notifier_invalidate_range_end(mm, &range);
+ tlb_finish_mmu(&tlb, range.start, range.end);

return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 502a8a3..532d80f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -998,8 +998,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
bool is_cow;
int ret;

@@ -1033,11 +1032,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* is_cow_mapping() returns true.
*/
is_cow = is_cow_mapping(vma->vm_flags);
- mmun_start = addr;
- mmun_end = end;
+ range.start = addr;
+ range.end = end;
+ range.event = MMU_FORK;
if (is_cow)
- mmu_notifier_invalidate_range_start(src_mm, mmun_start,
- mmun_end, MMU_FORK);
+ mmu_notifier_invalidate_range_start(src_mm, &range);

ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
@@ -1054,8 +1053,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
} while (dst_pgd++, src_pgd++, addr = next, addr != end);

if (is_cow)
- mmu_notifier_invalidate_range_end(src_mm, mmun_start,
- mmun_end, MMU_FORK);
+ mmu_notifier_invalidate_range_end(src_mm, &range);
return ret;
}

@@ -1314,13 +1312,16 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range = {
+ .start = start_addr,
+ .end = end_addr,
+ .event = MMU_MUNMAP,
+ };

- mmu_notifier_invalidate_range_start(mm, start_addr,
- end_addr, MMU_MUNMAP);
+ mmu_notifier_invalidate_range_start(mm, &range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
- mmu_notifier_invalidate_range_end(mm, start_addr,
- end_addr, MMU_MUNMAP);
+ mmu_notifier_invalidate_range_end(mm, &range);
}

/**
@@ -1337,16 +1338,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long end = start + size;
+ struct mmu_notifier_range range = {
+ .start = start,
+ .end = start + size,
+ .event = MMU_MIGRATE,
+ };

lru_add_drain();
- tlb_gather_mmu(&tlb, mm, start, end);
+ tlb_gather_mmu(&tlb, mm, start, range.end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
- for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
- unmap_single_vma(&tlb, vma, start, end, details);
- mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
- tlb_finish_mmu(&tlb, start, end);
+ mmu_notifier_invalidate_range_start(mm, &range);
+ for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+ unmap_single_vma(&tlb, vma, start, range.end, details);
+ mmu_notifier_invalidate_range_end(mm, &range);
+ tlb_finish_mmu(&tlb, start, range.end);
}

/**
@@ -1363,15 +1368,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long end = address + size;
+ struct mmu_notifier_range range = {
+ .start = address,
+ .end = address + size,
+ .event = MMU_MUNMAP,
+ };

lru_add_drain();
- tlb_gather_mmu(&tlb, mm, address, end);
+ tlb_gather_mmu(&tlb, mm, address, range.end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
- unmap_single_vma(&tlb, vma, address, end, details);
- mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
- tlb_finish_mmu(&tlb, address, end);
+ mmu_notifier_invalidate_range_start(mm, &range);
+ unmap_single_vma(&tlb, vma, address, range.end, details);
+ mmu_notifier_invalidate_range_end(mm, &range);
+ tlb_finish_mmu(&tlb, address, range.end);
}

/**
@@ -2004,6 +2013,7 @@ static inline int wp_page_reuse(struct mm_struct *mm,
__releases(ptl)
{
pte_t entry;
+
/*
* Clear the pages cpupid information as the existing
* information potentially belongs to a now completely
@@ -2071,9 +2081,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
spinlock_t *ptl = NULL;
pte_t entry;
int page_copied = 0;
- const unsigned long mmun_start = address & PAGE_MASK; /* For mmu_notifiers */
- const unsigned long mmun_end = mmun_start + PAGE_SIZE; /* For mmu_notifiers */
struct mem_cgroup *memcg;
+ struct mmu_notifier_range range;

if (unlikely(anon_vma_prepare(vma)))
goto oom;
@@ -2094,8 +2103,10 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,

__SetPageUptodate(new_page);

- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = address & PAGE_MASK;
+ range.end = range.start + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

/*
* Re-check the pte - we dropped the lock
@@ -2168,8 +2179,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
page_cache_release(new_page);

pte_unmap_unlock(page_table, ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 09ba4bb..ab96df1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1749,10 +1749,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int isolated = 0;
struct page *new_page = NULL;
int page_lru = page_is_file_cache(page);
- unsigned long mmun_start = address & HPAGE_PMD_MASK;
- unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+ struct mmu_notifier_range range;
pmd_t orig_entry;

+ range.start = address & HPAGE_PMD_MASK;
+ range.end = range.start + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+
/*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
@@ -1778,7 +1781,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* mapping or not. Hence use the tlb range variant
*/
if (mm_tlb_flush_pending(mm))
- flush_tlb_range(vma, mmun_start, mmun_end);
+ flush_tlb_range(vma, range.start, range.end);

/* Prepare a page as a migration target */
__SetPageLocked(new_page);
@@ -1791,14 +1794,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
fail_putback:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1830,16 +1831,16 @@ fail_putback:
* The SetPageUptodate on the new page and page_add_new_anon_rmap
* guarantee the copy is visible before the pagetable update.
*/
- flush_cache_range(vma, mmun_start, mmun_end);
- page_add_anon_rmap(new_page, vma, mmun_start, true);
- pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
- set_pmd_at(mm, mmun_start, pmd, entry);
+ flush_cache_range(vma, range.start, range.end);
+ page_add_anon_rmap(new_page, vma, range.start, true);
+ pmdp_huge_clear_flush_notify(vma, range.start, pmd);
+ set_pmd_at(mm, range.start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);

if (page_count(page) != 2) {
- set_pmd_at(mm, mmun_start, pmd, orig_entry);
- flush_pmd_tlb_range(vma, mmun_start, mmun_end);
- mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+ set_pmd_at(mm, range.start, pmd, orig_entry);
+ flush_pmd_tlb_range(vma, range.start, range.end);
+ mmu_notifier_invalidate_range(mm, range.start, range.end);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(new_page, true);
goto fail_putback;
@@ -1850,8 +1851,7 @@ fail_putback:
set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);

spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
@@ -1876,7 +1876,7 @@ out_dropref:
ptl = pmd_lock(mm, pmd);
if (pmd_same(*pmd, entry)) {
entry = pmd_modify(entry, vma->vm_page_prot);
- set_pmd_at(mm, mmun_start, pmd, entry);
+ set_pmd_at(mm, range.start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
}
spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b806bdb..c43c851 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -191,28 +191,28 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
}

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)

{
struct mmu_notifier *mn;
int id;

+ spin_lock(&mm->mmu_notifier_mm->lock);
+ list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+ mm->mmu_notifier_mm->nranges++;
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start,
- end, event);
+ mn->ops->invalidate_range_start(mn, mm, range);
}
srcu_read_unlock(&srcu, id);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
struct mmu_notifier *mn;
int id;
@@ -228,12 +228,23 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
* (besides the pointer check).
*/
if (mn->ops->invalidate_range)
- mn->ops->invalidate_range(mn, mm, start, end);
+ mn->ops->invalidate_range(mn, mm,
+ range->start, range->end);
if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start,
- end, event);
+ mn->ops->invalidate_range_end(mn, mm, range);
}
srcu_read_unlock(&srcu, id);
+
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ list_del_init(&range->list);
+ mm->mmu_notifier_mm->nranges--;
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
+ /*
+ * Wakeup after callback so they can do their job before any of the
+ * waiters resume.
+ */
+ wake_up(&mm->mmu_notifier_mm->wait_queue);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);

@@ -252,6 +263,98 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);

+/* mmu_notifier_range_inactive_locked() - test if range overlaps with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ * Returns: false if the range overlaps an active invalidation, true otherwise.
+ *
+ * This function tests whether any active invalidation range conflicts with a
+ * given range ([start, end)). Active invalidations are added to a list inside
+ * __mmu_notifier_invalidate_range_start() and removed from that list inside
+ * __mmu_notifier_invalidate_range_end().
+ */
+static bool mmu_notifier_range_inactive_locked(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mmu_notifier_range *range;
+
+ list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+ if (range->end > start && range->start < end)
+ return false;
+ }
+ return true;
+}
+
+/* mmu_notifier_range_inactive() - test if range overlaps with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * Same as mmu_notifier_range_inactive_locked() but takes the mmu_notifier lock.
+ */
+bool mmu_notifier_range_inactive(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ bool valid;
+
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ valid = mmu_notifier_range_inactive_locked(mm, start, end);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_inactive);
+
+/* mmu_notifier_range_wait_active() - wait for a range to have no conflict with
+ * active invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function waits for any active range invalidation that conflicts with
+ * the given range to end.
+ *
+ * Note that by the time this function returns, a new conflicting range
+ * invalidation might have started. So you need to atomically block new
+ * invalidations and query again whether the range is still valid with
+ * mmu_notifier_range_inactive(). The call sequence should be:
+ *
+ * again:
+ * mmu_notifier_range_wait_active()
+ * // Stop new invalidation using common lock with your range_start callback.
+ * lock_block_new_invalidation()
+ * if (!mmu_notifier_range_inactive()) {
+ * unlock_block_new_invalidation();
+ * goto again;
+ * }
+ * // Here you can safely access the CPU page table for the range, knowing
+ * // that you will see valid entries and that no one can change them.
+ * unlock_block_new_invalidation()
+ */
+void mmu_notifier_range_wait_active(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ while (!mmu_notifier_range_inactive_locked(mm, start, end)) {
+ int nranges = mm->mmu_notifier_mm->nranges;
+
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ wait_event(mm->mmu_notifier_mm->wait_queue,
+ nranges != mm->mmu_notifier_mm->nranges);
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ }
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_active);
+
static int do_mmu_notifier_register(struct mmu_notifier *mn,
struct mm_struct *mm,
int take_mmap_sem)
@@ -281,6 +384,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
if (!mm_has_notifiers(mm)) {
INIT_HLIST_HEAD(&mmu_notifier_mm->list);
spin_lock_init(&mmu_notifier_mm->lock);
+ INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+ mmu_notifier_mm->nranges = 0;
+ init_waitqueue_head(&mmu_notifier_mm->wait_queue);

mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8f4a8c9..f0b5b94 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -142,7 +142,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long next;
unsigned long pages = 0;
unsigned long nr_huge_updates = 0;
- unsigned long mni_start = 0;
+ struct mmu_notifier_range range = {
+ .start = 0,
+ };

pmd = pmd_offset(pud, addr);
do {
@@ -154,10 +156,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
continue;

/* invoke the mmu notifier if the pmd is populated */
- if (!mni_start) {
- mni_start = addr;
- mmu_notifier_invalidate_range_start(mm, mni_start,
- end, MMU_MPROT);
+ if (!range.start) {
+ range.start = addr;
+ range.end = end;
+ range.event = MMU_MPROT;
+ mmu_notifier_invalidate_range_start(mm, &range);
}

if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
@@ -186,9 +189,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pages += this_pages;
} while (pmd++, addr = next, addr != end);

- if (mni_start)
- mmu_notifier_invalidate_range_end(mm, mni_start, end,
- MMU_MPROT);
+ if (range.start)
+ mmu_notifier_invalidate_range_end(mm, &range);

if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 9544022..2d2bc47 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -165,18 +165,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
bool need_rmap_locks)
{
unsigned long extent, next, old_end;
+ struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
bool need_flush = false;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */

old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);

- mmun_start = old_addr;
- mmun_end = old_end;
- mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = old_addr;
+ range.end = old_end;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(vma->vm_mm, &range);

for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
@@ -228,8 +227,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);

- mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, &range);

return len + old_addr - old_end; /* how much done */
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3889354..b059307 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -330,10 +330,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
}

static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
@@ -346,7 +344,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
* count is also read inside the mmu_lock critical section.
*/
kvm->mmu_notifier_count++;
- need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+ need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
need_tlb_flush |= kvm->tlbs_dirty;
/* we've to flush the tlb before the pages can be freed */
if (need_tlb_flush)
@@ -357,10 +355,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
}

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);

--
2.4.3
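
To make the retry protocol documented above for mmu_notifier_range_wait_active()
a bit more concrete, a caller (say, a driver about to read CPU page table
entries for mirroring) could follow the documented call sequence roughly as in
the sketch below; the mirror structure and its update_lock, shared with the
driver's invalidate_range_start() callback, are assumptions for the example and
not part of this patch:

static void my_wait_for_stable_range(struct my_mirror *mirror,
                                     struct mm_struct *mm,
                                     unsigned long start, unsigned long end)
{
again:
        mmu_notifier_range_wait_active(mm, start, end);

        /*
         * Our invalidate_range_start() callback takes this same lock, so
         * holding it blocks new invalidations from being acted on by us.
         */
        mutex_lock(&mirror->update_lock);
        if (!mmu_notifier_range_inactive(mm, start, end)) {
                /* A new conflicting invalidation raced in, retry. */
                mutex_unlock(&mirror->update_lock);
                goto again;
        }

        /* CPU page table entries in [start, end) are stable here. */
        /* ... snapshot or mirror the range ... */

        mutex_unlock(&mirror->update_lock);
}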

2016-03-08 19:47:10

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 03/29] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() v2

A listener of mm events might not have an easy way to get the struct
page behind an address invalidated with mmu_notifier_invalidate_page(),
as the call happens after the CPU page table has been cleared or
updated. This is the case, for instance, when the listener stores a dma
mapping inside its secondary page table. To avoid a complex reverse
dma mapping lookup, just pass along a pointer to the page being
invalidated.
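
To illustrate the intent, here is a minimal sketch of what an invalidate_page()
callback could do with the new argument, for a hypothetical driver that keeps a
dma mapping per mirrored page; struct my_mirror, my_shadow_lookup() and
my_shadow_clear() are made-up names for this example, not part of the series:

static void my_invalidate_page(struct mmu_notifier *mn,
                               struct mm_struct *mm,
                               unsigned long address,
                               struct page *page,
                               enum mmu_event event)
{
        struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);
        struct my_shadow_entry *entry;

        /* Find the secondary page table entry covering this address. */
        entry = my_shadow_lookup(mirror, address);
        if (!entry)
                return;

        /*
         * The CPU page table is already cleared at this point. Without the
         * page pointer we would have to translate entry->dma_addr back to a
         * struct page to transfer the device dirty state; with it we can do
         * so directly.
         */
        if (entry->dirty)
                set_page_dirty(page);
        dma_unmap_page(mirror->dev, entry->dma_addr, PAGE_SIZE,
                       DMA_BIDIRECTIONAL);
        my_shadow_clear(mirror, entry);
}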

Changed since v1:
- English syntax fixes.

Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 1 +
drivers/iommu/amd_iommu_v2.c | 1 +
drivers/misc/sgi-gru/grutlbpurge.c | 1 +
drivers/xen/gntdev.c | 1 +
include/linux/mmu_notifier.h | 6 +++++-
mm/mmu_notifier.c | 3 ++-
mm/rmap.c | 4 ++--
virt/kvm/kvm_main.c | 1 +
8 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 58d9a00..0541761 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,6 +166,7 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 2b4be22..bb5c678 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -393,6 +393,7 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
static void mn_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
__mn_flush_page(mn, address);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 40cf589..4268649 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -250,6 +250,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 0ca3720..c318aff 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -485,6 +485,7 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
static void mn_invl_page(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
struct mmu_notifier_range range;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index c4ba044..9e65a3f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -184,6 +184,7 @@ struct mmu_notifier_ops {
void (*invalidate_page)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event);

/*
@@ -305,6 +306,7 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
enum mmu_event event);
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
struct mmu_notifier_range *range);
@@ -362,10 +364,11 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address, event);
+ __mmu_notifier_invalidate_page(mm, address, page, event);
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -538,6 +541,7 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
}
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index c43c851..316e4a9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -177,6 +177,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,

void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
struct mmu_notifier *mn;
@@ -185,7 +186,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address, event);
+ mn->ops->invalidate_page(mn, mm, address, page, event);
}
srcu_read_unlock(&srcu, id);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index a24d0b2..063f8de 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,7 +1054,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);

if (ret) {
- mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
+ mmu_notifier_invalidate_page(mm, address, page, MMU_WRITE_BACK);
(*cleaned)++;
}
out:
@@ -1552,7 +1552,7 @@ discard:
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
- mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
+ mmu_notifier_invalidate_page(mm, address, page, MMU_MIGRATE);
out:
return ret;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b059307..d6dbaab 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -273,6 +273,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
+ struct page *page,
enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
--
2.4.3

2016-03-08 19:47:14

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 04/29] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier

This patch allows invalidating a range while excluding the call to a
specific mmu_notifier, which lets a subsystem invalidate a range for
everyone but itself.
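
As a rough usage sketch (not taken from a later patch in the series), a
subsystem that registered its own mmu_notifier and is invalidating a range on
behalf of its device could do something like the following; the mirror
structure and its mmu_notifier field are assumptions for the example:

        struct mmu_notifier_range range = {
                .start = start,
                .end = end,
                .event = MMU_MIGRATE,
        };

        /*
         * Invalidate everyone else's mappings of the range but skip our own
         * notifier: we are driving this invalidation and will update our
         * device page table ourselves.
         */
        mmu_notifier_invalidate_range_start_excluding(mm, &range,
                                                      &mirror->mmu_notifier);
        /* ... update the CPU page table and the device page table ... */
        mmu_notifier_invalidate_range_end_excluding(mm, &range,
                                                    &mirror->mmu_notifier);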

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/mmu_notifier.h | 66 ++++++++++++++++++++++++++++++++++++++++----
mm/mmu_notifier.c | 16 +++++++++--
2 files changed, 73 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 9e65a3f..acebf00 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -309,11 +309,15 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
struct page *page,
enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- struct mmu_notifier_range *range);
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- struct mmu_notifier_range *range);
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude);
extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ const struct mmu_notifier *exclude);
extern bool mmu_notifier_range_inactive(struct mm_struct *mm,
unsigned long start,
unsigned long end);
@@ -375,21 +379,49 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
struct mmu_notifier_range *range)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, range);
+ __mmu_notifier_invalidate_range_start(mm, range, NULL);
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
struct mmu_notifier_range *range)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, range);
+ __mmu_notifier_invalidate_range_end(mm, range, NULL);
}

static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range(mm, start, end);
+ __mmu_notifier_invalidate_range(mm, start, end, NULL);
+}
+
+static inline void mmu_notifier_invalidate_range_start_excluding(
+ struct mm_struct *mm,
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_start(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+ struct mm_struct *mm,
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_end(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ const struct mmu_notifier *exclude)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range(mm, start, end, exclude);
}

static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -561,6 +593,28 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
{
}

+static inline void mmu_notifier_invalidate_range_start_excluding(
+ struct mm_struct *mm,
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+ struct mm_struct *mm,
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ const struct mmu_notifier *exclude)
+{
+}
+
static inline void mmu_notifier_mm_init(struct mm_struct *mm)
{
}
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 316e4a9..651246f 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -192,7 +192,8 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
}

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- struct mmu_notifier_range *range)
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude)

{
struct mmu_notifier *mn;
@@ -205,6 +206,8 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,

id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn == exclude)
+ continue;
if (mn->ops->invalidate_range_start)
mn->ops->invalidate_range_start(mn, mm, range);
}
@@ -213,13 +216,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- struct mmu_notifier_range *range)
+ struct mmu_notifier_range *range,
+ const struct mmu_notifier *exclude)
{
struct mmu_notifier *mn;
int id;

id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn == exclude)
+ continue;
/*
* Call invalidate_range here too to avoid the need for the
* subsystem of having to register an invalidate_range_end
@@ -250,13 +256,17 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);

void __mmu_notifier_invalidate_range(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ const struct mmu_notifier *exclude)
{
struct mmu_notifier *mn;
int id;

id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn == exclude)
+ continue;
if (mn->ops->invalidate_range)
mn->ops->invalidate_range(mn, mm, start, end);
}
--
2.4.3

2016-03-08 19:47:20

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 07/29] HMM: add per mirror page table v4.

This patch adds the per mirror page table. It also propagates CPU page
table updates to this per mirror page table using the mmu_notifier
callbacks. All updates are contextualized with an HMM event structure
that conveys all the information needed by the device driver to take
proper action (update its own mmu to reflect the changes and schedule
the proper flushing).

Core HMM is responsible for updating the per mirror page table once
the device driver is done with its update. Most importantly, HMM will
properly propagate the HMM page table dirty bit to the underlying page.
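
For a concrete picture of the contract described above, a driver's update()
callback might, very roughly, look like the sketch below (see also the
pte_mask discussion in the hmm_device_ops comment further down); my_mirror,
my_dev_pt_clear_range(), my_dev_pt_write_protect() and my_dev_tlb_flush() are
hypothetical driver-side helpers, not part of this patch:

static int my_update(struct hmm_mirror *mirror, struct hmm_event *event)
{
        struct my_mirror *mmirror = container_of(mirror, struct my_mirror,
                                                 mirror);

        switch (event->etype) {
        case HMM_MUNMAP:
        case HMM_MIGRATE:
                /* The range is going away: drop the device mappings. */
                my_dev_pt_clear_range(mmirror, event->start, event->end);
                break;
        case HMM_FORK:
        case HMM_WRITE_PROTECT:
                /* Only write permission is lost (event->pte_mask clears the
                 * write bit): downgrade the device mappings to read-only. */
                my_dev_pt_write_protect(mmirror, event->start, event->end);
                break;
        default:
                break;
        }

        /*
         * The device must no longer use the old translations before core
         * HMM updates the mirror page table. A real driver would also set
         * the dirty bit (hmm_pte_set_bit()) for pages its hardware wrote.
         */
        my_dev_tlb_flush(mmirror, event->start, event->end);
        return 0;
}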

Changed since v1:
- Removed unused fence code, deferring it to later patches.

Changed since v2:
- Use new bit flag helper for mirror page table manipulation.
- Differentiate fork event with HMM_FORK from other events.

Changed since v3:
- Get rid of HMM_ISDIRTY and rely on write protect instead.
- Adapt to HMM page table changes

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
include/linux/hmm.h | 83 ++++++++++++++++++++
mm/hmm.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 304 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b559c0b..5488fa9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -46,6 +46,7 @@
#include <linux/mmu_notifier.h>
#include <linux/workqueue.h>
#include <linux/mman.h>
+#include <linux/hmm_pt.h>


struct hmm_device;
@@ -53,6 +54,38 @@ struct hmm_mirror;
struct hmm;


+/*
+ * hmm_event - each event is described by a type associated with a struct.
+ */
+enum hmm_etype {
+ HMM_NONE = 0,
+ HMM_FORK,
+ HMM_MIGRATE,
+ HMM_MUNMAP,
+ HMM_DEVICE_RFAULT,
+ HMM_DEVICE_WFAULT,
+ HMM_WRITE_PROTECT,
+};
+
+/* struct hmm_event - memory event information.
+ *
+ * @list: So HMM can keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @pte_mask: HMM pte update mask (bit(s) that are still valid).
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ dma_addr_t pte_mask;
+ enum hmm_etype etype;
+ bool backoff;
+};
+
+
/* hmm_device - Each device must register one and only one hmm_device.
*
* The hmm_device is the link btw HMM and each device driver.
@@ -83,6 +116,54 @@ struct hmm_device_ops {
* so device driver callback can not sleep.
*/
void (*free)(struct hmm_mirror *mirror);
+
+ /* update() - update device mmu following an event.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @event: The event that triggered the update.
+ * Returns: 0 on success or error code {-EIO, -ENOMEM}.
+ *
+ * Called to update the device page table for a range of addresses.
+ * The event type provides the nature of the update:
+ * - Range is no longer valid (munmap).
+ * - Range protection changes (mprotect, COW, ...).
+ * - Range is unmapped (swap, reclaim, page migration, ...).
+ * - Device page fault.
+ * - ...
+ *
+ * Though most device drivers only need to use pte_mask, as it reflects the
+ * change that will happen to the HMM page table, i.e.:
+ * new_pte = old_pte & event->pte_mask;
+ *
+ * Device driver must not update the HMM mirror page table (except the
+ * dirty bit see below). Core HMM will update HMM page table after the
+ * update is done.
+ *
+ * Note that device must be cache coherent with system memory (snooping
+ * in case of PCIE devices) so there should be no need for device to
+ * flush anything.
+ *
+ * When write protection is turned on device driver must make sure the
+ * hardware will no longer be able to write to the page otherwise file
+ * system corruption may occur.
+ *
+ * Device must properly set the dirty bit using hmm_pte_set_bit() on
+ * each page entry for memory that was written by the device. If device
+ * can not properly account for write access then the dirty bit must be
+ * set unconditionally so that proper write back of file backed page
+ * can happen.
+ *
+ * Device drivers must not fail lightly; any failure results in the device
+ * process being killed.
+ *
+ * Return 0 on success, error value otherwise :
+ * -ENOMEM Not enough memory for performing the operation.
+ * -EIO Some input/output error with the device.
+ *
+ * All other return values trigger a warning and are transformed to -EIO.
+ */
+ int (*update)(struct hmm_mirror *mirror,
+ struct hmm_event *event);
};


@@ -149,6 +230,7 @@ int hmm_device_unregister(struct hmm_device *device);
* @kref: Reference counter (private to HMM do not use).
* @dlist: List of all hmm_mirror for same device.
* @mlist: List of all hmm_mirror for same process.
+ * @pt: Mirror page table.
*
* Each device that want to mirror an address space must register one of this
* struct for each of the address space it wants to mirror. Same device can
@@ -161,6 +243,7 @@ struct hmm_mirror {
struct kref kref;
struct list_head dlist;
struct hlist_node mlist;
+ struct hmm_pt pt;
};

int hmm_mirror_register(struct hmm_mirror *mirror);
diff --git a/mm/hmm.c b/mm/hmm.c
index 8d861c4..c172a49 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -45,6 +45,50 @@
#include "internal.h"

static struct mmu_notifier_ops hmm_notifier_ops;
+static void hmm_mirror_kill(struct hmm_mirror *mirror);
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+ struct hmm_event *event);
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+ struct hmm_event *event);
+
+
+/* hmm_event - used to track information relating to an event.
+ *
+ * Each change to the cpu page table or fault from a device is considered an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralizes those, and the helper functions
+ * help deal with all of this.
+ */
+
+static inline int hmm_event_init(struct hmm_event *event,
+ struct hmm *hmm,
+ unsigned long start,
+ unsigned long end,
+ enum hmm_etype etype)
+{
+ event->start = start & PAGE_MASK;
+ event->end = min(end, hmm->vm_end);
+ if (event->start >= event->end)
+ return -EINVAL;
+ event->etype = etype;
+ event->pte_mask = (dma_addr_t)-1ULL;
+ switch (etype) {
+ case HMM_DEVICE_RFAULT:
+ case HMM_DEVICE_WFAULT:
+ break;
+ case HMM_FORK:
+ case HMM_WRITE_PROTECT:
+ event->pte_mask ^= (1 << HMM_PTE_WRITE_BIT);
+ break;
+ case HMM_MIGRATE:
+ case HMM_MUNMAP:
+ event->pte_mask = 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}


/* hmm - core HMM functions.
@@ -123,6 +167,27 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
return NULL;
}

+static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+{
+ struct hmm_mirror *mirror;
+
+ /* Is this hmm already fully stopped? */
+ if (hmm->mm->hmm != hmm)
+ return;
+
+again:
+ down_read(&hmm->rwsem);
+ hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
+ if (hmm_mirror_update(mirror, event)) {
+ mirror = hmm_mirror_ref(mirror);
+ up_read(&hmm->rwsem);
+ hmm_mirror_kill(mirror);
+ hmm_mirror_unref(&mirror);
+ goto again;
+ }
+ up_read(&hmm->rwsem);
+}
+

/* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
*
@@ -139,6 +204,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
down_write(&hmm->rwsem);
while (hmm->mirrors.first) {
struct hmm_mirror *mirror;
+ struct hmm_event event;

/*
* Here we are holding the mirror reference from the mirror
@@ -151,6 +217,10 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
hlist_del_init(&mirror->mlist);
up_write(&hmm->rwsem);

+ /* Make sure everything is unmapped. */
+ hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+ hmm_mirror_update(mirror, &event);
+
mirror->device->ops->release(mirror);
hmm_mirror_unref(&mirror);

@@ -161,8 +231,92 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
hmm_unref(hmm);
}

+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+ unsigned long addr,
+ enum mmu_event mmu_event,
+ enum hmm_etype *etype)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, addr);
+ if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+ *etype = HMM_MUNMAP;
+ return;
+ }
+
+ if (!(vma->vm_flags & VM_WRITE)) {
+ *etype = HMM_WRITE_PROTECT;
+ return;
+ }
+
+ *etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
+{
+ struct hmm_event event;
+ unsigned long start = range->start, end = range->end;
+ struct hmm *hmm;
+
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ if (start >= hmm->vm_end)
+ return;
+
+ switch (range->event) {
+ case MMU_FORK:
+ event.etype = HMM_FORK;
+ break;
+ case MMU_MUNLOCK:
+ /* Still same physical ram backing same address. */
+ return;
+ case MMU_MPROT:
+ hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+ if (event.etype == HMM_NONE)
+ return;
+ break;
+ case MMU_CLEAR_SOFT_DIRTY:
+ case MMU_WRITE_BACK:
+ case MMU_KSM_WRITE_PROTECT:
+ event.etype = HMM_WRITE_PROTECT;
+ break;
+ /* FIXME be more clever about huge page splitting. */
+ case MMU_HUGE_FREEZE:
+ case MMU_HUGE_UNFREEZE:
+ case MMU_HUGE_PAGE_SPLIT:
+ case MMU_MUNMAP:
+ event.etype = HMM_MUNMAP;
+ break;
+ case MMU_MIGRATE:
+ default:
+ event.etype = HMM_MIGRATE;
+ break;
+ }
+
+ hmm_event_init(&event, hmm, start, end, event.etype);
+
+ hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long addr,
+ struct page *page,
+ enum mmu_event mmu_event)
+{
+ struct mmu_notifier_range range;
+
+ range.start = addr & PAGE_MASK;
+ range.end = range.start + PAGE_SIZE;
+ range.event = mmu_event;
+ hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
static struct mmu_notifier_ops hmm_notifier_ops = {
.release = hmm_notifier_release,
+ .invalidate_page = hmm_notifier_invalidate_page,
+ .invalidate_range_start = hmm_notifier_invalidate_range_start,
};


@@ -192,6 +346,7 @@ static void hmm_mirror_destroy(struct kref *kref)
mirror = container_of(kref, struct hmm_mirror, kref);
device = mirror->device;

+ hmm_pt_fini(&mirror->pt);
hmm_unref(mirror->hmm);

spin_lock(&device->lock);
@@ -211,6 +366,59 @@ void hmm_mirror_unref(struct hmm_mirror **mirror)
}
EXPORT_SYMBOL(hmm_mirror_unref);

+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+ struct hmm_event *event)
+{
+ struct hmm_device *device = mirror->device;
+ int ret = 0;
+
+ ret = device->ops->update(mirror, event);
+ hmm_mirror_update_pt(mirror, event);
+ return ret;
+}
+
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+ struct hmm_event *event)
+{
+ unsigned long addr;
+ struct hmm_pt_iter iter;
+
+ hmm_pt_iter_init(&iter, &mirror->pt);
+ for (addr = event->start; addr != event->end;) {
+ unsigned long next = event->end;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte = hmm_pt_iter_lookup(&iter, addr, &next);
+ if (!hmm_pte) {
+ addr = next;
+ continue;
+ }
+ /*
+ * The directory lock protects against concurrent clearing of
+ * page table bit flags. Exceptions are the dirty bit and
+ * the device driver private flags.
+ */
+ hmm_pt_iter_directory_lock(&iter);
+ do {
+ if (!hmm_pte_test_valid_pfn(hmm_pte))
+ continue;
+ if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
+ hmm_pte_test_write(hmm_pte)) {
+ struct page *page;
+
+ page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
+ set_page_dirty(page);
+ }
+ *hmm_pte &= event->pte_mask;
+ if (hmm_pte_test_valid_pfn(hmm_pte))
+ continue;
+ hmm_pt_iter_directory_unref(&iter);
+ } while (addr += PAGE_SIZE, hmm_pte++, addr != next);
+ hmm_pt_iter_directory_unlock(&iter);
+ }
+ hmm_pt_iter_fini(&iter);
+}
+
/* hmm_mirror_register() - register mirror against current process for a device.
*
* @mirror: The mirror struct being registered.
@@ -242,6 +450,11 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
* necessary to make the error path easier for driver and for hmm.
*/
kref_init(&mirror->kref);
+ mirror->pt.last = TASK_SIZE - 1;
+ if (hmm_pt_init(&mirror->pt)) {
+ kfree(mirror);
+ return -ENOMEM;
+ }
INIT_HLIST_NODE(&mirror->mlist);
INIT_LIST_HEAD(&mirror->dlist);
spin_lock(&mirror->device->lock);
@@ -278,6 +491,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
hmm_unref(hmm);
goto error;
}
+ BUG_ON(mirror->pt.last >= hmm->vm_end);
return 0;

error:
@@ -298,8 +512,15 @@ static void hmm_mirror_kill(struct hmm_mirror *mirror)

down_write(&hmm->rwsem);
if (!hlist_unhashed(&mirror->mlist)) {
+ struct hmm_event event;
+
hlist_del_init(&mirror->mlist);
up_write(&hmm->rwsem);
+
+ /* Make sure everything is unmapped. */
+ hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+ hmm_mirror_update(mirror, &event);
+
device->ops->release(mirror);
hmm_mirror_unref(&mirror);
} else
--
2.4.3
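
For readers skimming the API, here is a rough and untested sketch (not part of
the patch) of the shape a driver's ->update() callback could take; everything
prefixed foo_ is hypothetical, and only the callback signature, event types and
the documented return convention (0, -ENOMEM or -EIO) come from the patch:

static int foo_update(struct hmm_mirror *mirror, struct hmm_event *event)
{
        switch (event->etype) {
        case HMM_MUNMAP:
        case HMM_MIGRATE:
                /* The range is going away: tear down the device mappings. */
                return foo_unmap_range(mirror, event->start, event->end);
        case HMM_FORK:
        case HMM_WRITE_PROTECT:
                /*
                 * Downgrade device mappings to read only. Pages the device
                 * already wrote must be reported through the dirty bit of
                 * the mirror page table entries (hmm_pte_set_bit()).
                 */
                return foo_write_protect_range(mirror, event->start,
                                               event->end);
        default:
                /* Other events need no device side work in this sketch. */
                return 0;
        }
}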

2016-03-08 19:47:31

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 06/29] HMM: add HMM page table v4.

Heterogeneous memory management's main purpose is to mirror a process
address space. To do so it must maintain a secondary page table that
is used by the device driver to program the device or to build a
device specific page table.

A radix tree can't be used to create this secondary page table because
HMM needs more flags than RADIX_TREE_MAX_TAGS (while this could be
increased, we believe HMM will require so many flags that the cost
would become prohibitive for other users of the radix tree).

Moreover the radix tree is built around long, but for HMM we need to
store dma addresses and on some platforms sizeof(dma_addr_t) is bigger
than sizeof(long). Thus the radix tree is unsuitable to fulfill HMM's
requirements, hence this code, which allows creating a page table that
can grow and shrink dynamically.

The design is very close to the CPU page table as it reuses some of its
features, such as the spinlock embedded in struct page, see the sketch
of the level arithmetic just below.
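
As a quick illustration of the level arithmetic used by the patch (not part of
the patch, just a userspace toy assuming PAGE_SHIFT == 12, sizeof(dma_addr_t)
== 8 and a 48-bit address space, matching the worked example in hmm_pt.c):

#include <stdio.h>

#define PAGE_SHIFT 12
#define DIR_SHIFT (PAGE_SHIFT - 3)      /* ilog2(sizeof(dma_addr_t)) == 3 */
#define DIR_MASK ((1UL << DIR_SHIFT) - 1)

int main(void)
{
        unsigned long last = (1UL << 48) - 1;           /* hmm_pt.last */
        unsigned long addr = 0x00007f1234567000UL;      /* arbitrary address */
        unsigned shift, level = 0;

        /* Top level covers the leftover high bits, lower levels 9 bits each. */
        shift = ((63 - __builtin_clzl(last >> PAGE_SHIFT)) / DIR_SHIFT)
                * DIR_SHIFT + PAGE_SHIFT;
        for (; shift >= PAGE_SHIFT; shift -= DIR_SHIFT, level++)
                printf("level %u: shift %u index %lu\n",
                       level, shift, (addr >> shift) & DIR_MASK);
        /* Prints shifts 39/30/21/12, i.e. llevel == 3 as in the patch. */
        return 0;
}

The real hmm_pt_init() in the diff below computes the same shift[] array at
runtime from hmm_pt.last.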

Changed since v1:
- Use PAGE_SHIFT as shift value to reserve low bits for private
device specific flags. This is to allow device drivers to use
some of the lower bits for their own device specific purpose.
- Add a set of helpers for atomically clearing, setting and testing bits
on a dma_addr_t pointer. Atomicity is only useful for the dirty bit.
- Differentiate between DMA mapped entries and non mapped entries (pfn).
- Split page directory entry and page table entry helpers.

Changed since v2:
- Rename hmm_pt_iter_update() -> hmm_pt_iter_lookup().
- Rename hmm_pt_iter_fault() -> hmm_pt_iter_populate().
- Add hmm_pt_iter_walk()
- Remove hmm_pt_iter_next() (useless now).
- Code simplification and improved comments.
- Fix hmm_pt_fini_directory().

Changed since v3:
- Fix hmm_pt_iter_directory_unref_safe().

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
MAINTAINERS | 2 +
include/linux/hmm_pt.h | 342 ++++++++++++++++++++++++++++
mm/Makefile | 2 +-
mm/hmm_pt.c | 603 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 948 insertions(+), 1 deletion(-)
create mode 100644 include/linux/hmm_pt.h
create mode 100644 mm/hmm_pt.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0fc4c5f..c8f98ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5045,6 +5045,8 @@ L: [email protected]
S: Maintained
F: mm/hmm.c
F: include/linux/hmm.h
+F: mm/hmm_pt.c
+F: include/linux/hmm_pt.h

HOST AP DRIVER
M: Jouni Malinen <[email protected]>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..4a8beb1
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,342 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is.
+ *
+ * The HMM page table relies on a locking mechanism similar to the CPU page
+ * table for page table updates. It uses the spinlock embedded inside the
+ * struct page to protect changes to a page table directory, which should
+ * minimize lock contention for concurrent updates.
+ *
+ * It also provides a directory tree protection mechanism. Unlike the CPU page
+ * table there is no mmap semaphore to protect the directory tree from removal;
+ * this is intentional so that concurrent removal/insertion of directories
+ * inside the tree can happen.
+ *
+ * So anyone walking down the page table must protect every directory it
+ * traverses so that it is not freed by some other thread. This is done by
+ * using a reference counter for each directory. Before traversing a directory
+ * a reference is taken and once traversal is done the reference is dropped.
+ *
+ * A directory entry dereference and refcount increment of sub-directory page
+ * must happen in a critical rcu section so that directory page removal can
+ * gracefully wait for all possible other threads that might have dereferenced
+ * the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format used internally by HMM and exposed to HMM users so
+ * they can extract information out of the HMM page table.
+ *
+ * Device drivers should only rely on the helpers and should not traverse the
+ * page table themselves.
+ */
+#define HMM_PT_MAX_LEVEL 6
+
+#define HMM_PDE_VALID_BIT 0
+#define HMM_PDE_VALID (1 << HMM_PDE_VALID_BIT)
+#define HMM_PDE_PFN_MASK (~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
+{
+ return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
+}
+
+static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
+{
+ return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
+}
+
+
+/*
+ * The HMM_PTE_VALID_DMA_BIT is set for a valid DMA mapped entry, while for a
+ * pfn entry the HMM_PTE_VALID_PFN_BIT is set. If the hmm_device is associated
+ * with a valid struct device then the device driver will be supplied with DMA
+ * mapped entries, otherwise it will be supplied with pfn entries.
+ *
+ * In the first case the device driver must ignore any pfn entry as it might
+ * show up as a transient state while HMM is mapping the page.
+ */
+#define HMM_PTE_VALID_DMA_BIT 0
+#define HMM_PTE_VALID_PFN_BIT 1
+#define HMM_PTE_WRITE_BIT 2
+#define HMM_PTE_DIRTY_BIT 3
+/*
+ * Reserve some bits for device driver private flags. Note that these can only
+ * be manipulated using the hmm_pte_*_bit() set of helpers.
+ *
+ * WARNING: ONLY SET/CLEAR THOSE FLAGS ON PTE ENTRIES THAT HAVE THE VALID BIT
+ * SET, AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
+ */
+#define HMM_PTE_HW_SHIFT 4
+
+#define HMM_PTE_PFN_MASK (~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+#define HMM_PTE_DMA_MASK (~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+
+#ifdef __BIG_ENDIAN
+/*
+ * The dma_addr_t casting we do on little endian does not work on big endian.
+ * It would require some macro trickery to adjust the bit value depending on
+ * the number of bits unsigned long has in comparison to dma_addr_t. This is
+ * just low on the todo list for now.
+ */
+#error "HMM not supported on BIG_ENDIAN architecture.\n"
+#else /* __BIG_ENDIAN */
+static inline void hmm_pte_clear_bit(dma_addr_t *ptep, unsigned char bit)
+{
+ clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline void hmm_pte_set_bit(dma_addr_t *ptep, unsigned char bit)
+{
+ set_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_bit(dma_addr_t *ptep, unsigned char bit)
+{
+ return !!test_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_clear_bit(dma_addr_t *ptep,
+ unsigned char bit)
+{
+ return !!test_and_clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
+ unsigned char bit)
+{
+ return !!test_and_set_bit(bit, (unsigned long *)ptep);
+}
+#endif /* __BIG_ENDIAN */
+
+
+#define HMM_PTE_CLEAR_BIT(name, bit)\
+ static inline void hmm_pte_clear_##name(dma_addr_t *ptep)\
+ {\
+ return hmm_pte_clear_bit(ptep, bit);\
+ }
+
+#define HMM_PTE_SET_BIT(name, bit)\
+ static inline void hmm_pte_set_##name(dma_addr_t *ptep)\
+ {\
+ return hmm_pte_set_bit(ptep, bit);\
+ }
+
+#define HMM_PTE_TEST_BIT(name, bit)\
+ static inline bool hmm_pte_test_##name(dma_addr_t *ptep)\
+ {\
+ return hmm_pte_test_bit(ptep, bit);\
+ }
+
+#define HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+ static inline bool hmm_pte_test_and_clear_##name(dma_addr_t *ptep)\
+ {\
+ return hmm_pte_test_and_clear_bit(ptep, bit);\
+ }
+
+#define HMM_PTE_TEST_AND_SET_BIT(name, bit)\
+ static inline bool hmm_pte_test_and_set_##name(dma_addr_t *ptep)\
+ {\
+ return hmm_pte_test_and_set_bit(ptep, bit);\
+ }
+
+#define HMM_PTE_BIT_HELPER(name, bit)\
+ HMM_PTE_CLEAR_BIT(name, bit)\
+ HMM_PTE_SET_BIT(name, bit)\
+ HMM_PTE_TEST_BIT(name, bit)\
+ HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+ HMM_PTE_TEST_AND_SET_BIT(name, bit)
+
+HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
+HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
+HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+ return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+ return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
+}
+
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: page global directory (top first level of the directory tree).
+ * @lock: Share lock if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ * (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+ unsigned long mask[HMM_PT_MAX_LEVEL];
+ unsigned long directory_mask;
+ unsigned long last;
+ dma_addr_t *pgd;
+ spinlock_t lock;
+ unsigned char shift[HMM_PT_MAX_LEVEL];
+ unsigned char llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+ unsigned long addr,
+ unsigned level)
+{
+ return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+ struct page *ptd,
+ unsigned level)
+{
+ if (level)
+ spin_lock(&ptd->ptl);
+ else
+ spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+ struct page *ptd,
+ unsigned level)
+{
+ if (level)
+ spin_unlock(&ptd->ptl);
+ else
+ spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+ struct page *ptd,
+ unsigned level)
+{
+ spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+ struct page *ptd,
+ unsigned level)
+{
+ spin_unlock(&pt->lock);
+}
+#endif
+
+static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
+ struct page *ptd)
+{
+ if (!atomic_inc_not_zero(&ptd->_mapcount))
+ /* Illegal this should not happen. */
+ BUG();
+}
+
+static inline void hmm_pt_directory_unref(struct hmm_pt *pt,
+ struct page *ptd)
+{
+ if (atomic_dec_and_test(&ptd->_mapcount))
+ /* Illegal this should not happen. */
+ BUG();
+
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointer for each levels.
+ * @ptdp: Array of pointer to mapped directory levels.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+ struct page *ptd[HMM_PT_MAX_LEVEL - 1];
+ dma_addr_t *ptdp[HMM_PT_MAX_LEVEL - 1];
+ struct hmm_pt *pt;
+ struct list_head dead_directories;
+ unsigned long cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter);
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+ unsigned long *addr,
+ unsigned long *next);
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+ unsigned long addr,
+ unsigned long *next);
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+ unsigned long addr,
+ unsigned long *next);
+
+/* hmm_pt_iter_directory_ref() - reference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will reference the current entry directory. Call this when
+ * you add a new valid entry to the entry directory.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
+{
+ BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+ hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+/* hmm_pt_iter_directory_unref() - unreference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will unreference the current entry directory. Call this when
+ * you remove a valid entry from the entry directory.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter)
+{
+ BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+ hmm_pt_directory_unref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter)
+{
+ struct hmm_pt *pt = iter->pt;
+
+ hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter)
+{
+ struct hmm_pt *pt = iter->pt;
+
+ hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index a8255cf..b1dc1e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -85,4 +85,4 @@ obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..ed766a0
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,603 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is and
+ * include/linux/hmm_pt.h for the page table interface.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function will initialize the HMM page table and allocate memory for the
+ * global directory. Only the hmm_pt.last field needs to be set prior to
+ * calling this function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+ unsigned directory_shift, i = 0, npgd;
+
+ /* Align end address with end of page for current arch. */
+ pt->last |= (PAGE_SIZE - 1);
+ spin_lock_init(&pt->lock);
+ /*
+ * Directory shift is the number of bits that a single directory level
+ * represents. For instance if PAGE_SIZE is 4096 and each entry takes 8
+ * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+ */
+ directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+ /*
+ * Level 0 is the root level of the page table. It might use less
+ * bits than directory_shift but all sub-directory level will use all
+ * directory_shift bits.
+ *
+ * For instance if hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12 and
+ * sizeof(dma_addr_t) == 8 then :
+ * directory_shift = 9
+ * shift[0] = 39
+ * shift[1] = 30
+ * shift[2] = 21
+ * shift[3] = 12
+ * llevel = 3
+ *
+ * Note that shift[llevel] == PAGE_SHIFT because the last level
+ * corresponds to the page table entry level (ignoring the case of huge
+ * page).
+ */
+ pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+ directory_shift) + PAGE_SHIFT;
+ while (pt->shift[i++] > PAGE_SHIFT)
+ pt->shift[i] = pt->shift[i - 1] - directory_shift;
+ pt->llevel = i - 1;
+ pt->directory_mask = (1 << directory_shift) - 1;
+
+ for (i = 0; i <= pt->llevel; ++i)
+ pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+ npgd = (pt->last >> pt->shift[0]) + 1;
+ pt->pgd = kcalloc(npgd, sizeof(dma_addr_t), GFP_KERNEL);
+ if (!pt->pgd)
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+ struct page *ptd,
+ unsigned level)
+{
+ dma_addr_t *ptdp;
+ unsigned i;
+
+ if (level == pt->llevel)
+ return;
+
+ ptdp = kmap(ptd);
+ for (i = 0; i <= pt->directory_mask; ++i) {
+ struct page *lptd;
+
+ if (!(ptdp[i] & HMM_PDE_VALID))
+ continue;
+ lptd = pfn_to_page(hmm_pde_pfn(ptdp[i]));
+ ptdp[i] = 0;
+ hmm_pt_fini_directory(pt, lptd, level + 1);
+ atomic_set(&lptd->_mapcount, -1);
+ __free_page(lptd);
+ }
+ kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function will free all resources of a directory page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+ unsigned i;
+
+ /* Free all directory. */
+ for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+ struct page *ptd;
+
+ if (!(pt->pgd[i] & HMM_PDE_VALID))
+ continue;
+ ptd = pfn_to_page(hmm_pde_pfn(pt->pgd[i]));
+ pt->pgd[i] = 0;
+ hmm_pt_fini_directory(pt, ptd, 1);
+ atomic_set(&ptd->_mapcount, -1);
+ __free_page(ptd);
+ }
+
+ kfree(pt->pgd);
+ pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+/* hmm_pt_level_start() - Start (inclusive) address of directory at given level
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory start address.
+ * @level: Directory level.
+ *
+ * This returns the start address of the directory at a given level for a given
+ * address. So using the usual x86-64 example with :
+ * (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ * llevel = 3 (which is the page table entry level)
+ * shift[0] = 39 mask[0] = ~((1 << 39) - 1)
+ * shift[1] = 30 mask[1] = ~((1 << 30) - 1)
+ * shift[2] = 21 mask[2] = ~((1 << 21) - 1)
+ * shift[3] = 12 mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ * start = hmm_pt_level_start(pt, addr, 3)
+ * = addr & pt->mask[3 - 1]
+ * = addr & ~((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+ unsigned long addr,
+ unsigned level)
+{
+ return level ? addr & pt->mask[level - 1] : 0;
+}
+
+/* hmm_pt_level_end() - End address (inclusive) of directory at given level.
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory end address.
+ * @level: Directory level.
+ *
+ * This returns the end address of the directory at a given level for a given
+ * address. So using the usual x86-64 example with :
+ * (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ * llevel = 3 (which is the page table entry level)
+ * shift[0] = 39 mask[0] = ~((1 << 39) - 1)
+ * shift[1] = 30 mask[1] = ~((1 << 30) - 1)
+ * shift[2] = 21 mask[2] = ~((1 << 21) - 1)
+ * shift[3] = 12 mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ * start = hmm_pt_level_end(pt, addr, 3)
+ * = addr | ~pt->mask[3 - 1]
+ * = addr | ((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+ unsigned long addr,
+ unsigned level)
+{
+ return level ? (addr | (~pt->mask[level - 1])) : pt->last;
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+ unsigned long addr)
+{
+ struct hmm_pt *pt = iter->pt;
+
+ BUG_ON(!iter->ptd[pt->llevel - 1] ||
+ addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+ addr > hmm_pt_level_end(pt, iter->cur, pt->llevel));
+ return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+/* hmm_pt_iter_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will initialize the iterator states. It must always be paired
+ * with a call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+ iter->pt = pt;
+ memset(iter->ptd, 0, sizeof(iter->ptd));
+ memset(iter->ptdp, 0, sizeof(iter->ptdp));
+ INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function will unreference a directory and add it to the dead list if
+ * the directory no longer has any reference. It will also clear the entry for
+ * that directory in the upper level directory, as well as dropping a ref
+ * on the upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+ unsigned level)
+{
+ struct page *upper_ptd;
+ dma_addr_t *upper_ptdp;
+
+ /* Nothing to do for root level. */
+ if (!level)
+ return;
+
+ if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+ return;
+
+ upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+ upper_ptdp = level > 1 ? iter->ptdp[level - 2] : iter->pt->pgd;
+ upper_ptdp = &upper_ptdp[hmm_pt_index(iter->pt, iter->cur, level - 1)];
+ hmm_pt_directory_lock(iter->pt, upper_ptd, level - 1);
+ /*
+ * There might be a race between decrementing the reference count on a
+ * directory and another thread trying to fault in a new directory. To
+ * avoid erasing the new directory entry we need to check that the entry
+ * still corresponds to the directory we are removing.
+ */
+ if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+ *upper_ptdp = 0;
+ hmm_pt_directory_unlock(iter->pt, upper_ptd, level - 1);
+
+ /* Add it to delayed free list. */
+ list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+ /*
+ * The upper directory is now safe to unref as we have an extra ref and
+ * thus refcount should not reach 0.
+ */
+ if (upper_ptd)
+ hmm_pt_directory_unref(iter->pt, upper_ptd);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+ unsigned level)
+{
+ if (!iter->ptd[level - 1])
+ return;
+ kunmap(iter->ptd[level - 1]);
+ hmm_pt_iter_directory_unref_safe(iter, level);
+ iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function will protect a directory by taking a reference. It will also
+ * map the directory to allow cpu access.
+ *
+ * Calls to this function must be made from inside the rcu read critical
+ * section that converted the table entry to the directory struct page. Doing
+ * so allows supporting concurrent removal of directories because this function
+ * takes the reference inside the rcu critical section, and thus rcu
+ * synchronization guarantees that we can safely free a directory.
+ */
+static int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+ struct page *ptd,
+ unsigned long addr,
+ unsigned level)
+{
+ /* This must be called inside the rcu read section. */
+ BUG_ON(!rcu_read_lock_held());
+
+ if (!level || iter->ptd[level - 1]) {
+ rcu_read_unlock();
+ return -EINVAL;
+ }
+
+ if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+ rcu_read_unlock();
+ return 0;
+ }
+
+ rcu_read_unlock();
+
+ iter->ptd[level - 1] = ptd;
+ iter->ptdp[level - 1] = kmap(ptd);
+ iter->cur = addr;
+
+ return 1;
+}
+
+/* hmm_pt_iter_walk() - Walk page table for a valid entry directory.
+ *
+ * @iter: Iterator states.
+ * @addr: Start address of the range, return address of the entry directory.
+ * @next: End address of the range, return address of next directory.
+ * Returns Entry directory pointer and associated address if a valid entry
+ * directory exist in the range, or NULL and empty (*addr=*next) range
+ * otherwise.
+ *
+ * This function will return the first valid entry directory over a range of
+ * addresses. It updates the addr parameter with the entry address and the next
+ * parameter with the address of the end of that directory. So a device driver
+ * can do :
+ *
+ * for (addr = start; addr < end;) {
+ * unsigned long next = end;
+ *
+ * for (ptep = hmm_pt_iter_walk(iter, &addr, &next);
+ * ptep && addr < next; addr += PAGE_SIZE, ptep++) {
+ * // Use ptep
+ * }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+ unsigned long *addr,
+ unsigned long *next)
+{
+ struct hmm_pt *pt = iter->pt;
+ int i;
+
+ *addr &= PAGE_MASK;
+
+ if (iter->ptd[pt->llevel - 1] &&
+ *addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+ *addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+ *next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel)+1);
+ return hmm_pt_iter_ptdp(iter, *addr);
+ }
+
+again:
+ /* First unprotect any directory that do not cover the address. */
+ for (i = pt->llevel; i >= 1; --i) {
+ if (!iter->ptd[i - 1])
+ continue;
+ if (*addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+ *addr <= hmm_pt_level_end(pt, iter->cur, i))
+ break;
+ hmm_pt_iter_unprotect_directory(iter, i);
+ }
+
+ /* Walk down to last level of the directory tree. */
+ for (; i < pt->llevel; ++i) {
+ struct page *ptd;
+ dma_addr_t pte, *ptdp;
+
+ rcu_read_lock();
+ ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+ pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, *addr, i)]);
+ if (!(pte & HMM_PDE_VALID)) {
+ rcu_read_unlock();
+ *addr = hmm_pt_level_end(pt, iter->cur, i) + 1;
+ if (*addr > *next) {
+ *addr = *next;
+ return NULL;
+ }
+ goto again;
+ }
+ ptd = pfn_to_page(hmm_pde_pfn(pte));
+ /* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+ if (hmm_pt_iter_protect_directory(iter, ptd,
+ *addr, i + 1) != 1) {
+ if (*addr > *next) {
+ *addr = *next;
+ return NULL;
+ }
+ goto again;
+ }
+ }
+
+ *next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel) + 1);
+ return hmm_pt_iter_ptdp(iter, *addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_walk);
+
+/* hmm_pt_iter_lookup() - Lookup entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer for a given address as
+ * well as the end address of that directory (address of the next directory).
+ * Usage pattern is :
+ *
+ * for (addr = start; addr < end;) {
+ * unsigned long next = end;
+ *
+ * ptep = hmm_pt_iter_lookup(iter, addr, &next);
+ * if (!ptep) {
+ * addr = next;
+ * continue;
+ * }
+ * for (; addr < next; addr += PAGE_SIZE, ptep++) {
+ * // Use ptep
+ * }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+ unsigned long addr,
+ unsigned long *next)
+{
+ struct hmm_pt *pt = iter->pt;
+ int i;
+
+ addr &= PAGE_MASK;
+
+ if (iter->ptd[pt->llevel - 1] &&
+ addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+ addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+ *next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+ return hmm_pt_iter_ptdp(iter, addr);
+ }
+
+ /* First unprotect any directory that do not cover the address. */
+ for (i = pt->llevel; i >= 1; --i) {
+ if (!iter->ptd[i - 1])
+ continue;
+ if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+ addr <= hmm_pt_level_end(pt, iter->cur, i))
+ break;
+ hmm_pt_iter_unprotect_directory(iter, i);
+ }
+
+ /* Walk down to last level of the directory tree. */
+ for (; i < pt->llevel; ++i) {
+ struct page *ptd;
+ dma_addr_t pte, *ptdp;
+
+ rcu_read_lock();
+ ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+ pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+ if (!(pte & HMM_PDE_VALID)) {
+ rcu_read_unlock();
+ *next = min(*next,
+ hmm_pt_level_end(pt, iter->cur, i) + 1);
+ return NULL;
+ }
+ ptd = pfn_to_page(hmm_pde_pfn(pte));
+ /* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+ if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1) {
+ *next = min(*next,
+ hmm_pt_level_end(pt, iter->cur, i) + 1);
+ return NULL;
+ }
+ }
+
+ *next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+ return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_lookup);
+
+/* hmm_pt_iter_populate() - Allocate entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer (and allocate a new
+ * one if none exists) for a given address as well as the end address of that
+ * directory (address of the next directory). Usage pattern is :
+ *
+ * for (addr = start; addr < end;) {
+ * unsigned long next = end;
+ *
+ * ptep = hmm_pt_iter_populate(iter,addr,&next);
+ * if (!ptep) {
+ * // error handling.
+ * }
+ * for (; addr < next; addr += PAGE_SIZE, ptep++) {
+ * // Use ptep
+ * }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+ unsigned long addr,
+ unsigned long *next)
+{
+ dma_addr_t *ptdp = hmm_pt_iter_lookup(iter, addr, next);
+ struct hmm_pt *pt = iter->pt;
+ struct page *new = NULL;
+ int i;
+
+ if (ptdp)
+ return ptdp;
+
+ /* Populate directory tree structures. */
+ for (i = 1, iter->cur = addr; i <= pt->llevel; ++i) {
+ struct page *upper_ptd;
+ dma_addr_t *upper_ptdp;
+
+ if (iter->ptd[i - 1])
+ continue;
+
+ new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+ if (!new)
+ return NULL;
+
+ upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+ upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+ upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+ hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+ if (((*upper_ptdp) & HMM_PDE_VALID)) {
+ struct page *ptd;
+
+ ptd = pfn_to_page(hmm_pde_pfn(*upper_ptdp));
+ if (atomic_inc_not_zero(&ptd->_mapcount)) {
+ /* Already allocated by another thread. */
+ iter->ptd[i - 1] = ptd;
+ hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+ iter->ptdp[i - 1] = kmap(ptd);
+ continue;
+ }
+ /*
+ * This means we raced with removal of a dead directory; it
+ * is safe to overwrite the *upper_ptdp entry with the new entry.
+ */
+ }
+ /* Initialize struct page field for the directory. */
+ atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+ spin_lock_init(&new->ptl);
+#endif
+ *upper_ptdp = hmm_pde_from_pfn(page_to_pfn(new));
+ /* The pgd level is not refcounted. */
+ if (i > 1)
+ hmm_pt_directory_ref(pt, iter->ptd[i - 2]);
+ /* Unlock upper directory and map the new directory. */
+ hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+ iter->ptd[i - 1] = new;
+ iter->ptdp[i - 1] = kmap(new);
+ new = NULL;
+ }
+ if (new)
+ __free_page(new);
+ *next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+ return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_populate);
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will clean up the iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It will also free any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter)
+{
+ struct page *ptd, *tmp;
+ unsigned i;
+
+ for (i = iter->pt->llevel; i >= 1; --i) {
+ if (!iter->ptd[i - 1])
+ continue;
+ hmm_pt_iter_unprotect_directory(iter, i);
+ }
+
+ /* Avoid useless synchronize_rcu() if there is no directory to free. */
+ if (list_empty(&iter->dead_directories))
+ return;
+
+ /*
+ * Some iterators may have dereferenced a dead directory entry and looked
+ * up the struct page but not yet checked the reference count. As all of
+ * the above happens in an rcu read critical section we know that we need
+ * to wait for a grace period before being able to free any of the dead
+ * directory pages.
+ */
+ synchronize_rcu();
+ list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+ list_del(&ptd->lru);
+ atomic_set(&ptd->_mapcount, -1);
+ __free_page(ptd);
+ }
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
--
2.4.3

2016-03-08 19:47:43

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 05/29] HMM: introduce heterogeneous memory management v5.

This patch only introduces core HMM functions for registering a new
mirror and stopping a mirror, as well as HMM device registering and
unregistering.

The lifecycle of the HMM object is handled differently than that of the
mmu_notifier because, unlike mmu_notifier, there can be concurrent
calls from both mm code to HMM code and from device driver code to
HMM code. Moreover the lifetime of HMM can be uncorrelated from the
lifetime of the process that is being mirrored (the GPU might take
longer to clean up).
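
To make the registration flow concrete, here is a rough, untested sketch (not
part of the patch) of how a driver could wire these entry points together;
everything prefixed foo_ is hypothetical, the usual slab/hmm includes are
assumed, and only the hmm_device / hmm_mirror types and functions come from
this patch:

/* Driver-side wrapper around hmm_mirror; the layout is an assumption. */
struct foo_mirror {
        struct hmm_mirror mirror;
        /* ... device specific state ... */
};

static void foo_release(struct hmm_mirror *mirror)
{
        /* Stop every device thread still using this address space. */
}

static void foo_free(struct hmm_mirror *mirror)
{
        /* Called in atomic context: no sleeping here. */
        kfree(container_of(mirror, struct foo_mirror, mirror));
}

static const struct hmm_device_ops foo_hmm_ops = {
        .release = foo_release,
        .free = foo_free,
};

static struct hmm_device foo_hmm_device = {
        .ops = &foo_hmm_ops,
};

/* Called once at driver load. */
int foo_hmm_init(void)
{
        return hmm_device_register(&foo_hmm_device);
}

/* Called from the ioctl of the process that wants its mm mirrored. */
int foo_bind_current_process(struct foo_mirror **out)
{
        struct foo_mirror *fm = kzalloc(sizeof(*fm), GFP_KERNEL);
        int ret;

        if (!fm)
                return -ENOMEM;
        fm->mirror.device = &foo_hmm_device;
        ret = hmm_mirror_register(&fm->mirror);
        if (ret) {
                kfree(fm);
                return ret;
        }
        *out = fm;
        return 0;
}

Teardown would go through hmm_mirror_unregister(), which triggers the
->release() callback if it has not already happened.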

Changed since v1:
- Updated comment of hmm_device_register().

Changed since v2:
- Expose struct hmm for easy access to mm struct.
- Simplify hmm_mirror_register() arguments.
- Removed the device name.
- Refcount the mirror struct internally to HMM, allowing us to get
rid of the srcu and making the device driver callback error
handling simpler.
- Safe to call hmm_mirror_unregister() several times.
- Rework the mmu_notifier unregistration and release callback.

Changed since v3:
- Rework hmm_mirror lifetime rules.
- Synchronize with mmu_notifier srcu before dropping the mirror's last
reference in hmm_mirror_unregister().
- Use spinlock for device's mirror list.
- Export mirror ref/unref functions.
- English syntax fixes.

Changed since v4:
- Properly reference existing hmm struct if any.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
MAINTAINERS | 7 +
include/linux/hmm.h | 173 +++++++++++++++++++++
include/linux/mm.h | 11 ++
include/linux/mm_types.h | 14 ++
kernel/fork.c | 2 +
mm/Kconfig | 12 ++
mm/Makefile | 1 +
mm/hmm.c | 381 +++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 601 insertions(+)
create mode 100644 include/linux/hmm.h
create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3995455..0fc4c5f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5039,6 +5039,13 @@ F: include/uapi/linux/if_hippi.h
F: net/802/hippi.c
F: drivers/net/hippi/

+HMM - Heterogeneous Memory Management
+M: Jérôme Glisse <[email protected]>
+L: [email protected]
+S: Maintained
+F: mm/hmm.c
+F: include/linux/hmm.h
+
HOST AP DRIVER
M: Jouni Malinen <[email protected]>
L: [email protected] (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..b559c0b
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,173 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell this provides
+ * an API to mirror a process address space on a device which has its own mmu,
+ * using its own page table for the process. It supports everything except
+ * special vmas.
+ *
+ * Mandatory hardware features :
+ * - An mmu with pagetable.
+ * - Read only flag per cpu page.
+ * - Page fault, i.e. hardware must stop and wait for the kernel to service the fault.
+ *
+ * Optional hardware features :
+ * - Dirty bit per cpu page.
+ * - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It supports migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_mirror;
+struct hmm;
+
+
+/* hmm_device - Each device must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* struct hmm_device_operations - HMM device operation callback
+ */
+struct hmm_device_ops {
+ /* release() - mirror must stop using the address space.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * When this is called, the device driver must kill all device threads using
+ * this mirror. It is called either from :
+ * - mm dying (all process using this mm exiting).
+ * - hmm_mirror_unregister() (if no other thread holds a reference)
+ * - outcome of some device error reported by any of the device
+ * callback against that mirror.
+ */
+ void (*release)(struct hmm_mirror *mirror);
+
+ /* free() - mirror can be freed.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * When this is called, the device driver can free the underlying memory
+ * associated with that mirror. Note this is called from atomic context
+ * so the device driver callback can not sleep.
+ */
+ void (*free)(struct hmm_mirror *mirror);
+};
+
+
+/* struct hmm - per mm_struct HMM states.
+ *
+ * @mm: The mm struct this hmm is associated with.
+ * @mirrors: List of all mirror for this mm (one per device).
+ * @vm_end: Last valid address for this mm (exclusive).
+ * @kref: Reference counter.
+ * @rwsem: Serialize the mirror list modifications.
+ * @mmu_notifier: The mmu_notifier of this mm.
+ * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ *
+ * Device driver must not access this structure other than for getting the
+ * mm pointer.
+ */
+struct hmm {
+ struct mm_struct *mm;
+ struct hlist_head mirrors;
+ unsigned long vm_end;
+ struct kref kref;
+ struct rw_semaphore rwsem;
+ struct mmu_notifier mmu_notifier;
+ struct rcu_head rcu;
+};
+
+
+/* struct hmm_device - per device HMM structure
+ *
+ * @dev: Linux device structure pointer.
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @lock: Lock protecting mirrors list.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once per linux device).
+ */
+struct hmm_device {
+ struct device *dev;
+ const struct hmm_device_ops *ops;
+ struct list_head mirrors;
+ spinlock_t lock;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process can
+ * be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm HMM structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @kref: Reference counter (private to HMM do not use).
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device can
+ * mirror several different address spaces, and the same address space can be
+ * mirrored by different devices.
+ */
+struct hmm_mirror {
+ struct hmm_device *device;
+ struct hmm *hmm;
+ struct kref kref;
+ struct list_head dlist;
+ struct hlist_node mlist;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+void hmm_mirror_unref(struct hmm_mirror **mirror);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 619564b..f312210 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2386,5 +2386,16 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif

+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+ mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 624b78b..a9b51f7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,10 @@
#include <asm/page.h>
#include <asm/mmu.h>

+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
@@ -473,6 +477,16 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_HMM
+ /*
+ * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+ * keep a refcount on the mm struct as well as to forbid registering hmm
+ * on a dying mm.
+ *
+ * This field is set with mmap_sem held in write mode.
+ */
+ struct hmm *hmm;
+#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 93061d9..d3911a0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
#include <linux/binfmts.h>
#include <linux/mman.h>
#include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmacache.h>
@@ -613,6 +614,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm_init_aio(mm);
mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
+ hmm_mm_init(mm);
clear_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index f2c1a07..2e4686c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -671,3 +671,15 @@ config NR_ZONES_EXTENDED

config FRAME_VECTOR
bool
+
+config HMM
+ bool "Enable heterogeneous memory management (HMM)"
+ depends on MMU
+ select MMU_NOTIFIER
+ default n
+ help
+ Heterogeneous memory management provides infrastructure for a device
+ to mirror a process address space into a hardware mmu or into anything
+ supporting pagefault-like events.
+
+ If unsure, say N to disable hmm.
diff --git a/mm/Makefile b/mm/Makefile
index ddeb632..a8255cf 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -85,3 +85,4 @@ obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..8d861c4
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,381 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as allowing migration of data between system memory and device memory,
+ * referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+
+#include "internal.h"
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+
+/* hmm - core HMM functions.
+ *
+ * Core HMM functions that deal with all the process mm activities.
+ */
+
+static int hmm_init(struct hmm *hmm)
+{
+ hmm->mm = current->mm;
+ hmm->vm_end = TASK_SIZE;
+ kref_init(&hmm->kref);
+ INIT_HLIST_HEAD(&hmm->mirrors);
+ init_rwsem(&hmm->rwsem);
+
+ /* register notifier */
+ hmm->mmu_notifier.ops = &hmm_notifier_ops;
+ return __mmu_notifier_register(&hmm->mmu_notifier, current->mm);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+ struct hmm_mirror *tmp;
+
+ down_write(&hmm->rwsem);
+ hlist_for_each_entry(tmp, &hmm->mirrors, mlist)
+ if (tmp->device == mirror->device) {
+ /* Same device can mirror only once. */
+ up_write(&hmm->rwsem);
+ return -EINVAL;
+ }
+ hlist_add_head(&mirror->mlist, &hmm->mirrors);
+ hmm_mirror_ref(mirror);
+ up_write(&hmm->rwsem);
+
+ return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+ if (!hmm || !kref_get_unless_zero(&hmm->kref))
+ return NULL;
+ return hmm;
+}
+
+static void hmm_destroy_delayed(struct rcu_head *rcu)
+{
+ struct hmm *hmm;
+
+ hmm = container_of(rcu, struct hmm, rcu);
+ kfree(hmm);
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+ struct hmm *hmm;
+
+ hmm = container_of(kref, struct hmm, kref);
+ BUG_ON(!hlist_empty(&hmm->mirrors));
+
+ down_write(&hmm->mm->mmap_sem);
+ /* A new hmm might have been registered before reaching this point. */
+ if (hmm->mm->hmm == hmm)
+ hmm->mm->hmm = NULL;
+ up_write(&hmm->mm->mmap_sem);
+
+ mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+ mmu_notifier_call_srcu(&hmm->rcu, &hmm_destroy_delayed);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+ if (hmm)
+ kref_put(&hmm->kref, hmm_destroy);
+ return NULL;
+}
+
+
+/* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
+ *
+ * HMM uses mmu notifiers to track changes made to the process address space.
+ */
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct hmm *hmm;
+
+ hmm = hmm_ref(container_of(mn, struct hmm, mmu_notifier));
+ if (!hmm)
+ return;
+
+ down_write(&hmm->rwsem);
+ while (hmm->mirrors.first) {
+ struct hmm_mirror *mirror;
+
+ /*
+ * Here we are holding the mirror reference from the mirror
+ * list. As list removal is synchronized through rwsem, no
+ * other thread can assume it holds that reference.
+ */
+ mirror = hlist_entry(hmm->mirrors.first,
+ struct hmm_mirror,
+ mlist);
+ hlist_del_init(&mirror->mlist);
+ up_write(&hmm->rwsem);
+
+ mirror->device->ops->release(mirror);
+ hmm_mirror_unref(&mirror);
+
+ down_write(&hmm->rwsem);
+ }
+ up_write(&hmm->rwsem);
+
+ hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+ .release = hmm_notifier_release,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A process
+ * can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through the hmm callback) or provide helper functions
+ * used by the device driver to fault in a range of memory in the device page
+ * table.
+ */
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+ if (!mirror || !kref_get_unless_zero(&mirror->kref))
+ return NULL;
+ return mirror;
+}
+EXPORT_SYMBOL(hmm_mirror_ref);
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+ struct hmm_device *device;
+ struct hmm_mirror *mirror;
+
+ mirror = container_of(kref, struct hmm_mirror, kref);
+ device = mirror->device;
+
+ hmm_unref(mirror->hmm);
+
+ spin_lock(&device->lock);
+ list_del_init(&mirror->dlist);
+ device->ops->free(mirror);
+ spin_unlock(&device->lock);
+}
+
+void hmm_mirror_unref(struct hmm_mirror **mirror)
+{
+ struct hmm_mirror *tmp = mirror ? *mirror : NULL;
+
+ if (tmp) {
+ *mirror = NULL;
+ kref_put(&tmp->kref, hmm_mirror_destroy);
+ }
+}
+EXPORT_SYMBOL(hmm_mirror_unref);
+
+/* hmm_mirror_register() - register mirror against current process for a device.
+ *
+ * @mirror: The mirror struct being registered.
+ * Returns: 0 on success or -ENOMEM, -EINVAL on error.
+ *
+ * Call this when a device driver wants to start mirroring a process address
+ * space. The HMM shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The task the device driver wants to mirror must be current !
+ *
+ * Only one mirror per mm and hmm_device can be created; it will return -EINVAL
+ * if the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror)
+{
+ struct mm_struct *mm = current->mm;
+ struct hmm *hmm = NULL;
+ int ret = 0;
+
+ /* Sanity checks. */
+ BUG_ON(!mirror);
+ BUG_ON(!mirror->device);
+ BUG_ON(!mm);
+
+ /*
+ * Initialize the mirror struct fields, the mlist init and del dance is
+ * necessary to make the error path easier for driver and for hmm.
+ */
+ kref_init(&mirror->kref);
+ INIT_HLIST_NODE(&mirror->mlist);
+ INIT_LIST_HEAD(&mirror->dlist);
+ spin_lock(&mirror->device->lock);
+ list_add(&mirror->dlist, &mirror->device->mirrors);
+ spin_unlock(&mirror->device->lock);
+
+ down_write(&mm->mmap_sem);
+
+ hmm = hmm_ref(mm->hmm);
+ if (hmm == NULL) {
+ /* no hmm registered yet so register one */
+ hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+ if (hmm == NULL) {
+ up_write(&mm->mmap_sem);
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ ret = hmm_init(hmm);
+ if (ret) {
+ up_write(&mm->mmap_sem);
+ kfree(hmm);
+ goto error;
+ }
+
+ mm->hmm = hmm;
+ }
+
+ mirror->hmm = hmm;
+ ret = hmm_add_mirror(hmm, mirror);
+ up_write(&mm->mmap_sem);
+ if (ret) {
+ mirror->hmm = NULL;
+ hmm_unref(hmm);
+ goto error;
+ }
+ return 0;
+
+error:
+ spin_lock(&mirror->device->lock);
+ list_del_init(&mirror->dlist);
+ spin_unlock(&mirror->device->lock);
+ return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_kill(struct hmm_mirror *mirror)
+{
+ struct hmm_device *device = mirror->device;
+ struct hmm *hmm = hmm_ref(mirror->hmm);
+
+ if (!hmm)
+ return;
+
+ down_write(&hmm->rwsem);
+ if (!hlist_unhashed(&mirror->mlist)) {
+ hlist_del_init(&mirror->mlist);
+ up_write(&hmm->rwsem);
+ device->ops->release(mirror);
+ hmm_mirror_unref(&mirror);
+ } else
+ up_write(&hmm->rwsem);
+
+ hmm_unref(hmm);
+}
+
+/* hmm_mirror_unregister() - unregister a mirror.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ *
+ * A driver can call this function when it wants to stop mirroring a process.
+ * This will trigger a call to the ->release() callback if it did not already
+ * happen.
+ *
+ * Note that the caller must hold a reference on the mirror.
+ *
+ * THIS CAN NOT BE CALLED FROM THE device->release() CALLBACK OR IT WILL DEADLOCK.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+ if (mirror == NULL)
+ return;
+
+ hmm_mirror_kill(mirror);
+ mmu_notifier_synchronize();
+ hmm_mirror_unref(&mirror);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* hmm_device_register() - register a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EINVAL otherwise.
+ *
+ *
+ * Call when a device driver wants to register itself with HMM. A device driver
+ * must only register once.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+ /* sanity check */
+ BUG_ON(!device);
+ BUG_ON(!device->ops);
+ BUG_ON(!device->ops->release);
+
+ spin_lock_init(&device->lock);
+ INIT_LIST_HEAD(&device->mirrors);
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EBUSY otherwise.
+ *
+ * Call when a device driver wants to unregister itself from HMM. This checks
+ * that there are no active mirrors left and returns -EBUSY if there are.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+ spin_lock(&device->lock);
+ if (!list_empty(&device->mirrors)) {
+ spin_unlock(&device->lock);
+ return -EBUSY;
+ }
+ spin_unlock(&device->lock);
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
--
2.4.3

2016-03-08 19:47:53

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 09/29] HMM: add mm page table iterator helpers.

Because inside the mmu_notifier callback we do not have access to the
vma, nor do we know which lock we are holding (the mmap semaphore or
the i_mmap_lock), we can not rely on the regular page table walk (nor
do we want to, as we have to be careful not to split huge pages).

So this patch introduces a helper to iterate over the CPU page table
content in an efficient way for the situation we are in, which is: we
know that none of the page table entries can vanish from below us and
thus it is safe to walk the page table.

The only added value of the iterator is that it keeps the page table
entry level mapped across calls, which fits well with the HMM mirror
page table update code.
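
For illustration, here is a rough sketch of how a caller inside mm/hmm.c
could use the iterator (the caller must already hold the mmap semaphore
or the i_mmap_lock; the enclosing function is hypothetical):

    static void example_walk(struct mm_struct *mm,
                             unsigned long start, unsigned long end)
    {
        struct mm_pt_iter iter;
        unsigned long addr;

        mm_pt_iter_init(&iter, mm);
        for (addr = start; addr < end; addr += PAGE_SIZE) {
            struct page *page;

            /* NULL when no regular page backs the address. */
            page = mm_pt_iter_page(&iter, addr);
            if (!page)
                continue;
            /* Use the backing page, e.g. set_page_dirty(page). */
        }
        mm_pt_iter_fini(&iter);
    }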

Signed-off-by: Jérôme Glisse <[email protected]>
---
mm/hmm.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 101 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index a9bdab5..74e429a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -406,6 +406,107 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
};


+struct mm_pt_iter {
+ struct mm_struct *mm;
+ pte_t *ptep;
+ unsigned long addr;
+};
+
+static void mm_pt_iter_init(struct mm_pt_iter *pt_iter, struct mm_struct *mm)
+{
+ pt_iter->mm = mm;
+ pt_iter->ptep = NULL;
+ pt_iter->addr = -1UL;
+}
+
+static void mm_pt_iter_fini(struct mm_pt_iter *pt_iter)
+{
+ pte_unmap(pt_iter->ptep);
+ pt_iter->ptep = NULL;
+ pt_iter->addr = -1UL;
+ pt_iter->mm = NULL;
+}
+
+static inline bool mm_pt_iter_in_range(struct mm_pt_iter *pt_iter,
+ unsigned long addr)
+{
+ return (addr >= pt_iter->addr && addr < (pt_iter->addr + PMD_SIZE));
+}
+
+static struct page *mm_pt_iter_page(struct mm_pt_iter *pt_iter,
+ unsigned long addr)
+{
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+
+again:
+ /*
+ * What we are doing here is only valid if we hold either the mmap
+ * semaphore or the i_mmap_lock of the vma->address_space the address
+ * belongs to. Sadly, because we can not easily get the vma struct we
+ * can not sanity check that either of those locks is taken.
+ *
+ * We have to rely on people using this code knowing what they do.
+ */
+ if (mm_pt_iter_in_range(pt_iter, addr) && likely(pt_iter->ptep)) {
+ pte_t pte = *(pt_iter->ptep + pte_index(addr));
+ unsigned long pfn;
+
+ if (pte_none(pte) || !pte_present(pte))
+ return NULL;
+ if (unlikely(pte_special(pte)))
+ return NULL;
+
+ pfn = pte_pfn(pte);
+ if (is_zero_pfn(pfn))
+ return NULL;
+ return pfn_to_page(pfn);
+ }
+
+ if (pt_iter->ptep) {
+ pte_unmap(pt_iter->ptep);
+ pt_iter->ptep = NULL;
+ pt_iter->addr = -1UL;
+ }
+
+ pgdp = pgd_offset(pt_iter->mm, addr);
+ if (pgd_none_or_clear_bad(pgdp))
+ return NULL;
+ pudp = pud_offset(pgdp, addr);
+ if (pud_none_or_clear_bad(pudp))
+ return NULL;
+ pmdp = pmd_offset(pudp, addr);
+ /*
+ * Because we either have the mmap semaphore or the i_mmap_lock we know
+ * that the pmd can not vanish from under us, thus if the pmd exists then
+ * it is either a huge page or a valid pmd. It might also be in the
+ * transitory splitting state.
+ */
+ if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
+ return NULL;
+ if (pmd_trans_huge(*pmdp)) {
+ spinlock_t *ptl;
+
+ ptl = pmd_lock(pt_iter->mm, pmdp);
+ if (pmd_trans_huge(*pmdp)) {
+ struct page *page;
+
+ page = pmd_page(*pmdp) + pte_index(addr);
+ spin_unlock(ptl);
+ return page;
+ }
+ /* It was morphing from thp to regular, try again. */
+ spin_unlock(ptl);
+ goto again;
+ }
+ /* Regular pmd and it can not morph. */
+ pt_iter->ptep = pte_offset_map(pmdp, addr & PMD_MASK);
+ pt_iter->addr = addr & PMD_MASK;
+ goto again;
+}
+
+
/* hmm_mirror - per device mirroring functions.
*
* Each device that mirror a process has a uniq hmm_mirror struct. A process
--
2.4.3

2016-03-08 19:47:58

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 11/29] HMM: add discard range helper (to clear and free resources for a range).

A common use case is for a device driver to stop caring about a range
of addresses long before said range is munmapped by the userspace
program. To avoid keeping track of such ranges, provide a helper
function that frees the HMM resources for a range of addresses.

NOTE THAT THE DEVICE DRIVER MUST MAKE SURE THE HARDWARE WILL NO LONGER
ACCESS THE RANGE BEFORE CALLING THIS HELPER!
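
As a usage sketch (the driver-side structure and quiesce function are
hypothetical), a driver that is done with a mirrored buffer would do:

    static void your_driver_stop_using_range(struct your_hmm_mirror *m,
                                             unsigned long start,
                                             unsigned long end)
    {
        /* Make sure the device can no longer access [start, end). */
        your_device_quiesce_range(m, start, end);

        /* Then free HMM resources (page table entries, DMA mappings). */
        hmm_mirror_range_discard(&m->hmm_mirror, start, end);
    }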

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/hmm.h | 3 +++
mm/hmm.c | 24 ++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d819ec9..10e1558 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -265,6 +265,9 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
void hmm_mirror_unref(struct hmm_mirror **mirror);
int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+ unsigned long start,
+ unsigned long end);


#endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 7b6ba6a..548f0c5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -921,6 +921,30 @@ out:
}
EXPORT_SYMBOL(hmm_mirror_fault);

+/* hmm_mirror_range_discard() - discard a range of addresses.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to discard (inclusive).
+ * @end: End address of the range to discard (exclusive).
+ *
+ * Call when a device driver wants to stop mirroring a range of addresses and
+ * free any HMM resources associated with that range (including DMA mappings).
+ *
+ * THIS FUNCTION ASSUMES THAT THE DRIVER ALREADY STOPPED USING THE RANGE OF
+ * ADDRESSES AND THUS DOES NOT PERFORM ANY SYNCHRONIZATION OR UPDATE WITH THE
+ * DRIVER TO INVALIDATE SAID RANGE.
+ */
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm_event event;
+
+ hmm_event_init(&event, mirror->hmm, start, end, HMM_MUNMAP);
+ hmm_mirror_update_pt(mirror, &event, NULL);
+}
+EXPORT_SYMBOL(hmm_mirror_range_discard);
+
/* hmm_mirror_register() - register mirror against current process for a device.
*
* @mirror: The mirror struct being registered.
--
2.4.3

2016-03-08 19:48:05

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 10/29] HMM: use CPU page table during invalidation.

Once we store the DMA mapping inside the secondary page table we can
no longer easily find the page backing an address. Instead use the
CPU page table, which still has the proper information, except for
the invalidate_page() case, which is handled by using the page passed
by the mmu_notifier layer.
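
Condensed, the dirty-bit handling in the update path becomes (the
notifier_page name is illustrative for the page argument threaded down
from the mmu_notifier callbacks):

    if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
        hmm_pte_test_write(hmm_pte)) {
        struct page *page;

        /* Prefer the page handed to us by invalidate_page(), otherwise
         * look it up through the CPU page table iterator. */
        page = notifier_page ? notifier_page
                             : mm_pt_iter_page(&mm_iter, addr);
        if (page)
            set_page_dirty(page);
    }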

Signed-off-by: Jérôme Glisse <[email protected]>
---
mm/hmm.c | 53 +++++++++++++++++++++++++++++++++++------------------
1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 74e429a..7b6ba6a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -47,9 +47,11 @@
static struct mmu_notifier_ops hmm_notifier_ops;
static void hmm_mirror_kill(struct hmm_mirror *mirror);
static inline int hmm_mirror_update(struct hmm_mirror *mirror,
- struct hmm_event *event);
+ struct hmm_event *event,
+ struct page *page);
static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
- struct hmm_event *event);
+ struct hmm_event *event,
+ struct page *page);


/* hmm_event - use to track information relating to an event.
@@ -223,7 +225,9 @@ again:
}
}

-static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+static void hmm_update(struct hmm *hmm,
+ struct hmm_event *event,
+ struct page *page)
{
struct hmm_mirror *mirror;

@@ -236,7 +240,7 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
again:
down_read(&hmm->rwsem);
hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
- if (hmm_mirror_update(mirror, event)) {
+ if (hmm_mirror_update(mirror, event, page)) {
mirror = hmm_mirror_ref(mirror);
up_read(&hmm->rwsem);
hmm_mirror_kill(mirror);
@@ -304,7 +308,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)

/* Make sure everything is unmapped. */
hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
- hmm_mirror_update(mirror, &event);
+ hmm_mirror_update(mirror, &event, NULL);

mirror->device->ops->release(mirror);
hmm_mirror_unref(&mirror);
@@ -338,9 +342,10 @@ static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
*etype = HMM_NONE;
}

-static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- const struct mmu_notifier_range *range)
+static void hmm_notifier_invalidate(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ struct page *page,
+ const struct mmu_notifier_range *range)
{
struct hmm_event event;
unsigned long start = range->start, end = range->end;
@@ -382,7 +387,14 @@ static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,

hmm_event_init(&event, hmm, start, end, event.etype);

- hmm_update(hmm, &event);
+ hmm_update(hmm, &event, page);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
+{
+ hmm_notifier_invalidate(mn, mm, NULL, range);
}

static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
@@ -396,7 +408,7 @@ static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
range.start = addr & PAGE_MASK;
range.end = range.start + PAGE_SIZE;
range.event = mmu_event;
- hmm_notifier_invalidate_range_start(mn, mm, &range);
+ hmm_notifier_invalidate(mn, mm, page, &range);
}

static struct mmu_notifier_ops hmm_notifier_ops = {
@@ -554,23 +566,27 @@ void hmm_mirror_unref(struct hmm_mirror **mirror)
EXPORT_SYMBOL(hmm_mirror_unref);

static inline int hmm_mirror_update(struct hmm_mirror *mirror,
- struct hmm_event *event)
+ struct hmm_event *event,
+ struct page *page)
{
struct hmm_device *device = mirror->device;
int ret = 0;

ret = device->ops->update(mirror, event);
- hmm_mirror_update_pt(mirror, event);
+ hmm_mirror_update_pt(mirror, event, page);
return ret;
}

static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
- struct hmm_event *event)
+ struct hmm_event *event,
+ struct page *page)
{
unsigned long addr;
struct hmm_pt_iter iter;
+ struct mm_pt_iter mm_iter;

hmm_pt_iter_init(&iter, &mirror->pt);
+ mm_pt_iter_init(&mm_iter, mirror->hmm->mm);
for (addr = event->start; addr != event->end;) {
unsigned long next = event->end;
dma_addr_t *hmm_pte;
@@ -591,10 +607,10 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
continue;
if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
hmm_pte_test_write(hmm_pte)) {
- struct page *page;
-
- page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
- set_page_dirty(page);
+ page = page ? : mm_pt_iter_page(&mm_iter, addr);
+ if (page)
+ set_page_dirty(page);
+ page = NULL;
}
*hmm_pte &= event->pte_mask;
if (hmm_pte_test_valid_pfn(hmm_pte))
@@ -604,6 +620,7 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
hmm_pt_iter_directory_unlock(&iter);
}
hmm_pt_iter_fini(&iter);
+ mm_pt_iter_fini(&mm_iter);
}

static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
@@ -1004,7 +1021,7 @@ static void hmm_mirror_kill(struct hmm_mirror *mirror)

/* Make sure everything is unmapped. */
hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
- hmm_mirror_update(mirror, &event);
+ hmm_mirror_update(mirror, &event, NULL);

device->ops->release(mirror);
hmm_mirror_unref(&mirror);
--
2.4.3

2016-03-08 19:48:18

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 15/29] HMM: add documentation explaining HMM internals and how to use it.

This adds documentation on how HMM works and a more in-depth view of how
it should be used by device driver writers.

Signed-off-by: Jérôme Glisse <[email protected]>
---
Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 219 insertions(+)
create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..febed50
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,219 @@
+Heterogeneous Memory Management (HMM)
+-------------------------------------
+
+The raison d'être of HMM is to provide a common API for device drivers that
+want to mirror a process address space on their device and/or migrate system
+memory to device memory. A device driver can decide to only use one aspect of
+HMM (mirroring or memory migration); for instance some devices can directly
+access the process address space through hardware (for instance PCIe ATS/PASID)
+but still want to benefit from the memory migration capabilities HMM offers.
+
+While HMM relies on existing kernel infrastructure (namely mmu_notifier), some
+of its features (memory migration, atomic access) require integration with
+core mm kernel code. Having HMM as the common intermediary is more appealing
+than having each device driver hooking itself into the common mm code.
+
+Moreover, HMM as a layer allows integration with the DMA API or page reclaim.
+
+
+Mirroring address space on the device:
+--------------------------------------
+
+Devices that can't transparently access the process address space directly need
+to mirror the CPU page table into their own page table. HMM helps keep the
+device page table synchronized with the CPU page table. It is not expected that
+the device will fully mirror the CPU page table, but only mirror regions that
+are actively accessed by the device. For that reason HMM only helps populating
+and synchronizing the device page table for ranges that the device driver
+explicitly asks for.
+
+Mirroring an address space inside the device page table is easy with HMM:
+
+ /* Create a mirror for the current process for your device. */
+ your_hmm_mirror->hmm_mirror.device = your_hmm_device;
+ hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
+
+ ...
+
+ /* Mirror memory (in read mode) between addressA and addressB */
+ your_hmm_event->hmm_event.start = addressA;
+ your_hmm_event->hmm_event.end = addressB;
+ your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
+ hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+ /* HMM calls back into your driver through the ->update() callback. During
+ * the callback use the HMM page table to populate the device page table. You
+ * can only use the HMM page table to populate the device page table for
+ * the specified range during the ->update() callback; at any other point in
+ * time the HMM page table content should be assumed to be undefined.
+ */
+ your_hmm_device->update(mirror, event);
+
+ ...
+
+ /* Process is quitting or the device is done: stop the mirroring and clean up. */
+ hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
+ /* Device driver can free your_hmm_mirror */
+
+
+HMM mirror page table:
+----------------------
+
+Each hmm_mirror object is associated with a mirror page table that HMM keeps
+synchronized with the CPU page table by using the mmu_notifier API. HMM uses
+its own generic page table format because it needs to store DMA addresses,
+which are bigger than long on some architectures, and needs more flags per
+entry than the radix tree allows.
+
+The HMM page table mostly mirrors the x86 page table layout. A page holds a
+global directory and each entry points to a lower level directory. Unlike the
+regular CPU page table, directory levels are more aggressively freed and removed
+from the HMM mirror page table. This means the device driver needs to use the
+HMM helpers and to follow the directives on when and how to access the mirror
+page table. HMM uses the per-page spinlock of the directory page to synchronize
+updates of a directory, ie updates can happen on different directories concurrently.
+
+As a rule the mirror page table can only be accessed by the device driver from
+one of the HMM device callbacks. Any access from outside a callback is illegal
+and gives undefined results.
+
+Accessing the mirror page table from a device callback must go through the HMM
+page table helpers. Accessing the entry for a given address looks like:
+
+ /* Initialize a HMM page table iterator. */
+ struct hmm_pt_iter iter;
+ hmm_pt_iter_init(&iter, &mirror->pt);
+
+ /* Get pointer to HMM page table entry for a given address. */
+ dma_addr_t *hmm_pte;
+ hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+
+If there is no valid entry directory for the given address then hmm_pte is
+NULL. If there is a valid entry directory then you can access the hmm_pte, and
+the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
+the same iter struct for a different address or call hmm_pt_iter_fini().
+
+While the HMM page table entry pointer stays valid you can only modify the
+value it points to by using one of the HMM helpers (hmm_pte_*()), as other
+threads might be updating the same entry concurrently. The device driver only
+needs to update an HMM page table entry to set the dirty bit, so the driver
+should only be using hmm_pte_set_dirty().
+
+Similarly, to extract information the device driver should use one of the
+helpers like hmm_pte_dma_addr() or hmm_pte_pfn() (the latter if HMM is not
+doing the DMA mapping, which is a device driver initialization parameter).
+
+
+Migrating system memory to device memory:
+-----------------------------------------
+
+Devices like discrete GPUs often have their own local memory which offers
+bigger bandwidth and lower latency to the GPU than access to system memory.
+This local memory is not necessarily accessible by the CPU. Device local memory
+will remain relevant for the foreseeable future as the bandwidth of GPU memory
+keeps increasing faster than the bandwidth of system memory and as the latency
+of PCIe does not decrease.
+
+Thus to maximize the use of devices like GPUs, programs need to use the device
+memory. Userspace APIs want to make this as transparent as it can be, so that
+there is no need for complex modifications of applications.
+
+Transparent use of device memory for a range of addresses of a process requires
+core mm code modifications. Adding a new memory zone for device memory did not
+make sense given that such memory is often accessible by the device only. This
+is why we decided to use a special kind of swap: migrated memory is marked as a
+special swap entry inside the CPU page table.
+
+While HMM handles the migration process, it does not decide what range to
+migrate or when. The decision to perform such a migration is under the control
+of the device driver. Migration back to system memory happens either because
+the CPU tries to access the memory or because the device driver decided to
+migrate the memory back.
+
+
+ /* Migrate system memory between addressA and addressB to device memory. */
+ your_hmm_event->hmm_event.start = addressA;
+ your_hmm_event->hmm_event.end = addressB;
+ your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
+ hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+ /* HMM calls back into your driver through the ->copy_to_device() callback.
+ * The device driver must allocate device memory, DMA system memory to device
+ * memory, update the device page table to point to device memory and
+ * return. See hmm.h for detailed instructions and how failures are handled.
+ */
+ your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
+
+
+Right now HMM only supports migrating anonymous private memory. Migration of
+shared memory and, more generally, file mapped memory is on the roadmap.
+
+
+Locking consideration and overall design:
+-----------------------------------------
+
+As a rule HMM will handle proper locking on behalf of the device driver; as
+such the device driver does not need to take any mm lock before calling into
+the HMM code.
+
+HMM is also responsible for the hmm_device and hmm_mirror object lifetimes. The
+device driver can only free those after calling hmm_device_unregister() or
+hmm_mirror_unregister() respectively.
+
+The locks inside any of the HMM structures must never be used by the device
+driver. They are intended to be used by HMM code only. Below is a short
+description of the 3 main locks that exist for HMM internal use, for
+educational purposes only.
+
+Each process mm has one and only one struct hmm associated with it. Each hmm
+struct can be used by several different mirrors. There is one and only one
+mirror per mm and device pair. So in essence the hmm struct is the core that
+dispatches everything to every single mirror, each of them corresponding to a
+specific device. The list of mirrors for an hmm struct is protected by a
+semaphore as it sees mostly read access.
+
+Each time a device faults a range of addresses it calls hmm_mirror_fault(). HMM
+keeps track, inside the hmm struct, of each range currently being faulted. It
+does that so it can synchronize with any CPU page table update. If there is a
+CPU page table update then a callback through mmu_notifier will happen and HMM
+will try to interrupt the device page faults that conflict (ie whose address
+range overlaps with the range being updated) and wait for them to back off.
+This ensures that at no point in time does the device driver see transient page
+table information. The list of active faults is protected by a spinlock; queries
+on that list should be short and quick (we haven't gathered enough statistics on
+that side yet to have a good idea of the average access pattern).
+
+Each device driver wanting to use HMM must register one and only one hmm_device
+struct per physical device with HMM. The hmm_device struct has pointers to the
+device driver callbacks and keeps track of active mirrors for a given device.
+The active mirrors list is protected by a spinlock.
+
+
+Future work:
+------------
+
+Improved atomic access by the device to system memory. Some platform buses
+(PCIe) offer a limited number of atomic memory operations, and some platforms
+do not even have any kind of atomic memory operation by a device. In order to
+allow such atomic operations we want to map the page read only for the CPU
+while the device performs its operation. For this we need a new case inside the
+CPU write fault code path to synchronize with the device.
+
+We want to allow a program to lock a range of memory inside device memory and
+forbid CPU access while the memory is locked inside the device. Any CPU access
+to the locked range would result in SIGBUS. We think that madvise() would be
+the right syscall into which we could plug that feature.
+
+In order to minimize kernel memory consumption and the overhead of DMA mapping,
+we want to introduce a new DMA API that allows managing mappings on an IOMMU
+directory page basis. This would allow mapping/unmapping/updating DMA mappings
+in bulk and minimize IOMMU update and flushing overhead. Moreover this would
+allow improving IOMMU bad access reporting for DMA addresses inside those directories.
+
+Because updates to the device page table might require "heavy" synchronization
+with the device, the mmu_notifier callback might have to sleep while HMM is
+waiting for the device driver to report device page table update completion.
+This is especially bad if it happens during page reclaim, as it might bring
+the system to a pause. We want to mitigate this, either by maintaining a new
+intermediate lru level in which we put pages actively mirrored by a device,
+or by some other mechanism. For the time being we advise that device drivers
+that use HMM explicitly document this corner case so that users are aware that
+this can happen if there is memory pressure.
--
2.4.3

2016-03-08 19:48:29

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 16/29] fork: pass the dst vma to copy_page_range() and its sub-functions.

For HMM we will need to resort to the old way of allocating a new page
for anonymous memory when that anonymous memory has been migrated to
device memory.

This does not impact any process that does not use HMM through some
device driver. Only processes that migrate anonymous memory to device
memory with HMM will have to copy migrated pages on fork.

We do not expect this to be a common or advised thing to do, so we
resort to the simpler solution of allocating a new page. If this kind
of usage turns out to be important we will revisit ways to achieve
COW even for remote memory.

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/mm.h | 5 +++--
kernel/fork.c | 2 +-
mm/memory.c | 33 +++++++++++++++++++++------------
3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f312210..c5c062e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1169,8 +1169,9 @@ int walk_page_range(unsigned long addr, unsigned long end,
int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);
void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
- struct vm_area_struct *vma);
+int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *vma);
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/kernel/fork.c b/kernel/fork.c
index d3911a0..e8d0c14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -508,7 +508,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
rb_parent = &tmp->vm_rb;

mm->map_count++;
- retval = copy_page_range(mm, oldmm, mpnt);
+ retval = copy_page_range(mm, oldmm, tmp, mpnt);

if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
diff --git a/mm/memory.c b/mm/memory.c
index 532d80f..19de9ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -874,8 +874,10 @@ out_set_pte:
}

static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
@@ -936,9 +938,12 @@ again:
return 0;
}

-static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+static inline int copy_pmd_range(struct mm_struct *dst_mm,
+ struct mm_struct *src_mm,
+ pud_t *dst_pud, pud_t *src_pud,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
pmd_t *src_pmd, *dst_pmd;
unsigned long next;
@@ -963,15 +968,18 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
if (pmd_none_or_clear_bad(src_pmd))
continue;
if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
- vma, addr, next))
+ dst_vma, vma, addr, next))
return -ENOMEM;
} while (dst_pmd++, src_pmd++, addr = next, addr != end);
return 0;
}

-static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+static inline int copy_pud_range(struct mm_struct *dst_mm,
+ struct mm_struct *src_mm,
+ pgd_t *dst_pgd, pgd_t *src_pgd,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
pud_t *src_pud, *dst_pud;
unsigned long next;
@@ -985,14 +993,15 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
if (pud_none_or_clear_bad(src_pud))
continue;
if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
- vma, addr, next))
+ dst_vma, vma, addr, next))
return -ENOMEM;
} while (dst_pud++, src_pud++, addr = next, addr != end);
return 0;
}

int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- struct vm_area_struct *vma)
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *vma)
{
pgd_t *src_pgd, *dst_pgd;
unsigned long next;
@@ -1046,7 +1055,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (pgd_none_or_clear_bad(src_pgd))
continue;
if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
- vma, addr, next))) {
+ dst_vma, vma, addr, next))) {
ret = -ENOMEM;
break;
}
--
2.4.3

2016-03-08 19:48:38

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 17/29] HMM: add special swap filetype for memory migrated to device v2.

From: Jerome Glisse <[email protected]>

When migrating anonymous memory from system memory to device memory,
CPU ptes are replaced with special HMM swap entries so that page fault,
get_user_pages() (gup), fork, ... are properly redirected to HMM helpers.

This patch only adds the new swap type entry and hooks the HMM helper
functions inside the page fault and fork code paths.
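
For illustration, a minimal sketch of how core code can recognize such
an entry (the wrapper function is hypothetical; the helpers are the ones
added by this patch):

    static bool example_pte_is_on_device(pte_t pte)
    {
        swp_entry_t entry;

        /* Present or empty ptes are never HMM entries. */
        if (pte_none(pte) || pte_present(pte))
            return false;

        entry = pte_to_swp_entry(pte);
        return is_hmm_entry(entry);
    }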

Changed since v1:
- Fix name of HMM CPU page fault function.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
include/linux/hmm.h | 34 ++++++++++++++++++++++++++++++++++
include/linux/swap.h | 13 ++++++++++++-
include/linux/swapops.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
mm/hmm.c | 21 +++++++++++++++++++++
mm/memory.c | 22 ++++++++++++++++++++++
5 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4bc132a..7c66513 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -272,6 +272,40 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
unsigned long start,
unsigned long end);

+int hmm_handle_cpu_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pmd_t *pmdp, unsigned long addr,
+ unsigned flags, pte_t orig_pte);
+
+int hmm_mm_fork(struct mm_struct *src_mm,
+ struct mm_struct *dst_mm,
+ struct vm_area_struct *dst_vma,
+ pmd_t *dst_pmd,
+ unsigned long start,
+ unsigned long end);
+
+#else /* CONFIG_HMM */
+
+static inline int hmm_handle_cpu_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pmd_t *pmdp, unsigned long addr,
+ unsigned flags, pte_t orig_pte)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+static inline int hmm_mm_fork(struct mm_struct *src_mm,
+ struct mm_struct *dst_mm,
+ struct vm_area_struct *dst_vma,
+ pmd_t *dst_pmd,
+ unsigned long start,
+ unsigned long end)
+{
+ BUG();
+ return -ENOMEM;
+}

#endif /* CONFIG_HMM */
+
+
#endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b14a2bb..336e0a1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
#define SWP_HWPOISON_NUM 0
#endif

+/*
+ * HMM (heterogeneous memory management) used when data is in remote memory.
+ */
+#ifdef CONFIG_HMM
+#define SWP_HMM_NUM 1
+#define SWP_HMM (MAX_SWAPFILES + SWP_MIGRATION_NUM + SWP_HWPOISON_NUM)
+#else
+#define SWP_HMM_NUM 0
+#endif
+
#define MAX_SWAPFILES \
- ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - \
+ SWP_HWPOISON_NUM - SWP_HMM_NUM)

/*
* Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..8c6ba9f 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -227,7 +227,7 @@ static inline void num_poisoned_pages_inc(void)
}
#endif

-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || defined(CONFIG_HMM)
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;
@@ -239,4 +239,45 @@ static inline int non_swap_entry(swp_entry_t entry)
}
#endif

+#ifdef CONFIG_HMM
+static inline swp_entry_t make_hmm_entry(void)
+{
+ /* We do not store anything inside the CPU page table entry (pte). */
+ return swp_entry(SWP_HMM, 0);
+}
+
+static inline swp_entry_t make_hmm_entry_locked(void)
+{
+ /* We do not store anything inside the CPU page table entry (pte). */
+ return swp_entry(SWP_HMM, 1);
+}
+
+static inline swp_entry_t make_hmm_entry_poisonous(void)
+{
+ /* We do not store anything inside the CPU page table entry (pte). */
+ return swp_entry(SWP_HMM, 2);
+}
+
+static inline int is_hmm_entry(swp_entry_t entry)
+{
+ return (swp_type(entry) == SWP_HMM);
+}
+
+static inline int is_hmm_entry_locked(swp_entry_t entry)
+{
+ return (swp_type(entry) == SWP_HMM) && (swp_offset(entry) == 1);
+}
+
+static inline int is_hmm_entry_poisonous(swp_entry_t entry)
+{
+ return (swp_type(entry) == SWP_HMM) && (swp_offset(entry) == 2);
+}
+#else /* CONFIG_HMM */
+static inline int is_hmm_entry(swp_entry_t swp)
+{
+ return 0;
+}
+#endif /* CONFIG_HMM */
+
+
#endif /* _LINUX_SWAPOPS_H */
diff --git a/mm/hmm.c b/mm/hmm.c
index ad44325..4c0d2c0 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -419,6 +419,27 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
};


+int hmm_handle_cpu_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pmd_t *pmdp, unsigned long addr,
+ unsigned flags, pte_t orig_pte)
+{
+ return VM_FAULT_SIGBUS;
+}
+EXPORT_SYMBOL(hmm_handle_cpu_fault);
+
+int hmm_mm_fork(struct mm_struct *src_mm,
+ struct mm_struct *dst_mm,
+ struct vm_area_struct *dst_vma,
+ pmd_t *dst_pmd,
+ unsigned long start,
+ unsigned long end)
+{
+ return -ENOMEM;
+}
+EXPORT_SYMBOL(hmm_mm_fork);
+
+
struct mm_pt_iter {
struct mm_struct *mm;
pte_t *ptep;
diff --git a/mm/memory.c b/mm/memory.c
index 19de9ba..3cb3653 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -54,6 +54,7 @@
#include <linux/writeback.h>
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
#include <linux/elf.h>
@@ -882,9 +883,11 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
+ unsigned cnt_hmm_entry = 0;
int progress = 0;
int rss[NR_MM_COUNTERS];
swp_entry_t entry = (swp_entry_t){0};
+ unsigned long start;

again:
init_rss_vec(rss);
@@ -898,6 +901,7 @@ again:
orig_src_pte = src_pte;
orig_dst_pte = dst_pte;
arch_enter_lazy_mmu_mode();
+ start = addr;

do {
/*
@@ -914,6 +918,12 @@ again:
progress++;
continue;
}
+ if (unlikely(!pte_present(*src_pte))) {
+ entry = pte_to_swp_entry(*src_pte);
+
+ if (is_hmm_entry(entry))
+ cnt_hmm_entry++;
+ }
entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
vma, addr, rss);
if (entry.val)
@@ -928,6 +938,15 @@ again:
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();

+ if (cnt_hmm_entry) {
+ int ret;
+
+ ret = hmm_mm_fork(src_mm, dst_mm, dst_vma,
+ dst_pmd, start, end);
+ if (ret)
+ return ret;
+ }
+
if (entry.val) {
if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
return -ENOMEM;
@@ -2489,6 +2508,9 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
migration_entry_wait(mm, pmd, address);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
+ } else if (is_hmm_entry(entry)) {
+ ret = hmm_handle_cpu_fault(mm, vma, pmd, address,
+ flags, orig_pte);
} else {
print_bad_pte(vma, address, orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
--
2.4.3

2016-03-08 19:48:47

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 19/29] HMM: add new HMM page table flag (select flag).

When migrating memory the same array of HMM page table entries might be
used with several different devices. Add a new select flag so the current
device driver callback can know which entries are selected for the device.
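
For illustration, a driver callback walking an array of HMM page table
entries would then skip entries not selected for it (a rough sketch; the
loop bounds and the device-programming step are placeholders):

    for (i = 0; i < npages; i++) {
        dma_addr_t pte = hmm_pte[i];

        /* Only act on entries selected for this device. */
        if (!hmm_pte_test_select(&pte))
            continue;
        /* ... program the device page table for entry i ... */
    }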

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/hmm_pt.h | 6 ++++--
mm/hmm.c | 5 ++++-
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index b017aa7..f745d6c 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -77,8 +77,9 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
#define HMM_PTE_VALID_DEV_BIT 0
#define HMM_PTE_VALID_DMA_BIT 1
#define HMM_PTE_VALID_PFN_BIT 2
-#define HMM_PTE_WRITE_BIT 3
-#define HMM_PTE_DIRTY_BIT 4
+#define HMM_PTE_SELECT 3
+#define HMM_PTE_WRITE_BIT 4
+#define HMM_PTE_DIRTY_BIT 5
/*
* Reserve some bits for device driver private flags. Note that thus can only
* be manipulated using the hmm_pte_*_bit() sets of helpers.
@@ -170,6 +171,7 @@ static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
HMM_PTE_BIT_HELPER(valid_dev, HMM_PTE_VALID_DEV_BIT)
HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(select, HMM_PTE_SELECT)
HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)

diff --git a/mm/hmm.c b/mm/hmm.c
index 4c0d2c0..a5706d2 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -749,6 +749,7 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
if (pmd_write(*pmdp))
hmm_pte_set_write(&hmm_pte[i]);
+ hmm_pte_set_select(&hmm_pte[i]);
} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
hmm_pt_iter_directory_unlock(iter);
mirror_fault->addr = addr;
@@ -825,6 +826,7 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
if (pte_write(*ptep))
hmm_pte_set_write(&hmm_pte[i]);
+ hmm_pte_set_select(&hmm_pte[i]);
} while (addr += PAGE_SIZE, ptep++, i++, addr != next);
hmm_pt_iter_directory_unlock(iter);
pte_unmap(ptep - 1);
@@ -916,7 +918,8 @@ static int hmm_mirror_dma_map(struct hmm_mirror *mirror,

again:
pte = ACCESS_ONCE(hmm_pte[i]);
- if (!hmm_pte_test_valid_pfn(&pte)) {
+ if (!hmm_pte_test_valid_pfn(&pte) ||
+ !hmm_pte_test_select(&pte)) {
if (!hmm_pte_test_valid_dma(&pte)) {
ret = -ENOENT;
break;
--
2.4.3

2016-03-08 19:49:04

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.

To migrate memory back we first need to lock the special HMM CPU page
table entries so we know no one else might try to migrate those entries
back. The helper also allocates the new pages into which data will be
copied back from the device. Then we can proceed with the device DMA
operation.

Once DMA is done we can update the CPU page table again to point to
the new pages that hold the content copied back from device memory.

Note that we do not need to invalidate the range as we are only
modifying non present CPU page table entries.
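
The expected calling sequence on the HMM side is roughly as follows
(error handling elided; the device copy function is a hypothetical stand-in
for the driver DMA, and new_pte/hmm_pte are per-range arrays sized one
entry per page):

    /* 1) Lock the HMM entries and allocate destination pages. */
    ret = mm_hmm_migrate_back(mm, vma, new_pte, start, end);
    if (ret)
        return ret;

    /* 2) Copy data back from device memory into the new pages. */
    your_device_copy_from_device(mirror, hmm_pte, start, end);

    /* 3) Point the CPU page table at the newly populated pages. */
    mm_hmm_migrate_back_cleanup(mm, vma, new_pte, hmm_pte, start, end);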

Changed since v1:
- Save the memcg against which each page is precharged as it might
change along the way.

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/mm.h | 12 +++
mm/memory.c | 257 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 269 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c5c062e..1cd060f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2392,6 +2392,18 @@ static inline void hmm_mm_init(struct mm_struct *mm)
{
mm->hmm = NULL;
}
+
+int mm_hmm_migrate_back(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *new_pte,
+ unsigned long start,
+ unsigned long end);
+void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *new_pte,
+ dma_addr_t *hmm_pte,
+ unsigned long start,
+ unsigned long end);
#else /* !CONFIG_HMM */
static inline void hmm_mm_init(struct mm_struct *mm)
{
diff --git a/mm/memory.c b/mm/memory.c
index 3cb3653..d917911a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3513,6 +3513,263 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
EXPORT_SYMBOL_GPL(handle_mm_fault);

+
+#ifdef CONFIG_HMM
+/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @new_pte: Array of new CPU page table entry value.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function will lock HMM page table entries and allocate a new page for
+ * each entry it successfully locked.
+ */
+int mm_hmm_migrate_back(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *new_pte,
+ unsigned long start,
+ unsigned long end)
+{
+ pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
+ unsigned long addr, i;
+ int ret = 0;
+
+ VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+
+ if (unlikely(anon_vma_prepare(vma)))
+ return -ENOMEM;
+
+ start &= PAGE_MASK;
+ end = PAGE_ALIGN(end);
+ memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
+
+ for (addr = start; addr < end;) {
+ unsigned long cstart, next;
+ spinlock_t *ptl;
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ptep;
+
+ pgdp = pgd_offset(mm, addr);
+ pudp = pud_offset(pgdp, addr);
+ /*
+ * Some other thread might already have migrated back the entry
+ * and freed the page table. Unlikely thought.
+ */
+ if (unlikely(!pudp)) {
+ addr = min((addr + PUD_SIZE) & PUD_MASK, end);
+ continue;
+ }
+ pmdp = pmd_offset(pudp, addr);
+ if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+ pmd_trans_huge(*pmdp))) {
+ addr = min((addr + PMD_SIZE) & PMD_MASK, end);
+ continue;
+ }
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
+ next = min((addr + PMD_SIZE) & PMD_MASK, end);
+ addr < next; addr += PAGE_SIZE, ptep++, i++) {
+ swp_entry_t entry;
+
+ entry = pte_to_swp_entry(*ptep);
+ if (pte_none(*ptep) || pte_present(*ptep) ||
+ !is_hmm_entry(entry) ||
+ is_hmm_entry_locked(entry))
+ continue;
+
+ set_pte_at(mm, addr, ptep, hmm_entry);
+ new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
+ vma->vm_page_prot));
+ }
+ pte_unmap_unlock(ptep - 1, ptl);
+
+ for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+ addr < next; addr += PAGE_SIZE, i++) {
+ struct mem_cgroup *memcg;
+ struct page *page;
+
+ if (!pte_present(new_pte[i]))
+ continue;
+
+ page = alloc_zeroed_user_highpage_movable(vma, addr);
+ if (!page) {
+ ret = -ENOMEM;
+ break;
+ }
+ __SetPageUptodate(page);
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
+ &memcg)) {
+ page_cache_release(page);
+ ret = -ENOMEM;
+ break;
+ }
+ /*
+ * We can safely reuse the s_mem/mapping field of page
+ * struct to store the memcg as the page is only seen
+ * by HMM at this point and we can clear it before it
+ * is public see mm_hmm_migrate_back_cleanup().
+ */
+ page->s_mem = memcg;
+ new_pte[i] = mk_pte(page, vma->vm_page_prot);
+ if (vma->vm_flags & VM_WRITE) {
+ new_pte[i] = pte_mkdirty(new_pte[i]);
+ new_pte[i] = pte_mkwrite(new_pte[i]);
+ }
+ }
+
+ if (!ret)
+ continue;
+
+ hmm_entry = swp_entry_to_pte(make_hmm_entry());
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+ addr < next; addr += PAGE_SIZE, ptep++, i++) {
+ unsigned long pfn = pte_pfn(new_pte[i]);
+
+ if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
+ continue;
+
+ set_pte_at(mm, addr, ptep, hmm_entry);
+ pte_clear(mm, addr, &new_pte[i]);
+ }
+ pte_unmap_unlock(ptep - 1, ptl);
+ break;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(mm_hmm_migrate_back);
+
+/* mm_hmm_migrate_back_cleanup() - set CPU page table entry to new page.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @new_pte: Array of new CPU page table entry value.
+ * @hmm_pte: Array of HMM table entry indicating if migration was successful.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This is called after mm_hmm_migrate_back() and after the effective migration.
+ * It will set the CPU page table entries to new values pointing to the newly
+ * allocated pages where the data was copied back from device memory.
+ *
+ * Any failure will trigger a BUG_ON().
+ *
+ * TODO: For copy failure we might simply set a new value for the HMM special
+ * entry indicating poisonous entry.
+ */
+void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *new_pte,
+ dma_addr_t *hmm_pte,
+ unsigned long start,
+ unsigned long end)
+{
+ pte_t hmm_poison = swp_entry_to_pte(make_hmm_entry_poisonous());
+ unsigned long addr, i;
+
+ for (addr = start; addr < end;) {
+ unsigned long cstart, next, free_pages;
+ spinlock_t *ptl;
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ptep;
+
+ /*
+ * We know for certain that we did set special swap entries for
+ * the range and the HMM entries are marked as locked, so no one
+ * beside us can modify them, which implies that all levels of
+ * the CPU page table are valid.
+ */
+ pgdp = pgd_offset(mm, addr);
+ pudp = pud_offset(pgdp, addr);
+ VM_BUG_ON(!pudp);
+ pmdp = pmd_offset(pudp, addr);
+ VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+ pmd_trans_huge(*pmdp));
+
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+ cstart = addr, i = (addr - start) >> PAGE_SHIFT,
+ free_pages = 0; addr < next; addr += PAGE_SIZE,
+ ptep++, i++) {
+ struct mem_cgroup *memcg;
+ swp_entry_t entry;
+ struct page *page;
+
+ if (!pte_present(new_pte[i]))
+ continue;
+
+ entry = pte_to_swp_entry(*ptep);
+
+ /*
+ * Sanity catch all the things that could go wrong but
+ * should not, no plan B here.
+ */
+ VM_BUG_ON(pte_none(*ptep));
+ VM_BUG_ON(pte_present(*ptep));
+ VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+ if (!hmm_pte_test_valid_dma(&hmm_pte[i]) &&
+ !hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+ set_pte_at(mm, addr, ptep, hmm_poison);
+ free_pages++;
+ continue;
+ }
+
+ page = pte_page(new_pte[i]);
+
+ /*
+ * Up to now the s_mem/mapping field stored the memcg
+ * against which the page was pre-charged. Save it and
+ * clear the field so PageAnon() returns false.
+ */
+ memcg = page->s_mem;
+ page->s_mem = NULL;
+
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, addr);
+ mem_cgroup_commit_charge(page, memcg, false);
+ lru_cache_add_active_or_unevictable(page, vma);
+ set_pte_at(mm, addr, ptep, new_pte[i]);
+ update_mmu_cache(vma, addr, ptep);
+ pte_clear(mm, addr, &new_pte[i]);
+ }
+ pte_unmap_unlock(ptep - 1, ptl);
+
+ if (!free_pages)
+ continue;
+
+ for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+ addr < next; addr += PAGE_SIZE, i++) {
+ struct mem_cgroup *memcg;
+ struct page *page;
+
+ if (!pte_present(new_pte[i]))
+ continue;
+
+ page = pte_page(new_pte[i]);
+
+ /*
+ * Up to now the s_mem/mapping field stored the memcg
+ * against which the page was pre-charged.
+ */
+ memcg = page->s_mem;
+ page->s_mem = NULL;
+
+ mem_cgroup_cancel_charge(page, memcg);
+ page_cache_release(page);
+ }
+ }
+}
+EXPORT_SYMBOL(mm_hmm_migrate_back_cleanup);
+#endif
+
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
--
2.4.3

2016-03-08 19:49:12

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 22/29] HMM: mm add helper to update page table when migrating memory v3.

To migrate memory to remote memory we need to unmap a range of
anonymous memory from the CPU page table and replace the page table
entries with special HMM entries.

This is a multi-stage process. First we save and replace the page
table entries with special HMM entries, flushing the tlb in the
process. If we run into a non allocated entry we either use the zero
page or we allocate a new page. For swapped entries we try to swap
them in.

Once we have set the page table entries to the special entry we check
the page backing each of the addresses to make sure that only page
table mappings are holding references on the page, which means we
can safely migrate the page to device memory. Because the CPU page
table entries are special entries, no get_user_pages() can reference
the page any longer. So we are safe from races on that front. Note
that the page can still be referenced by get_user_pages() from
another process, but in that case the page is write protected and,
as we do not drop the mapcount nor the page count, we know that
all users of get_user_pages() are only doing read only accesses (on
write access they would allocate a new page).

Once we have identified all the pages that are safe to migrate, the
first function returns and lets HMM schedule the migration with the
device driver.

Finally there is a cleanup function that will drop the mapcount and
reference count on all pages that have been successfully migrated,
or restore the page table entries otherwise.
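
The expected calling sequence on the HMM side is roughly as follows
(error and backoff handling elided; the device copy function is a
hypothetical stand-in for the driver DMA, 'exclude' stands for HMM's own
mmu_notifier, and save_pte/hmm_pte are per-range arrays sized one entry
per page):

    /* 1) Unmap the range and install locked HMM entries; 'exclude' makes
     *    sure HMM's own mmu_notifier callback is not re-invoked. */
    ret = mm_hmm_migrate(mm, vma, save_pte, &backoff, exclude, start, end);
    if (ret)
        return ret;

    /* 2) Copy the pages recorded in save_pte to device memory. */
    your_device_copy_to_device(mirror, save_pte, hmm_pte, start, end);

    /* 3) Drop references on migrated pages or restore the ptes. */
    mm_hmm_migrate_cleanup(mm, vma, save_pte, hmm_pte, start, end);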

Changed since v1:
- Fix pmd/pte allocation when migrating.
- Fix reverse logic on mm_forbids_zeropage()
- Add comment on why we add new pages to the lru list.

Changed since v2:
- Adapt to thp changes.

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/mm.h | 14 ++
mm/memory.c | 498 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 508 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cd060f..7ff15d9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2393,6 +2393,20 @@ static inline void hmm_mm_init(struct mm_struct *mm)
mm->hmm = NULL;
}

+int mm_hmm_migrate(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *save_pte,
+ bool *backoff,
+ const void *mmu_notifier_exclude,
+ unsigned long start,
+ unsigned long end);
+void mm_hmm_migrate_cleanup(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *save_pte,
+ dma_addr_t *hmm_pte,
+ unsigned long start,
+ unsigned long end);
+
int mm_hmm_migrate_back(struct mm_struct *mm,
struct vm_area_struct *vma,
pte_t *new_pte,
diff --git a/mm/memory.c b/mm/memory.c
index d917911a..dd7470e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -55,6 +55,7 @@
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
#include <linux/hmm.h>
+#include <linux/hmm_pt.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
#include <linux/elf.h>
@@ -3602,7 +3603,7 @@ int mm_hmm_migrate_back(struct mm_struct *mm,
}
__SetPageUptodate(page);
if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
- &memcg)) {
+ &memcg, false)) {
page_cache_release(page);
ret = -ENOMEM;
break;
@@ -3732,8 +3733,8 @@ void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
page->s_mem = NULL;

inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, addr);
- mem_cgroup_commit_charge(page, memcg, false);
+ page_add_new_anon_rmap(page, vma, addr, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
set_pte_at(mm, addr, ptep, new_pte[i]);
update_mmu_cache(vma, addr, ptep);
@@ -3761,12 +3762,501 @@ void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
memcg = page->s_mem;
page->s_mem = NULL;

- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
}
}
}
EXPORT_SYMBOL(mm_hmm_migrate_back_cleanup);
+
+/* mm_hmm_migrate() - unmap range and set special HMM pte for it.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @save_pte: array where to save current CPU page table entry value.
+ * @backoff: Pointer toward a boolean indicating that we need to stop.
+ * @exclude: The mmu_notifier listener to exclude from mmu_notifier callback.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ * Returns: 0 on success, -EINVAL if some arguments were invalid, -ENOMEM if
+ * it failed allocating memory for performing the operation, -EFAULT if some
+ * memory backing the range is in a bad state, -EAGAIN if the backoff flag
+ * turned to true.
+ *
+ * The process of memory migration is a bit involved: first we must set all CPU
+ * page table entries to the special HMM locked entry, ensuring us exclusive
+ * control over the page table entries (ie no other process can change the page
+ * table but us).
+ *
+ * While doing that we must handle empty and swapped entries. For empty entries
+ * we either use the zero page or allocate a new page. For swap entries we call
+ * __handle_mm_fault() to try to fault in the page (a swap entry can be a number
+ * of things).
+ *
+ * Once we have unmapped we need to check that we can effectively migrate the
+ * page, by testing that no one is holding a reference on the page besides the
+ * references taken by each page mapping.
+ *
+ * On success every valid entry inside save_pte array is an entry that can be
+ * migrated.
+ *
+ * Note that this function does not free any of the pages, nor does it update
+ * the various memcg counters (the exception being accounting for new allocations).
+ * This happens inside the mm_hmm_migrate_cleanup() function.
+ *
+ */
+int mm_hmm_migrate(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *save_pte,
+ bool *backoff,
+ const void *mmu_notifier_exclude,
+ unsigned long start,
+ unsigned long end)
+{
+ pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
+ struct mmu_notifier_range range = {
+ .start = start,
+ .end = end,
+ .event = MMU_MIGRATE,
+ };
+ unsigned long addr = start, i;
+ struct mmu_gather tlb;
+ int ret = 0;
+
+ /* Only allow anonymous mapping and sanity check arguments. */
+ if (vma->vm_ops || unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)))
+ return -EINVAL;
+ start &= PAGE_MASK;
+ end = PAGE_ALIGN(end);
+ if (start >= end || end > vma->vm_end)
+ return -EINVAL;
+
+ /* Only need to test on the last address of the range. */
+ if (check_stack_guard_page(vma, end) < 0)
+ return -EFAULT;
+
+ /* Try to fail early on. */
+ if (unlikely(anon_vma_prepare(vma)))
+ return -ENOMEM;
+
+retry:
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, range.start, range.end);
+ update_hiwater_rss(mm);
+ mmu_notifier_invalidate_range_start_excluding(mm, &range,
+ mmu_notifier_exclude);
+ tlb_start_vma(&tlb, vma);
+ for (addr = range.start, i = 0; addr < end && !ret;) {
+ unsigned long cstart, next, npages = 0;
+ spinlock_t *ptl;
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ptep;
+
+ /*
+ * Pretty much the exact same logic as __handle_mm_fault(),
+ * exception being the handling of huge pmd.
+ */
+ pgdp = pgd_offset(mm, addr);
+ pudp = pud_alloc(mm, pgdp, addr);
+ if (!pudp) {
+ ret = -ENOMEM;
+ break;
+ }
+ pmdp = pmd_alloc(mm, pudp, addr);
+ if (!pmdp) {
+ ret = -ENOMEM;
+ break;
+ }
+ if (unlikely(pte_alloc(mm, pmdp, addr))) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (unlikely(pmd_trans_unstable(pmdp) || pmd_devmap(*pmdp))) {
+ ret = -EAGAIN;
+ break;
+ }
+
+ /*
+ * If a huge pmd materialized from under us, split it and break
+ * out of the loop to retry.
+ */
+ if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp)) {
+ split_huge_pmd(vma, addr, pmdp);
+ ret = -EAGAIN;
+ break;
+ }
+
+ /*
+ * A regular pmd is established and it can't morph into a huge pmd
+ * from under us anymore at this point because we hold the mmap_sem
+ * in read mode and khugepaged takes it in write mode. So now it's
+ * safe to run pte_offset_map_lock().
+ */
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ for (i = (addr - start) >> PAGE_SHIFT, cstart = addr,
+ next = min((addr + PMD_SIZE) & PMD_MASK, end);
+ addr < next; addr += PAGE_SIZE, ptep++, i++) {
+ save_pte[i] = ptep_get_and_clear(mm, addr, ptep);
+ tlb_remove_tlb_entry(&tlb, ptep, addr);
+ set_pte_at(mm, addr, ptep, hmm_entry);
+
+ if (pte_present(save_pte[i]))
+ continue;
+
+ if (!pte_none(save_pte[i])) {
+ set_pte_at(mm, addr, ptep, save_pte[i]);
+ ret = -ENOENT;
+ ptep++;
+ break;
+ }
+ /*
+ * TODO: This mm_forbids_zeropage() really does not
+ * apply to us. First, it seems only S390 has it set;
+ * second, we are not even using the zero page entry
+ * to populate the CPU page table, though on error
+ * we might use the save_pte entry to set the CPU
+ * page table entry.
+ *
+ * Live with that oddity for now.
+ */
+ if (mm_forbids_zeropage(mm)) {
+ pte_clear(mm, addr, &save_pte[i]);
+ npages++;
+ continue;
+ }
+ save_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
+ vma->vm_page_prot));
+ }
+ pte_unmap_unlock(ptep - 1, ptl);
+
+ /*
+ * So we must allocate pages before checking for error, which
+ * here indicates that one entry is a swap entry. We need to
+ * allocate first because otherwise there is no easy way to
+ * know, on retry or in the error code path, whether the CPU
+ * page table locked HMM entry is ours or from some other thread.
+ */
+
+ if (!npages)
+ continue;
+
+ for (next = addr, addr = cstart,
+ i = (addr - start) >> PAGE_SHIFT;
+ addr < next; addr += PAGE_SIZE, i++) {
+ struct mem_cgroup *memcg;
+ struct page *page;
+
+ if (pte_present(save_pte[i]) || !pte_none(save_pte[i]))
+ continue;
+
+ page = alloc_zeroed_user_highpage_movable(vma, addr);
+ if (!page) {
+ ret = -ENOMEM;
+ break;
+ }
+ __SetPageUptodate(page);
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
+ &memcg, false)) {
+ page_cache_release(page);
+ ret = -ENOMEM;
+ break;
+ }
+ save_pte[i] = mk_pte(page, vma->vm_page_prot);
+ if (vma->vm_flags & VM_WRITE)
+ save_pte[i] = pte_mkwrite(save_pte[i]);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
+ /*
+ * Because we set the page table entry to the special
+ * HMM locked entry we know no other process might do
+ * anything with it and thus we can safely account the
+ * page without holding any lock at this point.
+ */
+ page_add_new_anon_rmap(page, vma, addr, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
+ /*
+ * Add to active list so we know vmscan will not waste
+ * its time with that page while we are still using it.
+ */
+ lru_cache_add_active_or_unevictable(page, vma);
+ }
+ }
+ tlb_end_vma(&tlb, vma);
+ mmu_notifier_invalidate_range_end_excluding(mm, &range,
+ mmu_notifier_exclude);
+ tlb_finish_mmu(&tlb, range.start, range.end);
+
+ if (backoff && *backoff) {
+ /* Stick to the range we updated. */
+ ret = -EAGAIN;
+ end = addr;
+ goto out;
+ }
+
+ /* Check if something is missing or something went wrong. */
+ if (ret == -ENOENT) {
+ int flags = FAULT_FLAG_ALLOW_RETRY;
+
+ do {
+ /*
+ * Using __handle_mm_fault() as current->mm != mm, ie we
+ * might have been called from a kernel thread on behalf
+ * of a driver, and all the accounting handle_mm_fault()
+ * does is pointless in our case.
+ */
+ ret = __handle_mm_fault(mm, vma, addr, flags);
+ flags |= FAULT_FLAG_TRIED;
+ } while ((ret & VM_FAULT_RETRY));
+ if ((ret & VM_FAULT_ERROR)) {
+ /* Stick to the range we updated. */
+ end = addr;
+ ret = -EFAULT;
+ goto out;
+ }
+ range.start = addr;
+ goto retry;
+ }
+ if (ret == -EAGAIN) {
+ range.start = addr;
+ goto retry;
+ }
+ if (ret)
+ /* Stick to the range we updated. */
+ end = addr;
+
+ /*
+ * At this point no one else can take a reference on the page through this
+ * process's CPU page table. So we can safely check whether we can migrate
+ * the page or not.
+ */
+
+out:
+ for (addr = start, i = 0; addr < end;) {
+ unsigned long next;
+ spinlock_t *ptl;
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ptep;
+
+ /*
+ * We know for certain that we did set the special swap entries for
+ * the range and that the HMM entries are marked as locked, so no
+ * one besides us can modify them, which implies that all levels
+ * of the CPU page table are valid.
+ */
+ pgdp = pgd_offset(mm, addr);
+ pudp = pud_offset(pgdp, addr);
+ VM_BUG_ON(!pudp);
+ pmdp = pmd_offset(pudp, addr);
+ VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+ pmd_trans_huge(*pmdp));
+
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+ i = (addr - start) >> PAGE_SHIFT; addr < next;
+ addr += PAGE_SIZE, ptep++, i++) {
+ struct page *page;
+ swp_entry_t entry;
+ int swapped;
+
+ entry = pte_to_swp_entry(save_pte[i]);
+ if (is_hmm_entry(entry)) {
+ /*
+ * The logic here is pretty involved. If save_pte
+ * is an HMM special swap entry then it means that
+ * we failed to swap in that page, so an error must
+ * be set.
+ *
+ * If that's not the case then it means we are
+ * seriously screwed.
+ */
+ VM_BUG_ON(!ret);
+ continue;
+ }
+
+ /*
+ * This can not happen: no one else can replace our
+ * special entry, and the range end is readjusted on
+ * error.
+ */
+ entry = pte_to_swp_entry(*ptep);
+ VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+ /* On error or backoff restore all the saved pte. */
+ if (ret)
+ goto restore;
+
+ page = vm_normal_page(vma, addr, save_pte[i]);
+ /* The zero page is fine to migrate. */
+ if (!page)
+ continue;
+
+ /*
+ * Check that only CPU mappings hold a reference on the
+ * page. To make things simpler we just bail out if
+ * page_mapcount() != page_count() (also accounting
+ * for swap cache).
+ *
+ * There is a small window here where wp_page_copy()
+ * might have decremented the mapcount but not yet
+ * decremented the page count. This is not an issue as
+ * we back off in that case.
+ */
+ swapped = PageSwapCache(page);
+ if (page_mapcount(page) + swapped == page_count(page))
+ continue;
+
+restore:
+ /* Ok we have to restore that page. */
+ set_pte_at(mm, addr, ptep, save_pte[i]);
+ /*
+ * No need to invalidate - it was non-present
+ * before.
+ */
+ update_mmu_cache(vma, addr, ptep);
+ pte_clear(mm, addr, &save_pte[i]);
+ }
+ pte_unmap_unlock(ptep - 1, ptl);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(mm_hmm_migrate);
+
+/* mm_hmm_migrate_cleanup() - unmap range cleanup.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @save_pte: Array holding the CPU page table entry values saved by
+ * mm_hmm_migrate().
+ * @hmm_pte: Array of HMM table entries indicating if migration was successful.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This is called after mm_hmm_migrate() and after the effective migration. It
+ * restores the CPU page table entries for pages that have not been migrated,
+ * or in case of failure.
+ *
+ * It frees the pages that have been migrated, updates the appropriate
+ * counters, and also "unlocks" the special HMM pte entries.
+ */
+void mm_hmm_migrate_cleanup(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *save_pte,
+ dma_addr_t *hmm_pte,
+ unsigned long start,
+ unsigned long end)
+{
+ pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry());
+ struct page *pages[MMU_GATHER_BUNDLE];
+ unsigned long addr, c, i;
+
+ for (addr = start, i = 0; addr < end;) {
+ unsigned long next;
+ spinlock_t *ptl;
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ptep;
+
+ /*
+ * We know for certain that we did set the special swap entries for
+ * the range and that the HMM entries are marked as locked, so no
+ * one besides us can modify them, which implies that all levels
+ * of the CPU page table are valid.
+ */
+ pgdp = pgd_offset(mm, addr);
+ pudp = pud_offset(pgdp, addr);
+ VM_BUG_ON(!pudp);
+ pmdp = pmd_offset(pudp, addr);
+ VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+ pmd_trans_huge(*pmdp));
+
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+ i = (addr - start) >> PAGE_SHIFT; addr < next;
+ addr += PAGE_SIZE, ptep++, i++) {
+ struct page *page;
+ swp_entry_t entry;
+
+ /*
+ * This can't happen: no one else can replace our
+ * precious special entry.
+ */
+ entry = pte_to_swp_entry(*ptep);
+ VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+ if (!hmm_pte_test_valid_dev(&hmm_pte[i])) {
+ /* Ok we have to restore that page. */
+ set_pte_at(mm, addr, ptep, save_pte[i]);
+ /*
+ * No need to invalidate - it was non-present
+ * before.
+ */
+ update_mmu_cache(vma, addr, ptep);
+ pte_clear(mm, addr, &save_pte[i]);
+ continue;
+ }
+
+ /* Set unlocked entry. */
+ set_pte_at(mm, addr, ptep, hmm_entry);
+ /*
+ * No need to invalidate - it was non-present
+ * before.
+ */
+ update_mmu_cache(vma, addr, ptep);
+
+ page = vm_normal_page(vma, addr, save_pte[i]);
+ /* The zero page is fine to migrate. */
+ if (!page)
+ continue;
+
+ page_remove_rmap(page, false);
+ dec_mm_counter_fast(mm, MM_ANONPAGES);
+ }
+ pte_unmap_unlock(ptep - 1, ptl);
+ }
+
+ /* Free pages. */
+ for (addr = start, i = 0, c = 0; addr < end; i++, addr += PAGE_SIZE) {
+ if (pte_none(save_pte[i]))
+ continue;
+ if (c >= MMU_GATHER_BUNDLE) {
+ /*
+ * TODO: What we really want to do is keep the memory
+ * accounted inside the memory cgroup and inside rss
+ * while still freeing the page, so that migration
+ * back from device memory will not fail because we
+ * went over the memory cgroup limit.
+ */
+ free_pages_and_swap_cache(pages, c);
+ c = 0;
+ }
+ pages[c] = vm_normal_page(vma, addr, save_pte[i]);
+ c = pages[c] ? c + 1 : c;
+ }
+}
+EXPORT_SYMBOL(mm_hmm_migrate_cleanup);
#endif
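For readers, here is a minimal sketch of the intended calling sequence around
these two helpers. It mirrors the real user, hmm_mirror_migrate(), added later
in the series; device_copy_pages() is a hypothetical stand-in for the driver
copy step and is not part of this patchset:

	ret = mm_hmm_migrate(hmm->mm, vma, save_pte, &event->backoff,
			     &hmm->mmu_notifier, start, end);
	if (ret)
		return ret;	/* -EAGAIN, -ENOMEM, -EFAULT, ... */

	/*
	 * Driver step (hypothetical helper): copy every page referenced
	 * by save_pte into device memory and set the valid device bit in
	 * the matching hmm_pte entry for each page that made it.
	 */
	device_copy_pages(save_pte, hmm_pte, start, end);

	/*
	 * Restore CPU entries for pages that did not migrate, free the
	 * pages that did, and unlock the special HMM entries.
	 */
	mm_hmm_migrate_cleanup(hmm->mm, vma, save_pte, hmm_pte, start, end);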


--
2.4.3

2016-03-08 19:49:19

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 24/29] HMM: allow to get pointer to spinlock protecting a directory.

There are several use cases for getting a pointer to the spinlock protecting a
directory.
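A typical consumer (the DMA mapping split later in this series) grabs the
pointer once so a shared helper can either take the directory lock itself or
be told to skip locking by being handed NULL, roughly:

	spinlock_t *lock;

	lock = hmm_pt_iter_directory_lock_ptr(iter);
	ret = hmm_mirror_dma_map_range(mirror, hmm_pte, lock, npages);

	/* ... and inside such a helper: */
	if (lock)
		spin_lock(lock);
	/* update the hmm_pte[] entries covered by this directory */
	if (lock)
		spin_unlock(lock);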

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/hmm_pt.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index f745d6c..22100a6 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -255,6 +255,16 @@ static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
spin_lock(&pt->lock);
}

+static inline spinlock_t *hmm_pt_directory_lock_ptr(struct hmm_pt *pt,
+ struct page *ptd,
+ unsigned level)
+{
+ if (level)
+ return &ptd->ptl;
+ else
+ return &pt->lock;
+}
+
static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
struct page *ptd,
unsigned level)
@@ -272,6 +282,13 @@ static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
spin_lock(&pt->lock);
}

+static inline spinlock_t *hmm_pt_directory_lock_ptr(struct hmm_pt *pt,
+ struct page *ptd,
+ unsigned level)
+{
+ return &pt->lock;
+}
+
static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
struct page *ptd,
unsigned level)
@@ -358,6 +375,14 @@ static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter)
hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
}

+static inline spinlock_t *hmm_pt_iter_directory_lock_ptr(struct hmm_pt_iter *i)
+{
+ struct hmm_pt *pt = i->pt;
+
+ return hmm_pt_directory_lock_ptr(pt, i->ptd[pt->llevel - 1],
+ pt->llevel);
+}
+
static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter)
{
struct hmm_pt *pt = iter->pt;
--
2.4.3

2016-03-08 19:49:25

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 27/29] HMM: fork copy migrated memory into system memory for child process.

When forking, if the process being forked had any memory migrated to
device memory, we need to make a system memory copy for the child
process. Later patches can revisit this and use the same COW semantics
for device memory.

Signed-off-by: Jérôme Glisse <[email protected]>
---
mm/hmm.c | 38 +++++++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 435e376..4dcd98f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -483,7 +483,37 @@ int hmm_mm_fork(struct mm_struct *src_mm,
unsigned long start,
unsigned long end)
{
- return -ENOMEM;
+ unsigned long npages = (end - start) >> PAGE_SHIFT;
+ struct hmm_event event;
+ dma_addr_t *dst;
+ struct hmm *hmm;
+ pte_t *new_pte;
+ int ret;
+
+ hmm = hmm_ref(src_mm->hmm);
+ if (!hmm)
+ return -EINVAL;
+
+
+ dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
+ if (!dst) {
+ hmm_unref(hmm);
+ return -ENOMEM;
+ }
+ new_pte = kcalloc(npages, sizeof(*new_pte), GFP_KERNEL);
+ if (!new_pte) {
+ kfree(dst);
+ hmm_unref(hmm);
+ return -ENOMEM;
+ }
+
+ hmm_event_init(&event, hmm, start, end, HMM_FORK);
+ ret = hmm_migrate_back(hmm, &event, dst_mm, dst_vma, new_pte,
+ dst, start, end);
+ hmm_unref(hmm);
+ kfree(new_pte);
+ kfree(dst);
+ return ret;
}
EXPORT_SYMBOL(hmm_mm_fork);

@@ -665,6 +695,12 @@ static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
}

if (hmm_pte_test_valid_dev(hmm_pte)) {
+ /*
+ * On fork device memory is duplicated so no need to write
+ * protect it.
+ */
+ if (event->etype == HMM_FORK)
+ return;
*hmm_pte &= event->pte_mask;
if (!hmm_pte_test_valid_dev(hmm_pte))
hmm_pt_iter_directory_unref(iter);
--
2.4.3

2016-03-08 19:49:33

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 29/29] HMM: add mirror fault support for system to device memory migration v3.

Migration to device memory is done as a special kind of device mirror
fault. Memory migration is initiated by the device driver and never by
HMM (unless it is a migration back to system memory).
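From the driver's point of view the request is just another mirror fault; a
rough sketch follows (how the driver builds its struct hmm_event is up to the
driver, the exact initialization here is only illustrative):

	struct hmm_event event;

	event.start = start & PAGE_MASK;
	event.end = PAGE_ALIGN(end);
	event.etype = HMM_COPY_TO_DEVICE;
	event.backoff = false;

	/*
	 * Ends up in hmm_mirror_migrate(), which locks the CPU page table,
	 * invalidates the other mirrors and calls back into the driver
	 * through the copy_to_device() callback.
	 */
	ret = hmm_mirror_fault(mirror, &event);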

Changed since v1:
- Adapt to HMM page table changes.

Changed since v2:
- Fix error code path for migration, calling mm_hmm_migrate_cleanup()
is wrong.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
mm/hmm.c | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 170 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 38943a7..41637a3 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -53,6 +53,10 @@ static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
dma_addr_t *dst,
unsigned long start,
unsigned long end);
+static int hmm_mirror_migrate(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ struct hmm_pt_iter *iter);
static inline int hmm_mirror_update(struct hmm_mirror *mirror,
struct hmm_event *event,
struct page *page);
@@ -101,6 +105,12 @@ static inline int hmm_event_init(struct hmm_event *event,
return 0;
}

+static inline unsigned long hmm_event_npages(const struct hmm_event *event)
+{
+ return (PAGE_ALIGN(event->end) - (event->start & PAGE_MASK)) >>
+ PAGE_SHIFT;
+}
+

/* hmm - core HMM functions.
*
@@ -1255,6 +1265,9 @@ retry:
}

switch (event->etype) {
+ case HMM_COPY_TO_DEVICE:
+ ret = hmm_mirror_migrate(mirror, event, vma, &iter);
+ break;
case HMM_DEVICE_WFAULT:
if (!(vma->vm_flags & VM_WRITE)) {
ret = -EFAULT;
@@ -1392,6 +1405,163 @@ static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
return ret ? ret : r;
}

+static int hmm_mirror_migrate(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ struct hmm_pt_iter *iter)
+{
+ struct hmm_device *device = mirror->device;
+ struct hmm *hmm = mirror->hmm;
+ struct hmm_event invalidate;
+ unsigned long addr, npages;
+ struct hmm_mirror *tmp;
+ dma_addr_t *dst;
+ pte_t *save_pte;
+ int r = 0, ret;
+
+ /* Only allow migration of private anonymous memory. */
+ if (vma->vm_ops || unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)))
+ return -EINVAL;
+
+ /*
+ * TODO: More advanced loop for splitting the migration into several
+ * chunks. For now limit the amount that can be migrated in one shot.
+ * We would also need to see whether rescheduling is needed if this
+ * happens as part of a system call to the device driver.
+ */
+ npages = hmm_event_npages(event);
+ if (npages * max(sizeof(*dst), sizeof(*save_pte)) > PAGE_SIZE)
+ return -EINVAL;
+ dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
+ if (dst == NULL)
+ return -ENOMEM;
+ save_pte = kcalloc(npages, sizeof(*save_pte), GFP_KERNEL);
+ if (save_pte == NULL) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = mm_hmm_migrate(hmm->mm, vma, save_pte, &event->backoff,
+ &hmm->mmu_notifier, event->start, event->end);
+ if (ret == -EAGAIN)
+ goto out;
+ if (ret)
+ goto out;
+
+ /*
+ * Now invalidate for all other devices; note that they can not race
+ * with us as the CPU page table is full of special entries.
+ */
+ hmm_event_init(&invalidate, mirror->hmm, event->start,
+ event->end, HMM_MIGRATE);
+again:
+ down_read(&hmm->rwsem);
+ hlist_for_each_entry(tmp, &hmm->mirrors, mlist) {
+ if (tmp == mirror)
+ continue;
+ if (hmm_mirror_update(tmp, &invalidate, NULL)) {
+ hmm_mirror_ref(tmp);
+ up_read(&hmm->rwsem);
+ hmm_mirror_kill(tmp);
+ hmm_mirror_unref(&tmp);
+ goto again;
+ }
+ }
+ up_read(&hmm->rwsem);
+
+ /*
+ * Populate the mirror page table with the saved entries and also mark
+ * the entries that can be migrated.
+ */
+ for (addr = event->start; addr < event->end;) {
+ unsigned long i, idx, next = event->end, npages;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+ if (!hmm_pte) {
+ ret = -ENOMEM;
+ goto out_cleanup;
+ }
+
+ npages = (next - addr) >> PAGE_SHIFT;
+ idx = (addr - event->start) >> PAGE_SHIFT;
+ hmm_pt_iter_directory_lock(iter);
+ for (i = 0; i < npages; i++, idx++) {
+ hmm_pte_clear_select(&hmm_pte[i]);
+ if (!pte_present(save_pte[idx]))
+ continue;
+ hmm_pte_set_select(&hmm_pte[i]);
+ /* This can not be a valid device entry here. */
+ VM_BUG_ON(hmm_pte_test_valid_dev(&hmm_pte[i]));
+ if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+ continue;
+
+ if (hmm_pte_test_valid_pfn(&hmm_pte[i]))
+ continue;
+
+ hmm_pt_iter_directory_ref(iter);
+ hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(save_pte[idx]));
+ if (pte_write(save_pte[idx]))
+ hmm_pte_set_write(&hmm_pte[i]);
+ hmm_pte_set_select(&hmm_pte[i]);
+ }
+ hmm_pt_iter_directory_unlock(iter);
+
+ if (device->dev) {
+ spinlock_t *lock;
+
+ lock = hmm_pt_iter_directory_lock_ptr(iter);
+ ret = hmm_mirror_dma_map_range(mirror, hmm_pte,
+ lock, npages);
+ /* Keep going only for entries that have been mapped. */
+ if (ret) {
+ for (i = 0; i < npages; ++i) {
+ if (!hmm_pte_test_select(&dst[i]))
+ continue;
+ if (hmm_pte_test_valid_dma(&dst[i]))
+ continue;
+ hmm_pte_clear_select(&hmm_pte[i]);
+ }
+ }
+ }
+ addr = next;
+ }
+
+ /* Now Waldo we can do the copy. */
+ r = device->ops->copy_to_device(mirror, event, vma, dst,
+ event->start, event->end);
+
+ /* Update mirror page table with successfully migrated entries. */
+ for (addr = event->start; addr < event->end;) {
+ unsigned long i, idx, next = event->end, npages;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte = hmm_pt_iter_walk(iter, &addr, &next);
+ if (!hmm_pte)
+ continue;
+ npages = (next - addr) >> PAGE_SHIFT;
+ idx = (addr - event->start) >> PAGE_SHIFT;
+ hmm_pt_iter_directory_lock(iter);
+ for (i = 0; i < npages; i++, idx++) {
+ if (!hmm_pte_test_valid_dev(&dst[idx]))
+ continue;
+
+ VM_BUG_ON(!hmm_pte_test_select(&hmm_pte[i]));
+ hmm_pte[i] = dst[idx];
+ }
+ hmm_pt_iter_directory_unlock(iter);
+ addr = next;
+ }
+
+out_cleanup:
+ mm_hmm_migrate_cleanup(hmm->mm, vma, save_pte, dst,
+ event->start, event->end);
+out:
+ kfree(save_pte);
+ kfree(dst);
+ return ret ? ret : r;
+}
+
/* hmm_mirror_range_discard() - discard a range of address.
*
* @mirror: The mirror struct.
--
2.4.3

2016-03-08 19:48:54

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 20/29] HMM: handle HMM device page table entry on mirror page table fault and update.

When faulting on or updating the device page table, properly handle the
case of a device memory entry.

Signed-off-by: Jérôme Glisse <[email protected]>
---
mm/hmm.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index a5706d2..9455443 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -616,6 +616,13 @@ static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
goto out;
}

+ if (hmm_pte_test_valid_dev(hmm_pte)) {
+ *hmm_pte &= event->pte_mask;
+ if (!hmm_pte_test_valid_dev(hmm_pte))
+ hmm_pt_iter_directory_unref(iter);
+ return;
+ }
+
if (!hmm_pte_test_valid_dma(hmm_pte))
return;

@@ -808,6 +815,12 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
ptep = pte_offset_map(pmdp, start);
hmm_pt_iter_directory_lock(iter);
do {
+ if (hmm_pte_test_valid_dev(&hmm_pte[i])) {
+ if (write)
+ hmm_pte_set_write(&hmm_pte[i]);
+ continue;
+ }
+
if (!pte_present(*ptep) ||
(write && !pte_write(*ptep)) ||
pte_protnone(*ptep)) {
--
2.4.3

2016-03-08 19:50:57

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 28/29] HMM: CPU page fault on migrated memory.

When the CPU tries to access memory that has been migrated to device
memory, we have to copy it back to system memory. This patch implements
the CPU page fault handler for the special HMM pte swap entries.
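The hook into the core fault path is not part of this patch; conceptually the
dispatch looks like the sketch below (the do_swap_page() placement and exact
argument list are assumptions on my part, only hmm_handle_cpu_fault() and
is_hmm_entry() come from this series):

	/* Sketch only, somewhere in the core mm swap entry path: */
	entry = pte_to_swp_entry(orig_pte);
	if (is_hmm_entry(entry))
		return hmm_handle_cpu_fault(mm, vma, pmd, address,
					    flags, orig_pte);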

Signed-off-by: Jérôme Glisse <[email protected]>
---
mm/hmm.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 4dcd98f..38943a7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -472,7 +472,59 @@ int hmm_handle_cpu_fault(struct mm_struct *mm,
pmd_t *pmdp, unsigned long addr,
unsigned flags, pte_t orig_pte)
{
- return VM_FAULT_SIGBUS;
+ unsigned long start, end;
+ struct hmm_event event;
+ swp_entry_t entry;
+ struct hmm *hmm;
+ dma_addr_t dst;
+ pte_t new_pte;
+ int ret;
+
+ /* First check for poisonous entry. */
+ entry = pte_to_swp_entry(orig_pte);
+ if (is_hmm_entry_poisonous(entry))
+ return VM_FAULT_SIGBUS;
+
+ hmm = hmm_ref(mm->hmm);
+ if (!hmm) {
+ pte_t poison = swp_entry_to_pte(make_hmm_entry_poisonous());
+ spinlock_t *ptl;
+ pte_t *ptep;
+
+ /* Check if cpu pte is already updated. */
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!pte_same(*ptep, orig_pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ return 0;
+ }
+ set_pte_at(mm, addr, ptep, poison);
+ pte_unmap_unlock(ptep, ptl);
+ return VM_FAULT_SIGBUS;
+ }
+
+ /*
+ * TODO: we likely want to migrate more than one page at a time; we need
+ * to call into the device driver to get a good hint on the range to copy
+ * back to system memory.
+ *
+ * For now just live with the one page at a time solution.
+ */
+ start = addr & PAGE_MASK;
+ end = start + PAGE_SIZE;
+ hmm_event_init(&event, hmm, start, end, HMM_COPY_FROM_DEVICE);
+
+ ret = hmm_migrate_back(hmm, &event, mm, vma, &new_pte,
+ &dst, start, end);
+ hmm_unref(hmm);
+ switch (ret) {
+ case 0:
+ return VM_FAULT_MAJOR;
+ case -ENOMEM:
+ return VM_FAULT_OOM;
+ case -EINVAL:
+ default:
+ return VM_FAULT_SIGBUS;
+ }
}
EXPORT_SYMBOL(hmm_handle_cpu_fault);

--
2.4.3

2016-03-08 19:51:23

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 26/29] HMM: add helpers for migration back to system memory v3.

This patch adds all the necessary functions and helpers for migration
from device memory back to system memory. There are 3 different cases
that would use that code:
- CPU page fault
- fork
- device driver request

Note that this patch uses regular memory accounting, which means that
migration can fail as a result of memory cgroup resource exhaustion.
Later patches will modify memcg to allow keeping remote memory
accounted as regular memory, thus removing this point of failure.
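For orientation, the two callers visible in patches 27 and 28 of this series
funnel into hmm_migrate_back() like this (the third case, an explicit device
driver request, follows the same pattern):

	/* CPU page fault path: copy back a single page. */
	hmm_event_init(&event, hmm, start, start + PAGE_SIZE,
		       HMM_COPY_FROM_DEVICE);
	ret = hmm_migrate_back(hmm, &event, mm, vma, &new_pte, &dst,
			       start, start + PAGE_SIZE);

	/* Fork path: duplicate the whole migrated range for the child. */
	hmm_event_init(&event, hmm, start, end, HMM_FORK);
	ret = hmm_migrate_back(hmm, &event, dst_mm, dst_vma, new_pte,
			       dst, start, end);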

Changed since v1:
- Fixed logic in dma unmap code path on migration error.

Changed since v2:
- Adapt to HMM page table changes.
- Fix bug in migration failure code path.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
mm/hmm.c | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 151 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 07f1ab6..435e376 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -47,6 +47,12 @@

static struct mmu_notifier_ops hmm_notifier_ops;
static void hmm_mirror_kill(struct hmm_mirror *mirror);
+static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ pte_t *new_pte,
+ dma_addr_t *dst,
+ unsigned long start,
+ unsigned long end);
static inline int hmm_mirror_update(struct hmm_mirror *mirror,
struct hmm_event *event,
struct page *page);
@@ -421,6 +427,46 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
};


+static int hmm_migrate_back(struct hmm *hmm,
+ struct hmm_event *event,
+ struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t *new_pte,
+ dma_addr_t *dst,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm_mirror *mirror;
+ int r, ret;
+
+ /*
+ * Do not return right away on error, as there might be valid pages we
+ * can still migrate.
+ */
+ ret = mm_hmm_migrate_back(mm, vma, new_pte, start, end);
+
+again:
+ down_read(&hmm->rwsem);
+ hlist_for_each_entry(mirror, &hmm->mirrors, mlist) {
+ r = hmm_mirror_migrate_back(mirror, event, new_pte,
+ dst, start, end);
+ if (r) {
+ ret = ret ? ret : r;
+ mirror = hmm_mirror_ref(mirror);
+ BUG_ON(!mirror);
+ up_read(&hmm->rwsem);
+ hmm_mirror_kill(mirror);
+ hmm_mirror_unref(&mirror);
+ goto again;
+ }
+ }
+ up_read(&hmm->rwsem);
+
+ mm_hmm_migrate_back_cleanup(mm, vma, new_pte, dst, start, end);
+
+ return ret;
+}
+
int hmm_handle_cpu_fault(struct mm_struct *mm,
struct vm_area_struct *vma,
pmd_t *pmdp, unsigned long addr,
@@ -1153,6 +1199,111 @@ out:
}
EXPORT_SYMBOL(hmm_mirror_fault);

+static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ pte_t *new_pte,
+ dma_addr_t *dst,
+ unsigned long start,
+ unsigned long end)
+{
+ unsigned long addr, i, npages = (end - start) >> PAGE_SHIFT;
+ struct hmm_device *device = mirror->device;
+ struct device *dev = mirror->device->dev;
+ struct hmm_pt_iter iter;
+ int r, ret = 0;
+
+ hmm_pt_iter_init(&iter, &mirror->pt);
+ for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, ++i) {
+ unsigned long next = end;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte_clear_select(&dst[i]);
+
+ if (!pte_present(new_pte[i]))
+ continue;
+ hmm_pte = hmm_pt_iter_lookup(&iter, addr, &next);
+ if (!hmm_pte)
+ continue;
+
+ if (!hmm_pte_test_valid_dev(hmm_pte))
+ continue;
+
+ dst[i] = hmm_pte_from_pfn(pte_pfn(new_pte[i]));
+ hmm_pte_set_select(&dst[i]);
+ hmm_pte_set_write(&dst[i]);
+ }
+
+ if (dev) {
+ ret = hmm_mirror_dma_map_range(mirror, dst, NULL, npages);
+ if (ret) {
+ for (i = 0; i < npages; ++i) {
+ if (!hmm_pte_test_select(&dst[i]))
+ continue;
+ if (hmm_pte_test_valid_dma(&dst[i]))
+ continue;
+ dst[i] = 0;
+ }
+ }
+ }
+
+ r = device->ops->copy_from_device(mirror, event, dst, start, end);
+
+ /* Update mirror page table with successfully migrated entries. */
+ for (addr = start; addr < end;) {
+ unsigned long idx, next = end, npages;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+ if (!hmm_pte)
+ continue;
+ idx = (addr - event->start) >> PAGE_SHIFT;
+ npages = (next - addr) >> PAGE_SHIFT;
+ hmm_pt_iter_directory_lock(&iter);
+ for (i = 0; i < npages; i++, idx++) {
+ if (!hmm_pte_test_valid_pfn(&dst[idx]) &&
+ !hmm_pte_test_valid_dma(&dst[idx])) {
+ if (hmm_pte_test_valid_dev(&hmm_pte[i])) {
+ hmm_pte[i] = 0;
+ hmm_pt_iter_directory_unref(&iter);
+ }
+ continue;
+ }
+
+ VM_BUG_ON(!hmm_pte_test_select(&dst[idx]));
+ VM_BUG_ON(!hmm_pte_test_valid_dev(&hmm_pte[i]));
+ hmm_pte[i] = dst[idx];
+ }
+ hmm_pt_iter_directory_unlock(&iter);
+
+ /* DMA unmap entries that failed to migrate. */
+ if (dev) {
+ idx = (addr - event->start) >> PAGE_SHIFT;
+ for (i = 0; i < npages; i++, idx++) {
+ dma_addr_t dma_addr;
+
+ /*
+ * Failed entries have the valid bit clear but
+ * the select bit remains set.
+ */
+ if (!hmm_pte_test_select(&dst[idx]) ||
+ hmm_pte_test_valid_dma(&dst[idx]))
+ continue;
+
+ hmm_pte_set_valid_dma(&dst[idx]);
+ dma_addr = hmm_pte_dma_addr(dst[idx]);
+ dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
+ dst[idx] = 0;
+ }
+ }
+
+ addr = next;
+ }
+ hmm_pt_iter_fini(&iter);
+
+ return ret ? ret : r;
+}
+
/* hmm_mirror_range_discard() - discard a range of address.
*
* @mirror: The mirror struct.
--
2.4.3

2016-03-08 19:51:40

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 25/29] HMM: split DMA mapping function in two.

To be able to reuse the DMA mapping logic, split it into two functions.

Signed-off-by: Jérôme Glisse <[email protected]>
---
mm/hmm.c | 120 ++++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 65 insertions(+), 55 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index d26abe4..07f1ab6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -910,76 +910,86 @@ static int hmm_mirror_fault_hugetlb_entry(pte_t *ptep,
return 0;
}

+static int hmm_mirror_dma_map_range(struct hmm_mirror *mirror,
+ dma_addr_t *hmm_pte,
+ spinlock_t *lock,
+ unsigned long npages)
+{
+ struct device *dev = mirror->device->dev;
+ unsigned long i;
+ int ret = 0;
+
+ for (i = 0; i < npages; i++) {
+ dma_addr_t dma_addr, pte;
+ struct page *page;
+
+again:
+ pte = ACCESS_ONCE(hmm_pte[i]);
+ if (!hmm_pte_test_valid_pfn(&pte) || !hmm_pte_test_select(&pte))
+ continue;
+
+ page = pfn_to_page(hmm_pte_pfn(pte));
+ VM_BUG_ON(!page);
+ dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
+ if (dma_mapping_error(dev, dma_addr)) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ /*
+ * Make sure we transfer the dirty bit. Note that there
+ * might still be a window for another thread to set
+ * the dirty bit before we check for pte equality. This
+ * will just lead to a useless retry so it is not the
+ * end of the world here.
+ */
+ if (lock)
+ spin_lock(lock);
+ if (hmm_pte_test_dirty(&hmm_pte[i]))
+ hmm_pte_set_dirty(&pte);
+ if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+ if (lock)
+ spin_unlock(lock);
+ dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
+ if (hmm_pte_test_valid_pfn(&hmm_pte[i]))
+ goto again;
+ continue;
+ }
+ hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+ if (hmm_pte_test_write(&pte))
+ hmm_pte_set_write(&hmm_pte[i]);
+ if (hmm_pte_test_dirty(&pte))
+ hmm_pte_set_dirty(&hmm_pte[i]);
+ if (lock)
+ spin_unlock(lock);
+ }
+
+ return ret;
+}
+
static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
struct hmm_pt_iter *iter,
unsigned long start,
unsigned long end)
{
- struct device *dev = mirror->device->dev;
unsigned long addr;
int ret;

for (ret = 0, addr = start; !ret && addr < end;) {
- unsigned long i = 0, next = end;
+ unsigned long next = end, npages;
dma_addr_t *hmm_pte;
+ spinlock_t *lock;

hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
if (!hmm_pte)
return -ENOENT;

- do {
- dma_addr_t dma_addr, pte;
- struct page *page;
-
-again:
- pte = ACCESS_ONCE(hmm_pte[i]);
- if (!hmm_pte_test_valid_pfn(&pte) ||
- !hmm_pte_test_select(&pte)) {
- if (!hmm_pte_test_valid_dma(&pte)) {
- ret = -ENOENT;
- break;
- }
- continue;
- }
-
- page = pfn_to_page(hmm_pte_pfn(pte));
- VM_BUG_ON(!page);
- dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
- DMA_BIDIRECTIONAL);
- if (dma_mapping_error(dev, dma_addr)) {
- ret = -ENOMEM;
- break;
- }
-
- hmm_pt_iter_directory_lock(iter);
- /*
- * Make sure we transfer the dirty bit. Note that there
- * might still be a window for another thread to set
- * the dirty bit before we check for pte equality. This
- * will just lead to a useless retry so it is not the
- * end of the world here.
- */
- if (hmm_pte_test_dirty(&hmm_pte[i]))
- hmm_pte_set_dirty(&pte);
- if (ACCESS_ONCE(hmm_pte[i]) != pte) {
- hmm_pt_iter_directory_unlock(iter);
- dma_unmap_page(dev, dma_addr, PAGE_SIZE,
- DMA_BIDIRECTIONAL);
- if (hmm_pte_test_valid_pfn(&pte))
- goto again;
- if (!hmm_pte_test_valid_dma(&pte)) {
- ret = -ENOENT;
- break;
- }
- } else {
- hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
- if (hmm_pte_test_write(&pte))
- hmm_pte_set_write(&hmm_pte[i]);
- if (hmm_pte_test_dirty(&pte))
- hmm_pte_set_dirty(&hmm_pte[i]);
- hmm_pt_iter_directory_unlock(iter);
- }
- } while (addr += PAGE_SIZE, i++, addr != next && !ret);
+ npages = (next - addr) >> PAGE_SHIFT;
+ lock = hmm_pt_iter_directory_lock_ptr(iter);
+ ret = hmm_mirror_dma_map_range(mirror, hmm_pte, lock, npages);
+ addr = next;
}

return ret;
--
2.4.3

2016-03-08 19:51:47

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 23/29] HMM: new callback for copying memory from and to device memory v2.

From: Jerome Glisse <[email protected]>

This patch only adds the new callbacks a device driver must implement
to copy memory from and to device memory.
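To make the contract more concrete, here is a heavily simplified sketch of
what a driver-side copy_from_device() might look like. foo_copy_from_device()
and foo_dma_copy_to_ram() are hypothetical and not part of this series, the
hmm_pte_clear_*() helpers are assumed to follow the existing hmm_pte bit
helper pattern from hmm_pt.h, and the write protection, device page table
update and device memory freeing steps required by the documentation below
are omitted:

static int foo_copy_from_device(struct hmm_mirror *mirror,
				const struct hmm_event *event,
				dma_addr_t *dst,
				unsigned long start,
				unsigned long end)
{
	unsigned long addr, i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		/* Holes in the range carry no valid destination entry. */
		if (!hmm_pte_test_valid_dma(&dst[i]) &&
		    !hmm_pte_test_valid_pfn(&dst[i]))
			continue;

		/* Hypothetical: DMA the device page backing addr into
		 * the system page described by dst[i]. */
		if (foo_dma_copy_to_ram(mirror, addr, &dst[i])) {
			/* Contract: clear the valid bit on failure. */
			hmm_pte_clear_valid_dma(&dst[i]);
			hmm_pte_clear_valid_pfn(&dst[i]);
			continue;
		}

		/* Be conservative about dirtiness. */
		hmm_pte_set_dirty(&dst[i]);
	}
	return 0;
}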

Changed since v1:
- Pass down the vma to the copy function.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
include/linux/hmm.h | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/hmm.c | 2 +
2 files changed, 107 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 7c66513..9fbfc07 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -65,6 +65,8 @@ enum hmm_etype {
HMM_DEVICE_RFAULT,
HMM_DEVICE_WFAULT,
HMM_WRITE_PROTECT,
+ HMM_COPY_FROM_DEVICE,
+ HMM_COPY_TO_DEVICE,
};

/* struct hmm_event - memory event information.
@@ -170,6 +172,109 @@ struct hmm_device_ops {
*/
int (*update)(struct hmm_mirror *mirror,
struct hmm_event *event);
+
+ /* copy_from_device() - copy from device memory to system memory.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @event: The event that triggered the copy.
+ * @dst: Array containing hmm_pte of destination memory.
+ * @start: Start address of the range (sub-range of event) to copy.
+ * @end: End address of the range (sub-range of event) to copy.
+ * Returns: 0 on success, error code otherwise {-ENOMEM, -EIO}.
+ *
+ * Called when migrating memory from device memory to system memory.
+ * The dst array contains the valid DMA addresses, for the device, of
+ * the pages to copy to (or the pfn of the page if hmm_device.device
+ * == NULL).
+ *
+ * If event.etype == HMM_FORK then the device driver only needs to
+ * schedule a copy to the system pages given in the dst hmm_pte array.
+ * Do not update the device page, and do not pause/stop the device
+ * threads that are using this address space. Just copy memory.
+ *
+ * If event.etype == HMM_COPY_FROM_DEVICE then the device driver must
+ * first write protect the range, then schedule the copy, then update
+ * its page table to use the new system memory given by the dst array.
+ * Some devices can perform all this in an atomic fashion from the
+ * device point of view. The device driver must also free the device
+ * memory once the copy is done.
+ *
+ * The device driver must not fail lightly; any failure results in the
+ * device process being killed and the CPU page table being set to
+ * HWPOISON entries.
+ *
+ * Note that the device driver must clear the valid bit of any dst
+ * entry it failed to copy.
+ *
+ * On failure the mirror will be killed by HMM, which will do an
+ * HMM_MUNMAP invalidation of all the memory; when this happens the
+ * device driver can free the device memory.
+ *
+ * Note also that there can be holes in the range being copied, ie some
+ * entries of the dst array will not have the valid bit set; the device
+ * driver must simply ignore non valid entries.
+ *
+ * Finally the device driver must set the dirty bit for each page that
+ * was modified since it was copied into device memory. This must be
+ * conservative, ie if the device can not determine that with certainty
+ * then it must set the dirty bit unconditionally.
+ *
+ * Returns 0 on success, an error value otherwise:
+ * -ENOMEM Not enough memory for performing the operation.
+ * -EIO Some input/output error with the device.
+ *
+ * All other return values trigger a warning and are transformed to -EIO.
+ */
+ int (*copy_from_device)(struct hmm_mirror *mirror,
+ const struct hmm_event *event,
+ dma_addr_t *dst,
+ unsigned long start,
+ unsigned long end);
+
+ /* copy_to_device() - copy to device memory from system memory.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @event: The event that triggered the copy.
+ * @vma: The vma corresponding to the range.
+ * @dst: Array containing hmm_pte of destination memory.
+ * @start: Start address of the range (sub-range of event) to copy.
+ * @end: End address of the range (sub-range of event) to copy.
+ * Returns: 0 on success, error code otherwise {-ENOMEM, -EIO}.
+ *
+ * Called when migrating memory from system memory to device memory.
+ * The dst array is empty, all of its entries are equal to zero. The
+ * device driver must allocate the device memory and populate each
+ * entry using hmm_pte_from_device_pfn(); only the valid device bit and
+ * hardware specific bits will be preserved (write and dirty will be
+ * taken from the original entry inside the mirror page table). It is
+ * advised to set the device pfn to match the physical address of the
+ * device memory being used. The event.etype will be equal to
+ * HMM_COPY_TO_DEVICE.
+ *
+ * A device driver that can atomically copy a page and update its page
+ * table entry to point to the device memory can do so. Partial
+ * failure is allowed; entries that have not been migrated must have
+ * the HMM_PTE_VALID_DEV bit clear inside the dst array. HMM will
+ * update the CPU page table of failed entries to point back to the
+ * system page.
+ *
+ * Note that the device driver is responsible for allocating and
+ * freeing the device memory and for properly updating the dst array
+ * entries with the allocated device memory.
+ *
+ * Returns 0 on success, an error value otherwise:
+ * -ENOMEM Not enough memory for performing the operation.
+ * -EIO Some input/output error with the device.
+ *
+ * All other return values trigger a warning and are transformed to
+ * -EIO. An error means that the migration is aborted. So in case of
+ * partial failure, if the device does not want to fully abort it must
+ * return 0. The device driver can update its device page table only if
+ * it knows it will not return failure.
+ */
+ int (*copy_to_device)(struct hmm_mirror *mirror,
+ const struct hmm_event *event,
+ struct vm_area_struct *vma,
+ dma_addr_t *dst,
+ unsigned long start,
+ unsigned long end);
};


diff --git a/mm/hmm.c b/mm/hmm.c
index 9455443..d26abe4 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -78,6 +78,8 @@ static inline int hmm_event_init(struct hmm_event *event,
switch (etype) {
case HMM_DEVICE_RFAULT:
case HMM_DEVICE_WFAULT:
+ case HMM_COPY_TO_DEVICE:
+ case HMM_COPY_FROM_DEVICE:
break;
case HMM_FORK:
case HMM_WRITE_PROTECT:
--
2.4.3

2016-03-08 19:52:33

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 18/29] HMM: add new HMM page table flag (valid device memory).

For memory migrated to a device we need a new type of memory entry.
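A short illustration of how the new bit is meant to be combined with the
existing helpers (dev_addr and writable are made up for the example):

	dma_addr_t pte;

	/* Encode where the page now lives in device memory. */
	pte = hmm_pte_from_dev_addr(dev_addr);
	if (writable)
		hmm_pte_set_write(&pte);

	/* Later, entries of this kind are easy to tell apart. */
	if (hmm_pte_test_valid_dev(&pte))
		dev_addr = hmm_pte_dev_addr(pte);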

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
include/linux/hmm_pt.h | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 8a59a75..b017aa7 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -74,10 +74,11 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
* In the first case the device driver must ignore any pfn entry as they might
* show as transient state while HMM is mapping the page.
*/
-#define HMM_PTE_VALID_DMA_BIT 0
-#define HMM_PTE_VALID_PFN_BIT 1
-#define HMM_PTE_WRITE_BIT 2
-#define HMM_PTE_DIRTY_BIT 3
+#define HMM_PTE_VALID_DEV_BIT 0
+#define HMM_PTE_VALID_DMA_BIT 1
+#define HMM_PTE_VALID_PFN_BIT 2
+#define HMM_PTE_WRITE_BIT 3
+#define HMM_PTE_DIRTY_BIT 4
/*
* Reserve some bits for device driver private flags. Note that thus can only
* be manipulated using the hmm_pte_*_bit() sets of helpers.
@@ -85,7 +86,7 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
* WARNING ONLY SET/CLEAR THOSE FLAG ON PTE ENTRY THAT HAVE THE VALID BIT SET
* AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
*/
-#define HMM_PTE_HW_SHIFT 4
+#define HMM_PTE_HW_SHIFT 8

#define HMM_PTE_PFN_MASK (~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
#define HMM_PTE_DMA_MASK (~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
@@ -166,6 +167,7 @@ static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
HMM_PTE_TEST_AND_SET_BIT(name, bit)

+HMM_PTE_BIT_HELPER(valid_dev, HMM_PTE_VALID_DEV_BIT)
HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
@@ -176,11 +178,23 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
}

+static inline dma_addr_t hmm_pte_from_dev_addr(dma_addr_t dma_addr)
+{
+ return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DEV_BIT);
+}
+
static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
{
return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
}

+static inline dma_addr_t hmm_pte_dev_addr(dma_addr_t pte)
+{
+ /* FIXME Use max dma addr instead of 0 ? */
+ return hmm_pte_test_valid_dev(&pte) ? (pte & HMM_PTE_DMA_MASK) :
+ (dma_addr_t)-1UL;
+}
+
static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
{
/* FIXME Use max dma addr instead of 0 ? */
--
2.4.3

2016-03-08 19:53:34

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 14/29] HMM: Add support for hugetlb.

Support hugetlb vmas almost like other vmas. The exception being that
we will not support migration of hugetlb memory.

Signed-off-by: Jérôme Glisse <[email protected]>
---
mm/hmm.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 7cab6cb..ad44325 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -813,6 +813,65 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
return ret;
}

+static int hmm_mirror_fault_hugetlb_entry(pte_t *ptep,
+ unsigned long hmask,
+ unsigned long addr,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+ struct hmm_mirror_fault *mirror_fault = walk->private;
+ struct hmm_event *event = mirror_fault->event;
+ struct hmm_pt_iter *iter = mirror_fault->iter;
+ bool write = (event->etype == HMM_DEVICE_WFAULT);
+ unsigned long pfn, next;
+ dma_addr_t *hmm_pte;
+ pte_t pte;
+
+ /*
+ * Hugepages under a user process are always in RAM and never
+ * swapped out, but theoretically this needs to be checked.
+ */
+ if (!ptep)
+ return -ENOENT;
+
+ pte = huge_ptep_get(ptep);
+ pfn = pte_pfn(pte);
+ if (huge_pte_none(pte) || (write && !huge_pte_write(pte)))
+ return -ENOENT;
+
+ hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+ if (!hmm_pte)
+ return -ENOMEM;
+ hmm_pt_iter_directory_lock(iter);
+ for (; addr != end; addr += PAGE_SIZE, ++pfn, ++hmm_pte) {
+ /* Switch to another HMM page table directory ? */
+ if (addr == next) {
+ hmm_pt_iter_directory_unlock(iter);
+ hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+ if (!hmm_pte)
+ return -ENOMEM;
+ hmm_pt_iter_directory_lock(iter);
+ }
+
+ if (hmm_pte_test_valid_dma(hmm_pte))
+ continue;
+
+ if (!hmm_pte_test_valid_pfn(hmm_pte)) {
+ *hmm_pte = hmm_pte_from_pfn(pfn);
+ hmm_pt_iter_directory_ref(iter);
+ }
+ BUG_ON(hmm_pte_pfn(*hmm_pte) != pfn);
+ if (write)
+ hmm_pte_set_write(hmm_pte);
+ }
+ hmm_pt_iter_directory_unlock(iter);
+#else
+ BUG();
+#endif
+ return 0;
+}
+
static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
struct hmm_pt_iter *iter,
unsigned long start,
@@ -920,6 +979,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
walk.mm = mirror->hmm->mm;
walk.private = &mirror_fault;
walk.pmd_entry = hmm_mirror_fault_pmd;
+ walk.hugetlb_entry = hmm_mirror_fault_hugetlb_entry;
walk.pte_hole = hmm_pte_hole;
ret = walk_page_range(addr, event->end, &walk);
if (ret)
@@ -1006,7 +1066,7 @@ retry:
goto out;
}
event->end = min(event->end, vma->vm_end) & PAGE_MASK;
- if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+ if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))) {
ret = -EFAULT;
goto out;
}
--
2.4.3

2016-03-08 19:53:54

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 13/29] HMM: DMA map memory on behalf of device driver v2.

Do the DMA mapping on behalf of the device as HMM is a good place
to perform this common task. Moreover in the future we hope to
add new infrastructure that would make DMA mapping more efficient
(lower overhead per page) by leveraging the HMM data structures.
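In short, a mirror page table entry simply switches representation once it is
mapped; the core of it, condensed from the patch below, is:

	/* Before: the entry holds a pfn selected for the device. */
	page = pfn_to_page(hmm_pte_pfn(pte));
	dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, dma_addr))
		return -ENOMEM;

	/* After: the entry holds the bus address, flags carried over. */
	hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
	if (hmm_pte_test_write(&pte))
		hmm_pte_set_write(&hmm_pte[i]);
	if (hmm_pte_test_dirty(&pte))
		hmm_pte_set_dirty(&hmm_pte[i]);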

Changed since v1:
- Adapt to HMM page table changes.

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/hmm_pt.h | 11 +++
mm/hmm.c | 202 +++++++++++++++++++++++++++++++++++++++----------
2 files changed, 174 insertions(+), 39 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 4a8beb1..8a59a75 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -176,6 +176,17 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
}

+static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
+{
+ return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
+}
+
+static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
+{
+ /* FIXME Use max dma addr instead of 0 ? */
+ return hmm_pte_test_valid_dma(&pte) ? (pte & HMM_PTE_DMA_MASK) : 0;
+}
+
static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
{
return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index dc37e49..7cab6cb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -41,6 +41,7 @@
#include <linux/mman.h>
#include <linux/delay.h>
#include <linux/workqueue.h>
+#include <linux/dma-mapping.h>

#include "internal.h"

@@ -577,6 +578,46 @@ static inline int hmm_mirror_update(struct hmm_mirror *mirror,
return ret;
}

+static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct hmm_pt_iter *iter,
+ struct mm_pt_iter *mm_iter,
+ struct page *page,
+ dma_addr_t *hmm_pte,
+ unsigned long addr)
+{
+ bool dirty = hmm_pte_test_and_clear_dirty(hmm_pte);
+
+ if (hmm_pte_test_valid_pfn(hmm_pte)) {
+ *hmm_pte &= event->pte_mask;
+ if (!hmm_pte_test_valid_pfn(hmm_pte))
+ hmm_pt_iter_directory_unref(iter);
+ goto out;
+ }
+
+ if (!hmm_pte_test_valid_dma(hmm_pte))
+ return;
+
+ if (!hmm_pte_test_valid_dma(&event->pte_mask)) {
+ struct device *dev = mirror->device->dev;
+ dma_addr_t dma_addr;
+
+ dma_addr = hmm_pte_dma_addr(*hmm_pte);
+ dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ }
+
+ *hmm_pte &= event->pte_mask;
+ if (!hmm_pte_test_valid_dma(hmm_pte))
+ hmm_pt_iter_directory_unref(iter);
+
+out:
+ if (dirty) {
+ page = page ? : mm_pt_iter_page(mm_iter, addr);
+ if (page)
+ set_page_dirty(page);
+ }
+}
+
static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
struct hmm_event *event,
struct page *page)
@@ -603,19 +644,9 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
*/
hmm_pt_iter_directory_lock(&iter);
do {
- if (!hmm_pte_test_valid_pfn(hmm_pte))
- continue;
- if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
- hmm_pte_test_write(hmm_pte)) {
- page = page ? : mm_pt_iter_page(&mm_iter, addr);
- if (page)
- set_page_dirty(page);
- page = NULL;
- }
- *hmm_pte &= event->pte_mask;
- if (hmm_pte_test_valid_pfn(hmm_pte))
- continue;
- hmm_pt_iter_directory_unref(&iter);
+ hmm_mirror_update_pte(mirror, event, &iter, &mm_iter,
+ page, hmm_pte, addr);
+ page = NULL;
} while (addr += PAGE_SIZE, hmm_pte++, addr != next);
hmm_pt_iter_directory_unlock(&iter);
}
@@ -687,6 +718,9 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
*/
hmm_pt_iter_directory_lock(iter);
do {
+ if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+ continue;
+
if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
hmm_pte[i] = hmm_pte_from_pfn(pfn);
hmm_pt_iter_directory_ref(iter);
@@ -760,6 +794,9 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
break;
}

+ if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+ continue;
+
if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
hmm_pt_iter_directory_ref(iter);
@@ -776,6 +813,80 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
return ret;
}

+static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
+ struct hmm_pt_iter *iter,
+ unsigned long start,
+ unsigned long end)
+{
+ struct device *dev = mirror->device->dev;
+ unsigned long addr;
+ int ret;
+
+ for (ret = 0, addr = start; !ret && addr < end;) {
+ unsigned long i = 0, next = end;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+ if (!hmm_pte)
+ return -ENOENT;
+
+ do {
+ dma_addr_t dma_addr, pte;
+ struct page *page;
+
+again:
+ pte = ACCESS_ONCE(hmm_pte[i]);
+ if (!hmm_pte_test_valid_pfn(&pte)) {
+ if (!hmm_pte_test_valid_dma(&pte)) {
+ ret = -ENOENT;
+ break;
+ }
+ continue;
+ }
+
+ page = pfn_to_page(hmm_pte_pfn(pte));
+ VM_BUG_ON(!page);
+ dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
+ if (dma_mapping_error(dev, dma_addr)) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ hmm_pt_iter_directory_lock(iter);
+ /*
+ * Make sure we transfer the dirty bit. Note that there
+ * might still be a window for another thread to set
+ * the dirty bit before we check for pte equality. This
+ * will just lead to a useless retry so it is not the
+ * end of the world here.
+ */
+ if (hmm_pte_test_dirty(&hmm_pte[i]))
+ hmm_pte_set_dirty(&pte);
+ if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+ hmm_pt_iter_directory_unlock(iter);
+ dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
+ if (hmm_pte_test_valid_pfn(&pte))
+ goto again;
+ if (!hmm_pte_test_valid_dma(&pte)) {
+ ret = -ENOENT;
+ break;
+ }
+ } else {
+ hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+ if (hmm_pte_test_write(&pte))
+ hmm_pte_set_write(&hmm_pte[i]);
+ if (hmm_pte_test_dirty(&pte))
+ hmm_pte_set_dirty(&hmm_pte[i]);
+ hmm_pt_iter_directory_unlock(iter);
+ }
+ } while (addr += PAGE_SIZE, i++, addr != next && !ret);
+ }
+
+ return ret;
+}
+
static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
struct hmm_event *event,
struct vm_area_struct *vma,
@@ -784,7 +895,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
struct hmm_mirror_fault mirror_fault;
unsigned long addr = event->start;
struct mm_walk walk = {0};
- int ret = 0;
+ int ret;

if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
return -EACCES;
@@ -793,33 +904,45 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
if (ret)
return ret;

-again:
- if (event->backoff) {
- ret = -EAGAIN;
- goto out;
- }
- if (addr >= event->end)
- goto out;
+ do {
+ if (event->backoff) {
+ ret = -EAGAIN;
+ break;
+ }
+ if (addr >= event->end)
+ break;
+
+ mirror_fault.event = event;
+ mirror_fault.mirror = mirror;
+ mirror_fault.vma = vma;
+ mirror_fault.addr = addr;
+ mirror_fault.iter = iter;
+ walk.mm = mirror->hmm->mm;
+ walk.private = &mirror_fault;
+ walk.pmd_entry = hmm_mirror_fault_pmd;
+ walk.pte_hole = hmm_pte_hole;
+ ret = walk_page_range(addr, event->end, &walk);
+ if (ret)
+ break;
+
+ if (event->backoff) {
+ ret = -EAGAIN;
+ break;
+ }

- mirror_fault.event = event;
- mirror_fault.mirror = mirror;
- mirror_fault.vma = vma;
- mirror_fault.addr = addr;
- mirror_fault.iter = iter;
- walk.mm = mirror->hmm->mm;
- walk.private = &mirror_fault;
- walk.pmd_entry = hmm_mirror_fault_pmd;
- walk.pte_hole = hmm_pte_hole;
- ret = walk_page_range(addr, event->end, &walk);
- if (!ret) {
- ret = mirror->device->ops->update(mirror, event);
- if (!ret) {
- addr = mirror_fault.addr;
- goto again;
+ if (mirror->device->dev) {
+ ret = hmm_mirror_dma_map(mirror, iter,
+ addr, event->end);
+ if (ret)
+ break;
}
- }

-out:
+ ret = mirror->device->ops->update(mirror, event);
+ if (ret)
+ break;
+ addr = mirror_fault.addr;
+ } while (1);
+
hmm_device_fault_end(mirror->hmm, event);
if (ret == -ENOENT) {
ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
@@ -973,7 +1096,8 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,

hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
- if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+ if ((!hmm_pte_test_valid_pfn(hmm_pte) &&
+ !hmm_pte_test_valid_dma(hmm_pte)) ||
!hmm_pte_test_write(hmm_pte))
continue;
hmm_pte_set_dirty(hmm_pte);
--
2.4.3

2016-03-08 19:54:02

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 12/29] HMM: add dirty range helper (toggle dirty bit inside mirror page table) v2.

The device driver must properly toggle the dirty bit inside the mirror page
table so dirtiness is properly accounted for when core mm code needs to know.
Provide a simple helper to toggle that bit for a range of addresses.
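Typical driver usage is a one-liner once the device reports a range as
written, for instance when tearing down a mapping (sketch):

	/* The device wrote to [start, end) since the last sync. */
	hmm_mirror_range_dirty(mirror, start, end);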

Changed since v1:
- Adapt to HMM page table changes.

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/hmm.h | 3 +++
mm/hmm.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 10e1558..4bc132a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -268,6 +268,9 @@ int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
void hmm_mirror_range_discard(struct hmm_mirror *mirror,
unsigned long start,
unsigned long end);
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+ unsigned long start,
+ unsigned long end);


#endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 548f0c5..dc37e49 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -945,6 +945,44 @@ void hmm_mirror_range_discard(struct hmm_mirror *mirror,
}
EXPORT_SYMBOL(hmm_mirror_range_discard);

+/* hmm_mirror_range_dirty() - toggle dirty bit for a range of addresses.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to dirty (inclusive).
+ * @end: End address of the range to dirty (exclusive).
+ *
+ * Called when the device driver wants to toggle the dirty bit for a range of
+ * addresses. Useful when the device driver just wants to toggle the bit for a
+ * whole range without walking the mirror page table itself.
+ *
+ * Note this function does not directly dirty the page behind an address, but
+ * this will happen once the address is invalidated or discarded by the device
+ * driver or core mm code.
+ */
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm_pt_iter iter;
+ unsigned long addr;
+
+ hmm_pt_iter_init(&iter, &mirror->pt);
+ for (addr = start; addr != end;) {
+ unsigned long next = end;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+ for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
+ if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+ !hmm_pte_test_write(hmm_pte))
+ continue;
+ hmm_pte_set_dirty(hmm_pte);
+ }
+ }
+ hmm_pt_iter_fini(&iter);
+}
+EXPORT_SYMBOL(hmm_mirror_range_dirty);
+
/* hmm_mirror_register() - register mirror against current process for a device.
*
* @mirror: The mirror struct being registered.
--
2.4.3

2016-03-08 19:54:29

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH v12 08/29] HMM: add device page fault support v6.

This patch adds helpers for device page faults. These helpers fill the
mirror page table using the CPU page table while synchronizing with any
update to the CPU page table.
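From the device driver side, servicing a device page fault boils down to the
sketch below; how the driver derives the faulting range and access type from
its hardware is device specific, and the exact event initialization may
differ:

	struct hmm_event event;

	event.start = fault_addr & PAGE_MASK;
	event.end = event.start + PAGE_SIZE;
	event.etype = is_write ? HMM_DEVICE_WFAULT : HMM_DEVICE_RFAULT;
	event.backoff = false;

	ret = hmm_mirror_fault(mirror, &event);
	/* On success the mirror page table now covers the faulting range. */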

Changed since v1:
- Add comment about directory lock.

Changed since v2:
- Check for mirror->hmm in hmm_mirror_fault()

Changed since v3:
- Adapt to HMM page table changes.

Changed since v4:
- Fix PROT_NONE, ie do not populate from protnone pte.
- Fix huge pmd handling (start address may != pmd start address)
- Fix missing entry case.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
include/linux/hmm.h | 15 ++
mm/hmm.c | 386 +++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 400 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5488fa9..d819ec9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -85,6 +85,12 @@ struct hmm_event {
bool backoff;
};

+static inline bool hmm_event_overlap(const struct hmm_event *a,
+ const struct hmm_event *b)
+{
+ return !((a->end <= b->start) || (a->start >= b->end));
+}
+

/* hmm_device - Each device must register one and only one hmm_device.
*
@@ -176,6 +182,10 @@ struct hmm_device_ops {
* @rwsem: Serialize the mirror list modifications.
* @mmu_notifier: The mmu_notifier of this mm.
* @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
+ * @lock: Serialize device_faults list modification.
*
* For each process address space (mm_struct) there is one and only one hmm
* struct. hmm functions will redispatch to each devices the change made to
@@ -192,6 +202,10 @@ struct hmm {
struct rw_semaphore rwsem;
struct mmu_notifier mmu_notifier;
struct rcu_head rcu;
+ struct list_head device_faults;
+ unsigned ndevice_faults;
+ wait_queue_head_t wait_queue;
+ spinlock_t lock;
};


@@ -250,6 +264,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror);
void hmm_mirror_unregister(struct hmm_mirror *mirror);
struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
void hmm_mirror_unref(struct hmm_mirror **mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);


#endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index c172a49..a9bdab5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -67,7 +67,7 @@ static inline int hmm_event_init(struct hmm_event *event,
enum hmm_etype etype)
{
event->start = start & PAGE_MASK;
- event->end = min(end, hmm->vm_end);
+ event->end = PAGE_ALIGN(min(end, hmm->vm_end));
if (event->start >= event->end)
return -EINVAL;
event->etype = etype;
@@ -103,6 +103,10 @@ static int hmm_init(struct hmm *hmm)
kref_init(&hmm->kref);
INIT_HLIST_HEAD(&hmm->mirrors);
init_rwsem(&hmm->rwsem);
+ INIT_LIST_HEAD(&hmm->device_faults);
+ hmm->ndevice_faults = 0;
+ init_waitqueue_head(&hmm->wait_queue);
+ spin_lock_init(&hmm->lock);

/* register notifier */
hmm->mmu_notifier.ops = &hmm_notifier_ops;
@@ -167,6 +171,58 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
return NULL;
}

+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+ int ret = 0;
+
+ mmu_notifier_range_wait_active(hmm->mm, event->start, event->end);
+
+ spin_lock(&hmm->lock);
+ if (mmu_notifier_range_inactive(hmm->mm, event->start, event->end)) {
+ list_add_tail(&event->list, &hmm->device_faults);
+ hmm->ndevice_faults++;
+ event->backoff = false;
+ } else
+ ret = -EAGAIN;
+ spin_unlock(&hmm->lock);
+
+ wake_up(&hmm->wait_queue);
+
+ return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+ spin_lock(&hmm->lock);
+ list_del_init(&event->list);
+ hmm->ndevice_faults--;
+ spin_unlock(&hmm->lock);
+
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+ struct hmm_event *fevent;
+ unsigned long wait_for = 0;
+
+again:
+ spin_lock(&hmm->lock);
+ list_for_each_entry(fevent, &hmm->device_faults, list) {
+ if (!hmm_event_overlap(fevent, ievent))
+ continue;
+ fevent->backoff = true;
+ wait_for = hmm->ndevice_faults;
+ }
+ spin_unlock(&hmm->lock);
+
+ if (wait_for > 0) {
+ wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+ wait_for = 0;
+ goto again;
+ }
+}
+
static void hmm_update(struct hmm *hmm, struct hmm_event *event)
{
struct hmm_mirror *mirror;
@@ -175,6 +231,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
if (hmm->mm->hmm != hmm)
return;

+ hmm_wait_device_fault(hmm, event);
+
again:
down_read(&hmm->rwsem);
hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
@@ -186,6 +244,33 @@ again:
goto again;
}
up_read(&hmm->rwsem);
+
+ wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ unsigned flags = FAULT_FLAG_ALLOW_RETRY;
+ struct mm_struct *mm = vma->vm_mm;
+ int r;
+
+ flags |= (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+ for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+ r = handle_mm_fault(mm, vma, addr, flags);
+ if (r & VM_FAULT_RETRY)
+ return -EBUSY;
+ if (r & VM_FAULT_ERROR) {
+ if (r & VM_FAULT_OOM)
+ return -ENOMEM;
+ /* Same error code for all other cases. */
+ return -EFAULT;
+ }
+ }
+ return 0;
}


@@ -228,6 +313,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
}
up_write(&hmm->rwsem);

+ wake_up(&hmm->wait_queue);
hmm_unref(hmm);
}

@@ -419,6 +505,304 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
hmm_pt_iter_fini(&iter);
}

+static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
+{
+ if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+ return true;
+ return false;
+}
+
+struct hmm_mirror_fault {
+ struct hmm_mirror *mirror;
+ struct hmm_event *event;
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ struct hmm_pt_iter *iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ struct hmm_pt_iter *iter,
+ pmd_t *pmdp,
+ struct hmm_mirror_fault *mirror_fault,
+ unsigned long start,
+ unsigned long end)
+{
+ struct page *page;
+ unsigned long addr, pfn;
+ unsigned flags = FOLL_TOUCH;
+ spinlock_t *ptl;
+ int ret;
+
+ ptl = pmd_lock(mirror->hmm->mm, pmdp);
+ if (unlikely(!pmd_trans_huge(*pmdp))) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+ flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+ page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+ pfn = page_to_pfn(page);
+ spin_unlock(ptl);
+
+ /* Just fault in the whole PMD. */
+ start &= PMD_MASK;
+ end = start + PMD_SIZE - 1;
+
+ if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+ return -ENOENT;
+
+ for (ret = 0, addr = start; !ret && addr < end;) {
+ unsigned long i, next = end;
+ dma_addr_t *hmm_pte;
+
+ hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+ if (!hmm_pte)
+ return -ENOMEM;
+
+ i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
+
+ /*
+ * The directory lock protect against concurrent clearing of
+ * page table bit flags. Exceptions being the dirty bit and
+ * the device driver private flags.
+ */
+ hmm_pt_iter_directory_lock(iter);
+ do {
+ if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(pfn);
+ hmm_pt_iter_directory_ref(iter);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+ if (pmd_write(*pmdp))
+ hmm_pte_set_write(&hmm_pte[i]);
+ } while (addr += PAGE_SIZE, pfn++, i++, addr != next);
+ hmm_pt_iter_directory_unlock(iter);
+ mirror_fault->addr = addr;
+ }
+
+ return 0;
+}
+
+static int hmm_pte_hole(unsigned long addr,
+ unsigned long next,
+ struct mm_walk *walk)
+{
+ return -ENOENT;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct hmm_mirror_fault *mirror_fault = walk->private;
+ struct hmm_mirror *mirror = mirror_fault->mirror;
+ struct hmm_event *event = mirror_fault->event;
+ struct hmm_pt_iter *iter = mirror_fault->iter;
+ bool write = (event->etype == HMM_DEVICE_WFAULT);
+ unsigned long addr;
+ int ret = 0;
+
+ /* Make sure there was no gap. */
+ if (start != mirror_fault->addr)
+ return -ENOENT;
+
+ if (event->backoff)
+ return -EAGAIN;
+
+ if (pmd_none(*pmdp))
+ return -ENOENT;
+
+ if (pmd_trans_huge(*pmdp))
+ return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+ iter, pmdp, mirror_fault, start,
+ end);
+
+ if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+ return -EFAULT;
+
+ for (ret = 0, addr = start; !ret && addr < end;) {
+ unsigned long i = 0, next = end;
+ dma_addr_t *hmm_pte;
+ pte_t *ptep;
+
+ hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+ if (!hmm_pte)
+ return -ENOMEM;
+
+ ptep = pte_offset_map(pmdp, start);
+ hmm_pt_iter_directory_lock(iter);
+ do {
+ if (!pte_present(*ptep) ||
+ (write && !pte_write(*ptep)) ||
+ pte_protnone(*ptep)) {
+ ret = -ENOENT;
+ ptep++;
+ break;
+ }
+
+ if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+ hmm_pt_iter_directory_ref(iter);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+ if (pte_write(*ptep))
+ hmm_pte_set_write(&hmm_pte[i]);
+ } while (addr += PAGE_SIZE, ptep++, i++, addr != next);
+ hmm_pt_iter_directory_unlock(iter);
+ pte_unmap(ptep - 1);
+ mirror_fault->addr = addr;
+ }
+
+ return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ struct hmm_pt_iter *iter)
+{
+ struct hmm_mirror_fault mirror_fault;
+ unsigned long addr = event->start;
+ struct mm_walk walk = {0};
+ int ret = 0;
+
+ if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+ return -EACCES;
+
+ ret = hmm_device_fault_start(mirror->hmm, event);
+ if (ret)
+ return ret;
+
+again:
+ if (event->backoff) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ if (addr >= event->end)
+ goto out;
+
+ mirror_fault.event = event;
+ mirror_fault.mirror = mirror;
+ mirror_fault.vma = vma;
+ mirror_fault.addr = addr;
+ mirror_fault.iter = iter;
+ walk.mm = mirror->hmm->mm;
+ walk.private = &mirror_fault;
+ walk.pmd_entry = hmm_mirror_fault_pmd;
+ walk.pte_hole = hmm_pte_hole;
+ ret = walk_page_range(addr, event->end, &walk);
+ if (!ret) {
+ ret = mirror->device->ops->update(mirror, event);
+ if (!ret) {
+ addr = mirror_fault.addr;
+ goto again;
+ }
+ }
+
+out:
+ hmm_device_fault_end(mirror->hmm, event);
+ if (ret == -ENOENT) {
+ ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+ ret = ret ? ret : -EAGAIN;
+ }
+ return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+ struct vm_area_struct *vma;
+ struct hmm_pt_iter iter;
+ int ret = 0;
+
+ mirror = hmm_mirror_ref(mirror);
+ if (!mirror)
+ return -ENODEV;
+ if (event->start >= mirror->hmm->vm_end) {
+ hmm_mirror_unref(&mirror);
+ return -EINVAL;
+ }
+ if (hmm_event_init(event, mirror->hmm, event->start,
+ event->end, event->etype)) {
+ hmm_mirror_unref(&mirror);
+ return -EINVAL;
+ }
+ hmm_pt_iter_init(&iter, &mirror->pt);
+
+retry:
+ if (hmm_mirror_is_dead(mirror)) {
+ hmm_mirror_unref(&mirror);
+ return -ENODEV;
+ }
+
+ /*
+ * Synchronization with the CPU page table is the most important and
+ * tedious aspect of device page faults. There must be a strong
+ * ordering between calls to device->update() for a device page fault
+ * and device->update() for a CPU page table invalidation/update.
+ *
+ * Pages that are exposed to the device driver must stay valid while
+ * the callback is in progress, ie any CPU page table invalidation
+ * that renders those pages obsolete must call device->update() after
+ * the device->update() call that faulted those pages in.
+ *
+ * To achieve this we rely on a few things. First, mmap_sem ensures
+ * that any munmap() syscall serializes with us. The remaining issues
+ * are unmap_mapping_range() and page migration or merging. For those,
+ * hmm keeps track of the affected address ranges and blocks device
+ * page faults that hit an overlapping range.
+ */
+ down_read(&mirror->hmm->mm->mmap_sem);
+ vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+ if (!vma) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (vma->vm_start > event->start) {
+ event->end = vma->vm_start;
+ ret = -EFAULT;
+ goto out;
+ }
+ event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+ if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ switch (event->etype) {
+ case HMM_DEVICE_WFAULT:
+ if (!(vma->vm_flags & VM_WRITE)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ /* fallthrough */
+ case HMM_DEVICE_RFAULT:
+ /* Handle the PROT_NONE case early on. */
+ if (!(vma->vm_flags & (VM_WRITE | VM_READ))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ /* Drop the mmap_sem so anyone waiting on it have a chance. */
+ if (ret != -EBUSY)
+ up_read(&mirror->hmm->mm->mmap_sem);
+ wake_up(&mirror->hmm->wait_queue);
+ if (ret == -EAGAIN)
+ goto retry;
+ hmm_pt_iter_fini(&iter);
+ hmm_mirror_unref(&mirror);
+ return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
/* hmm_mirror_register() - register mirror against current process for a device.
*
* @mirror: The mirror struct being registered.
--
2.4.3

2016-03-21 11:28:01

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.

Jérôme Glisse <[email protected]> writes:

> [ text/plain ]
> To migrate memory back we first need to lock HMM special CPU page
> table entry so we know no one else might try to migrate those entry
> back. Helper also allocate new page where data will be copied back
> from the device. Then we can proceed with the device DMA operation.
>
> Once DMA is done we can update again the CPU page table to point to
> the new page that holds the content copied back from device memory.
>
> Note that we do not need to invalidate the range are we are only
> modifying non present CPU page table entry.
>
> Changed since v1:
> - Save memcg against which each page is precharge as it might
> change along the way.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> ---
> include/linux/mm.h | 12 +++
> mm/memory.c | 257 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 269 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c5c062e..1cd060f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2392,6 +2392,18 @@ static inline void hmm_mm_init(struct mm_struct *mm)
> {
> mm->hmm = NULL;
> }
> +
> +int mm_hmm_migrate_back(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + pte_t *new_pte,
> + unsigned long start,
> + unsigned long end);
> +void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + pte_t *new_pte,
> + dma_addr_t *hmm_pte,
> + unsigned long start,
> + unsigned long end);
> #else /* !CONFIG_HMM */
> static inline void hmm_mm_init(struct mm_struct *mm)
> {
> diff --git a/mm/memory.c b/mm/memory.c
> index 3cb3653..d917911a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3513,6 +3513,263 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> }
> EXPORT_SYMBOL_GPL(handle_mm_fault);
>
> +
> +#ifdef CONFIG_HMM
> +/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
> + *
> + * @mm: The mm struct.
> + * @vma: The vm area struct the range is in.
> + * @new_pte: Array of new CPU page table entry value.
> + * @start: Start address of the range (inclusive).
> + * @end: End address of the range (exclusive).
> + *
> + * This function will lock HMM page table entry and allocate new page for entry
> + * it successfully locked.
> + */


Can you add more comments around this ?

> +int mm_hmm_migrate_back(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + pte_t *new_pte,
> + unsigned long start,
> + unsigned long end)
> +{
> + pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
> + unsigned long addr, i;
> + int ret = 0;
> +
> + VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
> +
> + if (unlikely(anon_vma_prepare(vma)))
> + return -ENOMEM;
> +
> + start &= PAGE_MASK;
> + end = PAGE_ALIGN(end);
> + memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
> +
> + for (addr = start; addr < end;) {
> + unsigned long cstart, next;
> + spinlock_t *ptl;
> + pgd_t *pgdp;
> + pud_t *pudp;
> + pmd_t *pmdp;
> + pte_t *ptep;
> +
> + pgdp = pgd_offset(mm, addr);
> + pudp = pud_offset(pgdp, addr);
> + /*
> + * Some other thread might already have migrated back the entry
> + * and freed the page table. Unlikely thought.
> + */
> + if (unlikely(!pudp)) {
> + addr = min((addr + PUD_SIZE) & PUD_MASK, end);
> + continue;
> + }
> + pmdp = pmd_offset(pudp, addr);
> + if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
> + pmd_trans_huge(*pmdp))) {
> + addr = min((addr + PMD_SIZE) & PMD_MASK, end);
> + continue;
> + }
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
> + next = min((addr + PMD_SIZE) & PMD_MASK, end);
> + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> + swp_entry_t entry;
> +
> + entry = pte_to_swp_entry(*ptep);
> + if (pte_none(*ptep) || pte_present(*ptep) ||
> + !is_hmm_entry(entry) ||
> + is_hmm_entry_locked(entry))
> + continue;
> +
> + set_pte_at(mm, addr, ptep, hmm_entry);
> + new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> + vma->vm_page_prot));
> + }
> + pte_unmap_unlock(ptep - 1, ptl);


I guess this is fixing all the ptes in the cpu page table mapping a pmd
entry. But then what is below ?


> +
> + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> + addr < next; addr += PAGE_SIZE, i++) {

Your use of the variable addr, with multiple loops updating it, is also
making this complex. We should definitely add more comments here. I guess
we are going through the same range we iterated over above.

> + struct mem_cgroup *memcg;
> + struct page *page;
> +
> + if (!pte_present(new_pte[i]))
> + continue;

What is that checking for ?. We set that using pte_mkspecial above ?

> +
> + page = alloc_zeroed_user_highpage_movable(vma, addr);
> + if (!page) {
> + ret = -ENOMEM;
> + break;
> + }
> + __SetPageUptodate(page);
> + if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> + &memcg)) {
> + page_cache_release(page);
> + ret = -ENOMEM;
> + break;
> + }
> + /*
> + * We can safely reuse the s_mem/mapping field of page
> + * struct to store the memcg as the page is only seen
> + * by HMM at this point and we can clear it before it
> + * is public see mm_hmm_migrate_back_cleanup().
> + */
> + page->s_mem = memcg;
> + new_pte[i] = mk_pte(page, vma->vm_page_prot);
> + if (vma->vm_flags & VM_WRITE) {
> + new_pte[i] = pte_mkdirty(new_pte[i]);
> + new_pte[i] = pte_mkwrite(new_pte[i]);
> + }

Why mark it dirty if vm_flags is VM_WRITE ?

> + }
> +
> + if (!ret)
> + continue;
> +
> + hmm_entry = swp_entry_to_pte(make_hmm_entry());
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);


Again we loop through the same range ?

> + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> + unsigned long pfn = pte_pfn(new_pte[i]);
> +
> + if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
> + continue;


What is that checking for ?
> +
> + set_pte_at(mm, addr, ptep, hmm_entry);
> + pte_clear(mm, addr, &new_pte[i]);

what is that pte_clear for ?. Handling of new_pte needs more code comments.

> + }
> + pte_unmap_unlock(ptep - 1, ptl);
> + break;
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(mm_hmm_migrate_back);
> +


-aneesh

2016-03-21 12:03:19

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.

On Mon, Mar 21, 2016 at 04:57:32PM +0530, Aneesh Kumar K.V wrote:
> Jérôme Glisse <[email protected]> writes:

[...]

> > +
> > +#ifdef CONFIG_HMM
> > +/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
> > + *
> > + * @mm: The mm struct.
> > + * @vma: The vm area struct the range is in.
> > + * @new_pte: Array of new CPU page table entry value.
> > + * @start: Start address of the range (inclusive).
> > + * @end: End address of the range (exclusive).
> > + *
> > + * This function will lock HMM page table entry and allocate new page for entry
> > + * it successfully locked.
> > + */
>
>
> Can you add more comments around this ?

I should describe the process a bit more, I guess. It is multi-step: first we update
the CPU page table with the special HMM "lock" entry, to exclude concurrent migration
of the same page. Once we have "locked" the CPU page table entries we allocate
the proper number of pages. Then we schedule the DMA from the GPU to these pages, and
once it is done we update the CPU page table to point to them. This is why we go
over the page table so many times. This should answer most of your questions
below, but I still provide an answer for each of them.
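
A condensed sketch of that sequence (heavily simplified; pte locking,
error handling and the backoff path are omitted, and the DMA call is a
made-up placeholder, not part of the patch):

	/* 1) Under the pte lock: replace each candidate pte with the
	 *    special locked HMM entry and remember it in new_pte[]. */
	set_pte_at(mm, addr, ptep, swp_entry_to_pte(make_hmm_entry_locked()));

	/* 2) With no spinlock held: allocate a destination page for
	 *    each entry recorded in new_pte[]. */
	page = alloc_zeroed_user_highpage_movable(vma, addr);
	new_pte[i] = mk_pte(page, vma->vm_page_prot);

	/* 3) Driver copies the data back from device memory into those
	 *    pages (device_dma_from_device() is a hypothetical name). */
	device_dma_from_device(mirror, new_pte, start, end);

	/* 4) Under the pte lock again: replace the locked HMM entries
	 *    with new_pte[] so the CPU sees the migrated data. */
	set_pte_at(mm, addr, ptep, new_pte[i]);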

>
> > +int mm_hmm_migrate_back(struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > + pte_t *new_pte,
> > + unsigned long start,
> > + unsigned long end)
> > +{
> > + pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
> > + unsigned long addr, i;
> > + int ret = 0;
> > +
> > + VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
> > +
> > + if (unlikely(anon_vma_prepare(vma)))
> > + return -ENOMEM;
> > +
> > + start &= PAGE_MASK;
> > + end = PAGE_ALIGN(end);
> > + memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
> > +
> > + for (addr = start; addr < end;) {
> > + unsigned long cstart, next;
> > + spinlock_t *ptl;
> > + pgd_t *pgdp;
> > + pud_t *pudp;
> > + pmd_t *pmdp;
> > + pte_t *ptep;
> > +
> > + pgdp = pgd_offset(mm, addr);
> > + pudp = pud_offset(pgdp, addr);
> > + /*
> > + * Some other thread might already have migrated back the entry
> > + * and freed the page table. Unlikely thought.
> > + */
> > + if (unlikely(!pudp)) {
> > + addr = min((addr + PUD_SIZE) & PUD_MASK, end);
> > + continue;
> > + }
> > + pmdp = pmd_offset(pudp, addr);
> > + if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
> > + pmd_trans_huge(*pmdp))) {
> > + addr = min((addr + PMD_SIZE) & PMD_MASK, end);
> > + continue;
> > + }
> > + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > + for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
> > + next = min((addr + PMD_SIZE) & PMD_MASK, end);
> > + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> > + swp_entry_t entry;
> > +
> > + entry = pte_to_swp_entry(*ptep);
> > + if (pte_none(*ptep) || pte_present(*ptep) ||
> > + !is_hmm_entry(entry) ||
> > + is_hmm_entry_locked(entry))
> > + continue;
> > +
> > + set_pte_at(mm, addr, ptep, hmm_entry);
> > + new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> > + vma->vm_page_prot));
> > + }
> > + pte_unmap_unlock(ptep - 1, ptl);
>
>
> I guess this is fixing all the ptes in the cpu page table mapping a pmd
> entry. But then what is below ?

Because we are dealing with special swap entries we know we cannot have huge pages,
so we only care about HMM special swap entries. We record the entries we want to
migrate in the new_pte array. The loop above runs under the pmd spinlock; the loop
below does memory allocation, and we do not want to hold any spinlock while allocating.
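
Structurally it looks roughly like this (shape only, all details elided):

	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	/* pass 1: mark matching HMM entries as locked, record them in
	 * new_pte[]; no allocation here, we hold the pte spinlock */
	pte_unmap_unlock(ptep - 1, ptl);

	/* pass 2: allocate and charge a page for each recorded entry;
	 * this may sleep, hence no spinlock can be held */
	page = alloc_zeroed_user_highpage_movable(vma, addr);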

>
> > +
> > + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> > + addr < next; addr += PAGE_SIZE, i++) {
>
> Your use of vairable addr with multiple loops updating then is also
> making it complex. We should definitely add more comments here. I guess
> we are going through the same range we iterated above here.

Correct, we are going over the exact same range; I am keeping addr around only
for the sake of alloc_zeroed_user_highpage_movable().

>
> > + struct mem_cgroup *memcg;
> > + struct page *page;
> > +
> > + if (!pte_present(new_pte[i]))
> > + continue;
>
> What is that checking for ?. We set that using pte_mkspecial above ?

Not all entries in the range match the criteria (ie a special unlocked HMM swap
entry). We want to allocate pages only for the entries that do.

>
> > +
> > + page = alloc_zeroed_user_highpage_movable(vma, addr);
> > + if (!page) {
> > + ret = -ENOMEM;
> > + break;
> > + }
> > + __SetPageUptodate(page);
> > + if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> > + &memcg)) {
> > + page_cache_release(page);
> > + ret = -ENOMEM;
> > + break;
> > + }
> > + /*
> > + * We can safely reuse the s_mem/mapping field of page
> > + * struct to store the memcg as the page is only seen
> > + * by HMM at this point and we can clear it before it
> > + * is public see mm_hmm_migrate_back_cleanup().
> > + */
> > + page->s_mem = memcg;
> > + new_pte[i] = mk_pte(page, vma->vm_page_prot);
> > + if (vma->vm_flags & VM_WRITE) {
> > + new_pte[i] = pte_mkdirty(new_pte[i]);
> > + new_pte[i] = pte_mkwrite(new_pte[i]);
> > + }
>
> Why mark it dirty if vm_flags is VM_WRITE ?

It is a leftover from some debugging I was doing; I missed it.

>
> > + }
> > +
> > + if (!ret)
> > + continue;
> > +
> > + hmm_entry = swp_entry_to_pte(make_hmm_entry());
> > + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>
>
> Again we loop through the same range ?

Yes, but this is the out-of-memory code path, ie we have to split the migration
into several passes. What happens here is that we clear the new_pte array for the
entries we failed to allocate a page for.

>
> > + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> > + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> > + unsigned long pfn = pte_pfn(new_pte[i]);
> > +
> > + if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
> > + continue;
>
>
> What is that checking for ?

If the new_pte entry is not present then it is not something we want to migrate. If it
is present but does not point to the zero pfn then it is an entry for which we allocated
a page, so we want to keep it.

> > +
> > + set_pte_at(mm, addr, ptep, hmm_entry);
> > + pte_clear(mm, addr, &new_pte[i]);
>
> what is that pte_clear for ?. Handling of new_pte needs more code comments.
>

For entries for which we failed to allocate memory, we clear the special HMM swap
entry as well as the new_pte entry, so that the migration code knows it does not
have to do anything there.

Hope this clarifies the code.

Cheers,
Jérôme

2016-03-21 13:49:49

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.

Jerome Glisse <[email protected]> writes:

> [ text/plain ]
> On Mon, Mar 21, 2016 at 04:57:32PM +0530, Aneesh Kumar K.V wrote:
>> Jérôme Glisse <[email protected]> writes:
>
> [...]
>
>> > +
>> > +#ifdef CONFIG_HMM
>> > +/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
>> > + *
>> > + * @mm: The mm struct.
>> > + * @vma: The vm area struct the range is in.
>> > + * @new_pte: Array of new CPU page table entry value.
>> > + * @start: Start address of the range (inclusive).
>> > + * @end: End address of the range (exclusive).
>> > + *
>> > + * This function will lock HMM page table entry and allocate new page for entry
>> > + * it successfully locked.
>> > + */
>>
>>
>> Can you add more comments around this ?
>
> I should describe the process a bit more i guess. It is multi-step, first we update
> CPU page table with special HMM "lock" entry, this is to exclude concurrent migration
> happening on same page. Once we have "locked" the CPU page table entry we allocate
> the proper number of pages. Then we schedule the dma from the GPU to this pages and
> once it is done we update the CPU page table to point to this pages. This is why we
> are going over the page table so many times. This should answer most of your questions
> below but i still provide answer for each of them.
>
>>
>> > +int mm_hmm_migrate_back(struct mm_struct *mm,
>> > + struct vm_area_struct *vma,
>> > + pte_t *new_pte,
>> > + unsigned long start,
>> > + unsigned long end)
>> > +{
>> > + pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
>> > + unsigned long addr, i;
>> > + int ret = 0;
>> > +
>> > + VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
>> > +
>> > + if (unlikely(anon_vma_prepare(vma)))
>> > + return -ENOMEM;
>> > +
>> > + start &= PAGE_MASK;
>> > + end = PAGE_ALIGN(end);
>> > + memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
>> > +
>> > + for (addr = start; addr < end;) {
>> > + unsigned long cstart, next;
>> > + spinlock_t *ptl;
>> > + pgd_t *pgdp;
>> > + pud_t *pudp;
>> > + pmd_t *pmdp;
>> > + pte_t *ptep;
>> > +
>> > + pgdp = pgd_offset(mm, addr);
>> > + pudp = pud_offset(pgdp, addr);
>> > + /*
>> > + * Some other thread might already have migrated back the entry
>> > + * and freed the page table. Unlikely thought.
>> > + */
>> > + if (unlikely(!pudp)) {
>> > + addr = min((addr + PUD_SIZE) & PUD_MASK, end);
>> > + continue;
>> > + }
>> > + pmdp = pmd_offset(pudp, addr);
>> > + if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
>> > + pmd_trans_huge(*pmdp))) {
>> > + addr = min((addr + PMD_SIZE) & PMD_MASK, end);
>> > + continue;
>> > + }
>> > + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> > + for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
>> > + next = min((addr + PMD_SIZE) & PMD_MASK, end);
>> > + addr < next; addr += PAGE_SIZE, ptep++, i++) {
>> > + swp_entry_t entry;
>> > +
>> > + entry = pte_to_swp_entry(*ptep);
>> > + if (pte_none(*ptep) || pte_present(*ptep) ||
>> > + !is_hmm_entry(entry) ||
>> > + is_hmm_entry_locked(entry))
>> > + continue;
>> > +
>> > + set_pte_at(mm, addr, ptep, hmm_entry);
>> > + new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
>> > + vma->vm_page_prot));
>> > + }
>> > + pte_unmap_unlock(ptep - 1, ptl);
>>
>>
>> I guess this is fixing all the ptes in the cpu page table mapping a pmd
>> entry. But then what is below ?
>
> Because we are dealing with special swap entry we know we can not have huge pages.
> So we only care about HMM special swap entry. We record entry we want to migrate
> in the new_pte array. The loop above is under pmd spin lock, the loop below does
> memory allocation and we do not want to hold any spin lock while doing allocation.
>

Can this go in as a code comment?

>>
>> > +
>> > + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
>> > + addr < next; addr += PAGE_SIZE, i++) {
>>
>> Your use of vairable addr with multiple loops updating then is also
>> making it complex. We should definitely add more comments here. I guess
>> we are going through the same range we iterated above here.
>
> Correct we are going over the exact same range, i am keeping the addr only
> for alloc_zeroed_user_highpage_movable() purpose.
>

Can we use a different variable name there ?

>>
>> > + struct mem_cgroup *memcg;
>> > + struct page *page;
>> > +
>> > + if (!pte_present(new_pte[i]))
>> > + continue;
>>
>> What is that checking for ?. We set that using pte_mkspecial above ?
>
> Not all entry in the range might match the criteria (ie special unlocked HMM swap
> entry). We want to allocate pages only for entry that match the criteria.
>

Since we did this at the beginning,
memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));

we should not find the present bit set? Using pte_present() there is confusing;
maybe use pte_none() instead. Also, add comments around it explaining the details?

>>
>> > +
>> > + page = alloc_zeroed_user_highpage_movable(vma, addr);
>> > + if (!page) {
>> > + ret = -ENOMEM;
>> > + break;
>> > + }
>> > + __SetPageUptodate(page);
>> > + if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
>> > + &memcg)) {
>> > + page_cache_release(page);
>> > + ret = -ENOMEM;
>> > + break;
>> > + }
>> > + /*
>> > + * We can safely reuse the s_mem/mapping field of page
>> > + * struct to store the memcg as the page is only seen
>> > + * by HMM at this point and we can clear it before it
>> > + * is public see mm_hmm_migrate_back_cleanup().
>> > + */
>> > + page->s_mem = memcg;
>> > + new_pte[i] = mk_pte(page, vma->vm_page_prot);
>> > + if (vma->vm_flags & VM_WRITE) {
>> > + new_pte[i] = pte_mkdirty(new_pte[i]);
>> > + new_pte[i] = pte_mkwrite(new_pte[i]);
>> > + }
>>
>> Why mark it dirty if vm_flags is VM_WRITE ?
>
> It is a left over of some debuging i was doing, i missed it.
>
>>
>> > + }
>> > +
>> > + if (!ret)
>> > + continue;
>> > +
>> > + hmm_entry = swp_entry_to_pte(make_hmm_entry());
>> > + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>
>>
>> Again we loop through the same range ?
>
> Yes but this is the out of memory code path here, ie we have to split the migration
> into several pass. So what happen here is we clear the new_pte array for entry we
> failed to allocate a page for.
>
>>
>> > + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
>> > + addr < next; addr += PAGE_SIZE, ptep++, i++) {
>> > + unsigned long pfn = pte_pfn(new_pte[i]);
>> > +
>> > + if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
>> > + continue;
>>



So here we are using the fact that we set new_pte using the zero pfn in
the first loop, and hence if we find a present new_pte with the zero pfn it implies we
failed to allocate a page for it?

>>
>> What is that checking for ?
>
> If new_pte entry is not present then it is not something we want to migrate. If it
> is present but does not point to zero pfn then it is an entry for which we allocated
> a page so we want to keep it.
>
>> > +
>> > + set_pte_at(mm, addr, ptep, hmm_entry);
>> > + pte_clear(mm, addr, &new_pte[i]);
>>
>> what is that pte_clear for ?. Handling of new_pte needs more code comments.
>>
>
> Entry for which we failed to allocate memory we clear the special HMM swap entry
> as well as the new_pte entry so that migration code knows it does not have to do
> anything here.
>

So that pte_clear is not expected to do any sort of TLB flush etc.? The
idea is just to set new_pte = 0?

Can we do all those conditionals without using pte bits? Checks like
pte_present, is_zero_pfn etc confuse the reader. Instead can
we do

if (pte_state[i] == SKIP_LOOP_FIRST)

if (pte_state[i] == SKIP_LOOP_SECOND)

I understand that we want to return the new_pte array with valid pages, so
maybe the above would make the code more complex, but at least the code should
have more comments explaining each step.

-aneesh

2016-03-21 14:25:40

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v12 22/29] HMM: mm add helper to update page table when migrating memory v3.

Jérôme Glisse <[email protected]> writes:

> +
> + /* Try to fail early on. */
> + if (unlikely(anon_vma_prepare(vma)))
> + return -ENOMEM;
> +

What is this about ?

> +retry:
> + lru_add_drain();
> + tlb_gather_mmu(&tlb, mm, range.start, range.end);
> + update_hiwater_rss(mm);
> + mmu_notifier_invalidate_range_start_excluding(mm, &range,
> + mmu_notifier_exclude);
> + tlb_start_vma(&tlb, vma);
> + for (addr = range.start, i = 0; addr < end && !ret;) {
> + unsigned long cstart, next, npages = 0;
> + spinlock_t *ptl;
> + pgd_t *pgdp;
> + pud_t *pudp;
> + pmd_t *pmdp;
> + pte_t *ptep;
> +
> + /*
> + * Pretty much the exact same logic as __handle_mm_fault(),
> + * exception being the handling of huge pmd.
> + */
> + pgdp = pgd_offset(mm, addr);
> + pudp = pud_alloc(mm, pgdp, addr);
> + if (!pudp) {
> + ret = -ENOMEM;
> + break;
> + }
> + pmdp = pmd_alloc(mm, pudp, addr);
> + if (!pmdp) {
> + ret = -ENOMEM;
> + break;
> + }
> + if (unlikely(pte_alloc(mm, pmdp, addr))) {
> + ret = -ENOMEM;
> + break;
> + }
> +
> + /*
> + * If a huge pmd materialized under us just retry later. Use
> + * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
> + * didn't become pmd_trans_huge under us and then back to pmd_none, as
> + * a result of MADV_DONTNEED running immediately after a huge pmd fault
> + * in a different thread of this mm, in turn leading to a misleading
> + * pmd_trans_huge() retval. All we have to ensure is that it is a
> + * regular pmd that we can walk with pte_offset_map() and we can do that
> + * through an atomic read in C, which is what pmd_trans_unstable()
> + * provides.
> + */
> + if (unlikely(pmd_trans_unstable(pmdp) || pmd_devmap(*pmdp))) {
> + ret = -EAGAIN;
> + break;
> + }
> +
> + /*
> + * If an huge pmd materialized from under us split it and break
> + * out of the loop to retry.
> + */
> + if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp)) {
> + split_huge_pmd(vma, addr, pmdp);
> + ret = -EAGAIN;
> + break;
> + }
> +
> + /*
> + * A regular pmd is established and it can't morph into a huge pmd
> + * from under us anymore at this point because we hold the mmap_sem
> + * read mode and khugepaged takes it in write mode. So now it's
> + * safe to run pte_offset_map().
> + */
> + ptep = pte_offset_map(pmdp, addr);
> +
> + /*
> + * A regular pmd is established and it can't morph into a huge
> + * pmd from under us anymore at this point because we hold the
> + * mmap_sem read mode and khugepaged takes it in write mode. So
> + * now it's safe to run pte_offset_map().
> + */
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);


Why pte_offset_map followed by map_lock ?

> + for (i = (addr - start) >> PAGE_SHIFT, cstart = addr,
> + next = min((addr + PMD_SIZE) & PMD_MASK, end);
> + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> + save_pte[i] = ptep_get_and_clear(mm, addr, ptep);
> + tlb_remove_tlb_entry(&tlb, ptep, addr);
> + set_pte_at(mm, addr, ptep, hmm_entry);
> +
> + if (pte_present(save_pte[i]))
> + continue;
> +
> + if (!pte_none(save_pte[i])) {
> + set_pte_at(mm, addr, ptep, save_pte[i]);
> + ret = -ENOENT;
> + ptep++;
> + break;
> + }

What is special about pte_none? Why break the loop? I guess we are
checking for a swap pte? Why not is_swap_pte()? Is it because we already
checked pte_present?

> + /*
> + * TODO: This mm_forbids_zeropage() really does not
> + * apply to us. First it seems only S390 have it set,
> + * second we are not even using the zero page entry
> + * to populate the CPU page table, thought on error
> + * we might use the save_pte entry to set the CPU
> + * page table entry.
> + *
> + * Live with that oddity for now.
> + */
> + if (mm_forbids_zeropage(mm)) {
> + pte_clear(mm, addr, &save_pte[i]);
> + npages++;
> + continue;
> + }
> + save_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> + vma->vm_page_prot));
> + }
> + pte_unmap_unlock(ptep - 1, ptl);
> +
> + /*
> + * So we must allocate pages before checking for error, which
> + * here indicate that one entry is a swap entry. We need to
> + * allocate first because otherwise there is no easy way to
> + * know on retry or in error code path wether the CPU page
> + * table locked HMM entry is ours or from some other thread.
> + */
> +
> + if (!npages)
> + continue;
> +
> + for (next = addr, addr = cstart,
> + i = (addr - start) >> PAGE_SHIFT;
> + addr < next; addr += PAGE_SIZE, i++) {
> + struct mem_cgroup *memcg;
> + struct page *page;
> +
> + if (pte_present(save_pte[i]) || !pte_none(save_pte[i]))
> + continue;
> +
> + page = alloc_zeroed_user_highpage_movable(vma, addr);
> + if (!page) {
> + ret = -ENOMEM;
> + break;
> + }
> + __SetPageUptodate(page);
> + if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> + &memcg, false)) {
> + page_cache_release(page);
> + ret = -ENOMEM;
> + break;
> + }
> + save_pte[i] = mk_pte(page, vma->vm_page_prot);
> + if (vma->vm_flags & VM_WRITE)
> + save_pte[i] = pte_mkwrite(save_pte[i]);

I guess this also needs to go?

> + inc_mm_counter_fast(mm, MM_ANONPAGES);
> + /*
> + * Because we set the page table entry to the special
> + * HMM locked entry we know no other process might do
> + * anything with it and thus we can safely account the
> + * page without holding any lock at this point.
> + */
> + page_add_new_anon_rmap(page, vma, addr, false);
> + mem_cgroup_commit_charge(page, memcg, false, false);
> + /*
> + * Add to active list so we know vmscan will not waste
> + * its time with that page while we are still using it.
> + */
> + lru_cache_add_active_or_unevictable(page, vma);
> + }
> + }
> + tlb_end_vma(&tlb, vma);
> + mmu_notifier_invalidate_range_end_excluding(mm, &range,
> + mmu_notifier_exclude);
> + tlb_finish_mmu(&tlb, range.start, range.end);
> +
> + if (backoff && *backoff) {
> + /* Stick to the range we updated. */
> + ret = -EAGAIN;
> + end = addr;
> + goto out;
> + }
> +
> + /* Check if something is missing or something went wrong. */
> + if (ret == -ENOENT) {
> + int flags = FAULT_FLAG_ALLOW_RETRY;
> +
> + do {
> + /*
> + * Using __handle_mm_fault() as current->mm != mm ie we
> + * might have been call from a kernel thread on behalf
> + * of a driver and all accounting handle_mm_fault() is
> + * pointless in our case.
> + */
> + ret = __handle_mm_fault(mm, vma, addr, flags);
> + flags |= FAULT_FLAG_TRIED;
> + } while ((ret & VM_FAULT_RETRY));
> + if ((ret & VM_FAULT_ERROR)) {
> + /* Stick to the range we updated. */
> + end = addr;
> + ret = -EFAULT;
> + goto out;
> + }
> + range.start = addr;
> + goto retry;
> + }
> + if (ret == -EAGAIN) {
> + range.start = addr;
> + goto retry;
> + }
> + if (ret)
> + /* Stick to the range we updated. */
> + end = addr;
> +
> + /*
> + * At this point no one else can take a reference on the page from this
> + * process CPU page table. So we can safely check wether we can migrate
> + * or not the page.
> + */
> +
> +out:
> + for (addr = start, i = 0; addr < end;) {
> + unsigned long next;
> + spinlock_t *ptl;
> + pgd_t *pgdp;
> + pud_t *pudp;
> + pmd_t *pmdp;
> + pte_t *ptep;
> +
> + /*
> + * We know for certain that we did set special swap entry for
> + * the range and HMM entry are mark as locked so it means that
> + * no one beside us can modify them which apply that all level
> + * of the CPU page table are valid.
> + */
> + pgdp = pgd_offset(mm, addr);
> + pudp = pud_offset(pgdp, addr);
> + VM_BUG_ON(!pudp);
> + pmdp = pmd_offset(pudp, addr);
> + VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
> + pmd_trans_huge(*pmdp));
> +
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
> + i = (addr - start) >> PAGE_SHIFT; addr < next;
> + addr += PAGE_SIZE, ptep++, i++) {
> + struct page *page;
> + swp_entry_t entry;
> + int swapped;
> +
> + entry = pte_to_swp_entry(save_pte[i]);
> + if (is_hmm_entry(entry)) {
> + /*
> + * Logic here is pretty involve. If save_pte is
> + * an HMM special swap entry then it means that
> + * we failed to swap in that page so error must
> + * be set.
> + *
> + * If that's not the case than it means we are
> + * seriously screw.
> + */
> + VM_BUG_ON(!ret);
> + continue;
> + }
> +
> + /*
> + * This can not happen, no one else can replace our
> + * special entry and as range end is re-ajusted on
> + * error.
> + */
> + entry = pte_to_swp_entry(*ptep);
> + VM_BUG_ON(!is_hmm_entry_locked(entry));
> +
> + /* On error or backoff restore all the saved pte. */
> + if (ret)
> + goto restore;
> +
> + page = vm_normal_page(vma, addr, save_pte[i]);
> + /* The zero page is fine to migrate. */
> + if (!page)
> + continue;
> +
> + /*
> + * Check that only CPU mapping hold a reference on the
> + * page. To make thing simpler we just refuse bail out
> + * if page_mapcount() != page_count() (also accounting
> + * for swap cache).
> + *
> + * There is a small window here where wp_page_copy()
> + * might have decremented mapcount but have not yet
> + * decremented the page count. This is not an issue as
> + * we backoff in that case.
> + */
> + swapped = PageSwapCache(page);
> + if (page_mapcount(page) + swapped == page_count(page))
> + continue;
> +
> +restore:
> + /* Ok we have to restore that page. */
> + set_pte_at(mm, addr, ptep, save_pte[i]);
> + /*
> + * No need to invalidate - it was non-present
> + * before.
> + */
> + update_mmu_cache(vma, addr, ptep);
> + pte_clear(mm, addr, &save_pte[i]);
> + }
> + pte_unmap_unlock(ptep - 1, ptl);
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(mm_hmm_migrate);

-aneesh

2016-03-21 14:31:12

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.

On Mon, Mar 21, 2016 at 07:18:41PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <[email protected]> writes:
> > [ text/plain ]
> > On Mon, Mar 21, 2016 at 04:57:32PM +0530, Aneesh Kumar K.V wrote:
> >> Jérôme Glisse <[email protected]> writes:

[...]

> >> > + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> >> > + for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
> >> > + next = min((addr + PMD_SIZE) & PMD_MASK, end);
> >> > + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> >> > + swp_entry_t entry;
> >> > +
> >> > + entry = pte_to_swp_entry(*ptep);
> >> > + if (pte_none(*ptep) || pte_present(*ptep) ||
> >> > + !is_hmm_entry(entry) ||
> >> > + is_hmm_entry_locked(entry))
> >> > + continue;
> >> > +
> >> > + set_pte_at(mm, addr, ptep, hmm_entry);
> >> > + new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> >> > + vma->vm_page_prot));
> >> > + }
> >> > + pte_unmap_unlock(ptep - 1, ptl);
> >>
> >>
> >> I guess this is fixing all the ptes in the cpu page table mapping a pmd
> >> entry. But then what is below ?
> >
> > Because we are dealing with special swap entry we know we can not have huge pages.
> > So we only care about HMM special swap entry. We record entry we want to migrate
> > in the new_pte array. The loop above is under pmd spin lock, the loop below does
> > memory allocation and we do not want to hold any spin lock while doing allocation.
> >
>
> Can this go as code comment ?

Yes of course, I should have added more comments in the first place.


> >> > +
> >> > + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> >> > + addr < next; addr += PAGE_SIZE, i++) {
> >>
> >> Your use of vairable addr with multiple loops updating then is also
> >> making it complex. We should definitely add more comments here. I guess
> >> we are going through the same range we iterated above here.
> >
> > Correct we are going over the exact same range, i am keeping the addr only
> > for alloc_zeroed_user_highpage_movable() purpose.
> >
>
> Can we use a different variable name there ?

Do you have a suggestion for a name? I am just lacking imagination, but I can use
a different name like vaddr.


> >> > + struct mem_cgroup *memcg;
> >> > + struct page *page;
> >> > +
> >> > + if (!pte_present(new_pte[i]))
> >> > + continue;
> >>
> >> What is that checking for ?. We set that using pte_mkspecial above ?
> >
> > Not all entry in the range might match the criteria (ie special unlocked HMM swap
> > entry). We want to allocate pages only for entry that match the criteria.
> >
>
> Since we did in the beginning,
> memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
>
> we should not find present bit set ? using present there is confusing,
> may be pte_none(). Also with comments around explaining the details ?

Yes, pte_none() will work too; I will use that and add comments.


> >> > + page = alloc_zeroed_user_highpage_movable(vma, addr);
> >> > + if (!page) {
> >> > + ret = -ENOMEM;
> >> > + break;
> >> > + }
> >> > + __SetPageUptodate(page);
> >> > + if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> >> > + &memcg)) {
> >> > + page_cache_release(page);
> >> > + ret = -ENOMEM;
> >> > + break;
> >> > + }
> >> > + /*
> >> > + * We can safely reuse the s_mem/mapping field of page
> >> > + * struct to store the memcg as the page is only seen
> >> > + * by HMM at this point and we can clear it before it
> >> > + * is public see mm_hmm_migrate_back_cleanup().
> >> > + */
> >> > + page->s_mem = memcg;
> >> > + new_pte[i] = mk_pte(page, vma->vm_page_prot);
> >> > + if (vma->vm_flags & VM_WRITE) {
> >> > + new_pte[i] = pte_mkdirty(new_pte[i]);
> >> > + new_pte[i] = pte_mkwrite(new_pte[i]);
> >> > + }
> >>
> >> Why mark it dirty if vm_flags is VM_WRITE ?
> >
> > It is a left over of some debuging i was doing, i missed it.

I actually remember why I set the dirty bit: I wanted to change the driver
API to have drivers clear the dirty bit if they did not write, instead of
relying on them to set it if they did. I thought it was safer for coping with
potentially buggy drivers. I might update the patchset to do that.

[...]

> >> > + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> >> > + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> >> > + unsigned long pfn = pte_pfn(new_pte[i]);
> >> > +
> >> > + if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
> >> > + continue;
> >>
>
> So here we are using the fact that we had set new pte using zero pfn in
> the firs loop and hence if we find a present new_pte with zero pfn, it implies we
> failed to allocate a page for that ?

Yes, that's correct. I could use another pte flag instead of relying on the zero
pfn.

[...]

> >> > +
> >> > + set_pte_at(mm, addr, ptep, hmm_entry);
> >> > + pte_clear(mm, addr, &new_pte[i]);
> >>
> >> what is that pte_clear for ?. Handling of new_pte needs more code comments.
> >>
> >
> > Entry for which we failed to allocate memory we clear the special HMM swap entry
> > as well as the new_pte entry so that migration code knows it does not have to do
> > anything here.
> >
>
> So that pte_clear is not expecting to do any sort of tlb flushes etc ? The
> idea is to put new_pte = 0 ?.

Correct, no TLB flushing is needed; new_pte is a private array used only during
migration and never exposed to the outside world. I will change it to new_pte[i] = 0
instead.


>
> Can we do all those conditionals without using pte bits ? A check like
> pte_present, is_zero_pfn etc confuse the reader. Instead can
> we do
>
> if (pte_state[i] == SKIP_LOOP_FIRST)
>
> if (pte_state[i] == SKIP_LOOP_SECOND)
>
> I understand that we want to return new_pte array with valid pages, so
> may be the above will make code complex, but atleast code should have
> more comments explaining each step

Well, another point of new_pte is that we can directly use the new_pte
value to update the CPU page table in the final migration step. But I
can define some HMM_PTE_MIGRATE and HMM_PTE_RESTORE aliases of existing pte
flags, to be cleared along the way depending on the outcome of each
step.

Jérôme

2016-03-23 06:52:50

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v12 08/29] HMM: add device page fault support v6.

Jérôme Glisse <[email protected]> writes:

> [ text/plain ]
> This patch add helper for device page fault. Thus helpers will fill
> the mirror page table using the CPU page table and synchronizing
> with any update to CPU page table.
>
> Changed since v1:
> - Add comment about directory lock.
>
> Changed since v2:
> - Check for mirror->hmm in hmm_mirror_fault()
>
> Changed since v3:
> - Adapt to HMM page table changes.
>
> Changed since v4:
> - Fix PROT_NONE, ie do not populate from protnone pte.
> - Fix huge pmd handling (start address may != pmd start address)
> - Fix missing entry case.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Signed-off-by: Sherry Cheung <[email protected]>
> Signed-off-by: Subhash Gutti <[email protected]>
> Signed-off-by: Mark Hairgrove <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> Signed-off-by: Jatin Kumar <[email protected]>
> ---


....
....

> +static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
> + struct hmm_event *event,
> + struct vm_area_struct *vma,
> + struct hmm_pt_iter *iter,
> + pmd_t *pmdp,
> + struct hmm_mirror_fault *mirror_fault,
> + unsigned long start,
> + unsigned long end)
> +{
> + struct page *page;
> + unsigned long addr, pfn;
> + unsigned flags = FOLL_TOUCH;
> + spinlock_t *ptl;
> + int ret;
> +
> + ptl = pmd_lock(mirror->hmm->mm, pmdp);
> + if (unlikely(!pmd_trans_huge(*pmdp))) {
> + spin_unlock(ptl);
> + return -EAGAIN;
> + }
> + flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
> + page = follow_trans_huge_pmd(vma, start, pmdp, flags);
> + pfn = page_to_pfn(page);
> + spin_unlock(ptl);
> +
> + /* Just fault in the whole PMD. */
> + start &= PMD_MASK;
> + end = start + PMD_SIZE - 1;
> +
> + if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
> + return -ENOENT;
> +
> + for (ret = 0, addr = start; !ret && addr < end;) {
> + unsigned long i, next = end;
> + dma_addr_t *hmm_pte;
> +
> + hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
> + if (!hmm_pte)
> + return -ENOMEM;
> +
> + i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
> +
> + /*
> + * The directory lock protect against concurrent clearing of
> + * page table bit flags. Exceptions being the dirty bit and
> + * the device driver private flags.
> + */
> + hmm_pt_iter_directory_lock(iter);
> + do {
> + if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
> + hmm_pte[i] = hmm_pte_from_pfn(pfn);
> + hmm_pt_iter_directory_ref(iter);

I looked at that and it is actually
static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
{
BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
}

static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
struct page *ptd)
{
if (!atomic_inc_not_zero(&ptd->_mapcount))
/* Illegal this should not happen. */
BUG();
}

what is the mapcount update about ?

> + }
> + BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
> + if (pmd_write(*pmdp))
> + hmm_pte_set_write(&hmm_pte[i]);
> + } while (addr += PAGE_SIZE, pfn++, i++, addr != next);
> + hmm_pt_iter_directory_unlock(iter);
> + mirror_fault->addr = addr;
> + }
> +

So we don't have huge page mappings in the hmm page table?


> + return 0;
> +}
> +
> +static int hmm_pte_hole(unsigned long addr,
> + unsigned long next,
> + struct mm_walk *walk)
> +{
> + return -ENOENT;
> +}
> +


-aneesh

2016-03-23 10:10:09

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 08/29] HMM: add device page fault support v6.

On Wed, Mar 23, 2016 at 12:22:23PM +0530, Aneesh Kumar K.V wrote:
> Jérôme Glisse <[email protected]> writes:
>
> > [ text/plain ]
> > This patch add helper for device page fault. Thus helpers will fill
> > the mirror page table using the CPU page table and synchronizing
> > with any update to CPU page table.
> >
> > Changed since v1:
> > - Add comment about directory lock.
> >
> > Changed since v2:
> > - Check for mirror->hmm in hmm_mirror_fault()
> >
> > Changed since v3:
> > - Adapt to HMM page table changes.
> >
> > Changed since v4:
> > - Fix PROT_NONE, ie do not populate from protnone pte.
> > - Fix huge pmd handling (start address may != pmd start address)
> > - Fix missing entry case.
> >
> > Signed-off-by: Jérôme Glisse <[email protected]>
> > Signed-off-by: Sherry Cheung <[email protected]>
> > Signed-off-by: Subhash Gutti <[email protected]>
> > Signed-off-by: Mark Hairgrove <[email protected]>
> > Signed-off-by: John Hubbard <[email protected]>
> > Signed-off-by: Jatin Kumar <[email protected]>
> > ---
>
>
> ....
> ....
>
> > +static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
> > + struct hmm_event *event,
> > + struct vm_area_struct *vma,
> > + struct hmm_pt_iter *iter,
> > + pmd_t *pmdp,
> > + struct hmm_mirror_fault *mirror_fault,
> > + unsigned long start,
> > + unsigned long end)
> > +{
> > + struct page *page;
> > + unsigned long addr, pfn;
> > + unsigned flags = FOLL_TOUCH;
> > + spinlock_t *ptl;
> > + int ret;
> > +
> > + ptl = pmd_lock(mirror->hmm->mm, pmdp);
> > + if (unlikely(!pmd_trans_huge(*pmdp))) {
> > + spin_unlock(ptl);
> > + return -EAGAIN;
> > + }
> > + flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
> > + page = follow_trans_huge_pmd(vma, start, pmdp, flags);
> > + pfn = page_to_pfn(page);
> > + spin_unlock(ptl);
> > +
> > + /* Just fault in the whole PMD. */
> > + start &= PMD_MASK;
> > + end = start + PMD_SIZE - 1;
> > +
> > + if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
> > + return -ENOENT;
> > +
> > + for (ret = 0, addr = start; !ret && addr < end;) {
> > + unsigned long i, next = end;
> > + dma_addr_t *hmm_pte;
> > +
> > + hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
> > + if (!hmm_pte)
> > + return -ENOMEM;
> > +
> > + i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
> > +
> > + /*
> > + * The directory lock protect against concurrent clearing of
> > + * page table bit flags. Exceptions being the dirty bit and
> > + * the device driver private flags.
> > + */
> > + hmm_pt_iter_directory_lock(iter);
> > + do {
> > + if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
> > + hmm_pte[i] = hmm_pte_from_pfn(pfn);
> > + hmm_pt_iter_directory_ref(iter);
>
> I looked at that and it is actually
> static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
> {
> BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
> hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
> }
>
> static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
> struct page *ptd)
> {
> if (!atomic_inc_not_zero(&ptd->_mapcount))
> /* Illegal this should not happen. */
> BUG();
> }
>
> what is the mapcount update about ?

Unlike the regular CPU page table, we do not rely on unmap to prune the HMM mirror
page table. Rather, we free/prune it aggressively once the device no longer
has anything mirrored in a given range.

As such, mapcount is used to keep track of how many valid entries there are per
directory.

Moreover, mapcount is also used to protect against concurrent pruning: when
you walk through the page table you increment the refcount by one along your
way, and when you are done walking you decrement it.

Because of that last aspect, the mapcount can never reach zero just because we
unmap pages; it can only reach zero once we clean up after the page table walk.
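
The counting model is roughly the following (illustrative sketch only,
not the actual helpers; hmm_pt_directory_free() is a hypothetical name
standing in for the real free path):

	/* Each valid pte in a directory holds one reference on it, and
	 * an active walker holds one more, so a directory can never be
	 * freed from under a walker. */
	atomic_inc(&ptd->_mapcount);		/* walker takes its reference */
	/* ... set or clear entries, each taking/dropping one reference ... */
	if (atomic_dec_and_test(&ptd->_mapcount))
		hmm_pt_directory_free(ptd);	/* last reference dropped */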

>
> > + }
> > + BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
> > + if (pmd_write(*pmdp))
> > + hmm_pte_set_write(&hmm_pte[i]);
> > + } while (addr += PAGE_SIZE, pfn++, i++, addr != next);
> > + hmm_pt_iter_directory_unlock(iter);
> > + mirror_fault->addr = addr;
> > + }
> > +
>
> So we don't have huge page mapping in hmm page table ?

No we don't right now. The first reason is that I wanted to keep things simple
for device drivers. The second motivation is to keep the first patchset
simpler, especially the page migration code.

Memory overhead is 2MB per GB of virtual memory mirrored. There is no TLB here.
I believe adding huge page support can be done as part of a later patchset if
it makes sense.
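
(For reference, the 2MB figure follows from the entry size: assuming 8-byte
mirror entries, i.e. the dma_addr_t pte shown above on a 64-bit build, 1GB /
4KB pages = 262144 entries, and 262144 * 8 bytes = 2MB of last-level mirror
directories per GB mirrored.)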

Cheers,
Jérôme

2016-03-23 10:29:51

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v12 08/29] HMM: add device page fault support v6.

Jerome Glisse <[email protected]> writes:

> [ text/plain ]
> On Wed, Mar 23, 2016 at 12:22:23PM +0530, Aneesh Kumar K.V wrote:
>> Jérôme Glisse <[email protected]> writes:
>>
>> > [ text/plain ]
>> > This patch adds helpers for device page faults. These helpers fill
>> > the mirror page table using the CPU page table, synchronizing
>> > with any update to the CPU page table.
>> >
>> > Changed since v1:
>> > - Add comment about directory lock.
>> >
>> > Changed since v2:
>> > - Check for mirror->hmm in hmm_mirror_fault()
>> >
>> > Changed since v3:
>> > - Adapt to HMM page table changes.
>> >
>> > Changed since v4:
>> > - Fix PROT_NONE, ie do not populate from protnone pte.
>> > - Fix huge pmd handling (start address may != pmd start address)
>> > - Fix missing entry case.
>> >
>> > Signed-off-by: Jérôme Glisse <[email protected]>
>> > Signed-off-by: Sherry Cheung <[email protected]>
>> > Signed-off-by: Subhash Gutti <[email protected]>
>> > Signed-off-by: Mark Hairgrove <[email protected]>
>> > Signed-off-by: John Hubbard <[email protected]>
>> > Signed-off-by: Jatin Kumar <[email protected]>
>> > ---
>>
>>
>> ....
>> ....
>>
>> +static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
>> > + struct hmm_event *event,
>> > + struct vm_area_struct *vma,
>> > + struct hmm_pt_iter *iter,
>> > + pmd_t *pmdp,
>> > + struct hmm_mirror_fault *mirror_fault,
>> > + unsigned long start,
>> > + unsigned long end)
>> > +{
>> > + struct page *page;
>> > + unsigned long addr, pfn;
>> > + unsigned flags = FOLL_TOUCH;
>> > + spinlock_t *ptl;
>> > + int ret;
>> > +
>> > + ptl = pmd_lock(mirror->hmm->mm, pmdp);
>> > + if (unlikely(!pmd_trans_huge(*pmdp))) {
>> > + spin_unlock(ptl);
>> > + return -EAGAIN;
>> > + }
>> > + flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
>> > + page = follow_trans_huge_pmd(vma, start, pmdp, flags);
>> > + pfn = page_to_pfn(page);
>> > + spin_unlock(ptl);
>> > +
>> > + /* Just fault in the whole PMD. */
>> > + start &= PMD_MASK;
>> > + end = start + PMD_SIZE - 1;
>> > +
>> > + if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
>> > + return -ENOENT;
>> > +
>> > + for (ret = 0, addr = start; !ret && addr < end;) {
>> > + unsigned long i, next = end;
>> > + dma_addr_t *hmm_pte;
>> > +
>> > + hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
>> > + if (!hmm_pte)
>> > + return -ENOMEM;
>> > +
>> > + i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
>> > +
>> > + /*
>> > + * The directory lock protect against concurrent clearing of
>> > + * page table bit flags. Exceptions being the dirty bit and
>> > + * the device driver private flags.
>> > + */
>> > + hmm_pt_iter_directory_lock(iter);
>> > + do {
>> > + if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
>> > + hmm_pte[i] = hmm_pte_from_pfn(pfn);
>> > + hmm_pt_iter_directory_ref(iter);
>>
>> I looked at that and it is actually
>> static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
>> {
>> BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
>> hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
>> }
>>
>> static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
>> struct page *ptd)
>> {
>> if (!atomic_inc_not_zero(&ptd->_mapcount))
>> /* Illegal this should not happen. */
>> BUG();
>> }
>>
>> what is the mapcount update about ?
>
> Unlike the regular CPU page table we do not rely on unmap to prune the HMM
> mirror page table. Rather we free/prune it aggressively once the device no
> longer has anything mirrored in a given range.

Which patch does this ?

>
> As such the mapcount is used to keep track of how many valid entries there
> are per directory.
>
> Moreover the mapcount is also used to protect against concurrent pruning:
> when you walk through the page table you take one extra reference along the
> way, and when you are done walking you drop it.
>
> Because of that last aspect, the mapcount can never reach zero merely because
> we unmap pages; it can only reach zero once we clean up after the page table
> walk.
>
>>
>> > + }
>> > + BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
>> > + if (pmd_write(*pmdp))
>> > + hmm_pte_set_write(&hmm_pte[i]);
>> > + } while (addr += PAGE_SIZE, pfn++, i++, addr != next);
>> > + hmm_pt_iter_directory_unlock(iter);
>> > + mirror_fault->addr = addr;
>> > + }
>> > +
>>
>> So we don't have huge page mapping in hmm page table ?
>
> No we don't right now. The first reason is that I wanted to keep things
> simple for device drivers. The second motivation is to keep the first
> patchset simpler, especially the page migration code.
>
> Memory overhead is 2MB per GB of virtual memory mirrored. There is no TLB
> here. I believe adding huge page support can be done as part of a later
> patchset if it makes sense.
>

One of the things I am wondering is whether we can do the patch series in such
a way that we move the page table mirror into the device driver. That is, an
hmm fault would look at the cpu page table and call into a device driver
callback with the pte entry details. It would be up to the device driver to
maintain a mirror table if needed. Similarly, for a cpu fault we would call
into an hmm callback to find the per-pte dma_addr and do a migrate using a
copy_from_device callback. I haven't fully looked at how easy this would be,
but I guess a lot of the code in this series has to do with the mirror table,
and I am wondering whether there is a simpler version we can get upstream
that hides it within a driver.
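
A rough sketch of the kind of driver-side interface this suggests (names and
signatures below are invented purely for illustration; they are not from the
posted patches):

struct hmm_dev_mirror_ops {
        /*
         * On a device fault, HMM walks the CPU page table and hands the
         * resulting pte details to the driver, which may keep its own
         * mirror table in whatever format suits its hardware.
         */
        int (*update_pte)(struct hmm_device *dev, unsigned long addr,
                          unsigned long pfn, bool write);
        /*
         * On a CPU fault to memory migrated to the device, HMM asks the
         * driver to copy the data back into the destination page.
         */
        int (*copy_from_device)(struct hmm_device *dev, unsigned long addr,
                                struct page *dst);
};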


Also, does it simplify things to have interfaces that operate on one pte
rather than on an array of ptes?

-aneesh

2016-03-23 11:25:50

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 08/29] HMM: add device page fault support v6.

On Wed, Mar 23, 2016 at 03:59:32PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <[email protected]> writes:

[...]

> >> +static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
> >> > + struct hmm_event *event,
> >> > + struct vm_area_struct *vma,
> >> > + struct hmm_pt_iter *iter,
> >> > + pmd_t *pmdp,
> >> > + struct hmm_mirror_fault *mirror_fault,
> >> > + unsigned long start,
> >> > + unsigned long end)
> >> > +{
> >> > + struct page *page;
> >> > + unsigned long addr, pfn;
> >> > + unsigned flags = FOLL_TOUCH;
> >> > + spinlock_t *ptl;
> >> > + int ret;
> >> > +
> >> > + ptl = pmd_lock(mirror->hmm->mm, pmdp);
> >> > + if (unlikely(!pmd_trans_huge(*pmdp))) {
> >> > + spin_unlock(ptl);
> >> > + return -EAGAIN;
> >> > + }
> >> > + flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
> >> > + page = follow_trans_huge_pmd(vma, start, pmdp, flags);
> >> > + pfn = page_to_pfn(page);
> >> > + spin_unlock(ptl);
> >> > +
> >> > + /* Just fault in the whole PMD. */
> >> > + start &= PMD_MASK;
> >> > + end = start + PMD_SIZE - 1;
> >> > +
> >> > + if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
> >> > + return -ENOENT;
> >> > +
> >> > + for (ret = 0, addr = start; !ret && addr < end;) {
> >> > + unsigned long i, next = end;
> >> > + dma_addr_t *hmm_pte;
> >> > +
> >> > + hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
> >> > + if (!hmm_pte)
> >> > + return -ENOMEM;
> >> > +
> >> > + i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
> >> > +
> >> > + /*
> >> > + * The directory lock protect against concurrent clearing of
> >> > + * page table bit flags. Exceptions being the dirty bit and
> >> > + * the device driver private flags.
> >> > + */
> >> > + hmm_pt_iter_directory_lock(iter);
> >> > + do {
> >> > + if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
> >> > + hmm_pte[i] = hmm_pte_from_pfn(pfn);
> >> > + hmm_pt_iter_directory_ref(iter);
> >>
> >> I looked at that and it is actually
> >> static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
> >> {
> >> BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
> >> hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
> >> }
> >>
> >> static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
> >> struct page *ptd)
> >> {
> >> if (!atomic_inc_not_zero(&ptd->_mapcount))
> >> /* Illegal this should not happen. */
> >> BUG();
> >> }
> >>
> >> what is the mapcount update about ?
> >
> > Unlike the regular CPU page table we do not rely on unmap to prune the HMM
> > mirror page table. Rather we free/prune it aggressively once the device no
> > longer has anything mirrored in a given range.
>
> Which patch does this ?

Well it is done in hmm_pt_iter_directory_unref_safe() so there is no particular
patch per se. One optimization I want to do, as part of a later patch, is to
delay directory pruning so that we avoid freeing and then reallocating right
away when the device or some memory event wrongly led us to believe it was
done with a range. But I do not want to complicate the code before knowing,
with hard numbers, whether it makes sense to do so.


> > As such the mapcount is used to keep track of how many valid entries there
> > are per directory.
> >
> > Moreover the mapcount is also used to protect against concurrent pruning:
> > when you walk through the page table you take one extra reference along
> > the way, and when you are done walking you drop it.
> >
> > Because of that last aspect, the mapcount can never reach zero merely
> > because we unmap pages; it can only reach zero once we clean up after the
> > page table walk.
> >
> >>
> >> > + }
> >> > + BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
> >> > + if (pmd_write(*pmdp))
> >> > + hmm_pte_set_write(&hmm_pte[i]);
> >> > + } while (addr += PAGE_SIZE, pfn++, i++, addr != next);
> >> > + hmm_pt_iter_directory_unlock(iter);
> >> > + mirror_fault->addr = addr;
> >> > + }
> >> > +
> >>
> >> So we don't have huge page mapping in hmm page table ?
> >
> > No we don't right now. The first reason is that I wanted to keep things
> > simple for device drivers. The second motivation is to keep the first
> > patchset simpler, especially the page migration code.
> >
> > Memory overhead is 2MB per GB of virtual memory mirrored. There is no TLB
> > here. I believe adding huge page support can be done as part of a later
> > patchset if it makes sense.
> >
>
> One of the things I am wondering is whether we can do the patch series in
> such a way that we move the page table mirror into the device driver. That
> is, an hmm fault would look at the cpu page table and call into a device
> driver callback with the pte entry details. It would be up to the device
> driver to maintain a mirror table if needed. Similarly, for a cpu fault we
> would call into an hmm callback to find the per-pte dma_addr and do a
> migrate using a copy_from_device callback. I haven't fully looked at how
> easy this would be, but I guess a lot of the code in this series has to do
> with the mirror table, and I am wondering whether there is a simpler version
> we can get upstream that hides it within a driver.

This is one possibility but it means that many device drivers will duplicate
page table code. It also means that some optimizations I want to do down the
road are not doable. Most notably I want to share IOMMU directories among
several devices (when those devices mirror the same virtual address range),
but this requires work in the DMA/IOMMU code.

Another side is related to page reclamation: with pages in use by a device we
could get stalls, because device page table invalidation is far more complex
and takes more time than CPU page table invalidation.

Having the mirror in common code makes it easier to have a new lru list for
pages referenced by a device, allowing us to hide device page table
invalidation latency. This is probably also doable if we hide the mirror page
table inside the device driver, but then it is harder for common code to know
which device it needs to ask for unmapping. Also this would require either a
new page flag or a new pte flag, both of which are in short supply, and I am
not sure people would be thrilled to reserve one just for this feature.

Also I think we want to limit device usage of things like the mmu_notifier
API. At least I would.

Another possibility that I did explore is having common code manage mirror
ranges (instead of a page table) and have the device driver deal on its own
with the page-level mirroring. I even have a patch doing this somewhere. This
might be a middle ground solution. Note that by range I mean something like:

struct mirror_range {
        struct hmm_device *hdev;
        unsigned long start; /* virtual start address for the range */
        unsigned long end;   /* virtual end address for the range */
        /* other fields, like for an rb_tree, and flags. */
};

But it gets quite ugly with range merging/splitting and the obvious worst case
of having one of these structs per page (like mirroring a range every other
page).


> Also, does it simplify things to have interfaces that operate on one pte
> rather than on an array of ptes?

I strongly believe we do not want to do that. A GPU is like 2048 cores with
16384 threads in flight; if each of those threads page faults over a linear
range you end up having to do 16384 calls and the overhead is going to kill
performance. GPUs are about batching things up, so doing things in bulk is
what we want for performance.
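
(For a rough sense of scale, with purely illustrative numbers: if each per-pte
call carries even ~1us of fixed overhead, 16384 individual calls add ~16ms to
service one wave of faults, whereas a single batched call over the same range
pays that fixed cost once.)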

Also, I should add that on a GPU, saving thread context out to memory and
swapping in another is far more expensive. First, you can only do so on a
large boundary, i.e. 256 threads at a time or more depending on the GPU.
Second, each thread carries a lot of state, think a few kilobytes, so you
easily end up moving around megabytes of thread context data. This is not
lightweight. It is a different paradigm from the CPU.
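
(Again with illustrative numbers only: at a granularity of 256 threads and,
say, 4KB of context per thread, a single swap already moves 256 * 4KB = 1MB.)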

Cheers,
Jérôme

2016-03-29 22:58:34

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v12 07/29] HMM: add per mirror page table v4.

On Tue, 8 Mar 2016, Jérôme Glisse wrote:

> This patch adds the per mirror page table. It also propagates CPU page
> table updates to this per mirror page table using mmu_notifier callbacks.
> All updates are contextualized with an HMM event structure that conveys
> all the information needed by the device driver to take proper actions
> (update its own mmu to reflect changes and schedule proper flushing).
>
> Core HMM is responsible for updating the per mirror page table once
> the device driver is done with its update. Most importantly HMM will
> properly propagate the HMM page table dirty bit to the underlying page.
>
> Changed since v1:
> - Removed unused fence code to defer it to latter patches.
>
> Changed since v2:
> - Use new bit flag helper for mirror page table manipulation.
> - Differentiate fork event with HMM_FORK from other events.
>
> Changed since v3:
> - Get rid of HMM_ISDIRTY and rely on write protect instead.
> - Adapt to HMM page table changes
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Signed-off-by: Sherry Cheung <[email protected]>
> Signed-off-by: Subhash Gutti <[email protected]>
> Signed-off-by: Mark Hairgrove <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> Signed-off-by: Jatin Kumar <[email protected]>
> ---
> include/linux/hmm.h | 83 ++++++++++++++++++++
> mm/hmm.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 304 insertions(+)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index b559c0b..5488fa9 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -46,6 +46,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/workqueue.h>
> #include <linux/mman.h>
> +#include <linux/hmm_pt.h>
>
>
> struct hmm_device;
> @@ -53,6 +54,38 @@ struct hmm_mirror;
> struct hmm;
>
>
> +/*
> + * hmm_event - each event is described by a type associated with a struct.
> + */
> +enum hmm_etype {
> + HMM_NONE = 0,
> + HMM_FORK,
> + HMM_MIGRATE,
> + HMM_MUNMAP,
> + HMM_DEVICE_RFAULT,
> + HMM_DEVICE_WFAULT,

Hi Jerome,

Just a tiny thing I noticed, while connecting HMM to NVIDIA's upcoming
device driver: the last two enum items above should probably be named
like this:

HMM_DEVICE_READ_FAULT,
HMM_DEVICE_WRITE_FAULT,

instead of _WFAULT / _RFAULT. (Earlier code reviewers asked for more
clarity on these types of names.)

thanks,
John Hubbard

> + HMM_WRITE_PROTECT,
> +};
> +