Cliff notes: HMM offers two things, each standing on its own. First,
it allows device memory to be used transparently inside any process
without any modification to the process's program code. Second, it
allows a process address space to be mirrored on a device.
Changes since v17:
- typos
- ZONE_DEVICE page refcount decrement moved to put_zone_device_page()
Work is still underway to use this feature inside the upstream
nouveau driver. It has been tested with the closed source driver,
and testing is still underway on top of a new kernel. So far we have
found no issues. I expect to get a Tested-by soon. This feature is
not only useful for NVidia GPUs; I expect AMD GPUs will need it too
if they want to support some of the new industry APIs. I also expect
some FPGA companies, and probably other hardware vendors, to use it.
That being said, I don't expect I will ever get a Reviewed-by from
anyone, for reasons beyond my control. Many people have read the code,
and I incorporated their comments each time they had any. So I believe
this code has had sufficient scrutiny from various people to warrant it
being merged. I am willing to face and deal with any fallout, but I
don't expect any, as this is opt-in code, though I believe all major
distributions will enable it in order to support new hardware.
I do not wish to compete for the patchset with the highest revision
count, and I would like a clear-cut position on whether it can be
merged or not. If not, I would like to know why, because I am more
than willing to address any issues people might have. I just don't
want to keep submitting it over and over until I end up in hell.
So please consider applying it for 4.12.
Known issues:
Device memory picks some random unused physical address range. Later
memory hotplug might fail because of this. The intention is to fix this
in a later patchset by using physical addresses above the platform
limit, thus making sure that no real memory can be hotplugged at a
conflicting address.
Patchset overview:
The patchset is divided into 3 features that can each be used
independently of one another. First is changes to ZONE_DEVICE so we can
have struct page for un-addressable device memory (patches 1-4 and
13-14). Second is process address space mirroring (patches 8 to 11);
this allows snapshotting the CPU page table and keeping the device page
table synchronized with the CPU one.
Last is a new page migration helper which allows migrating a range of
virtual addresses using a hardware copy engine (patches 5-7 for the new
migrate function and 12 for migration of un-addressable memory).
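To give a feel for that last helper, here is a hypothetical sketch of a
driver's copy stage (not code from the series): it only relies on the
MIGRATE_PFN_* pfn encoding and the migrate_pfn_to_page() helper added
in patch 5, while device_copy_page() is a made-up placeholder for the
driver's hardware copy-engine routine.

	#include <linux/migrate.h>

	/* Placeholder: schedule the device copy engine for one page. */
	extern void device_copy_page(struct page *dst, struct page *src);

	static void example_copy_stage(const unsigned long *src,
				       const unsigned long *dst,
				       unsigned long npages)
	{
		unsigned long i;

		for (i = 0; i < npages; i++) {
			/* Decode the pfn arrays produced by the collect stage. */
			struct page *spage = migrate_pfn_to_page(src[i]);
			struct page *dpage = migrate_pfn_to_page(dst[i]);

			/* Nothing to copy for entries with no source or destination. */
			if (!spage || !dpage)
				continue;
			device_copy_page(dpage, spage);
		}
	}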
Future plans:
In this patchset I restricted myself to a set of core features. What
is missing:
- forcing read-only on the CPU for memory duplication and GPU atomic operations
- changes to mmu_notifier for optimization purposes
- migration of file-backed pages to device memory
I plan to submit a couple more patchsets to implement those features
once core HMM is upstream.
Git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v18
Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/
v8 https://lwn.net/Articles/645515/
v9 https://lwn.net/Articles/651553/
v10 https://lwn.net/Articles/654430/
v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
v12 http://www.kernelhub.org/?msg=972982&p=2
v13 https://lwn.net/Articles/706856/
v14 https://lkml.org/lkml/2016/12/8/344
v15 http://www.mail-archive.com/[email protected]/msg1304107.html
v16 http://www.spinics.net/lists/linux-mm/msg119814.html
v17 https://lkml.org/lkml/2017/1/27/847
Jérôme Glisse (16):
mm/memory/hotplug: convert device bool to int to allow for more flags
v3
mm/put_page: move ref decrement to put_zone_device_page()
mm/ZONE_DEVICE/free-page: callback when page is freed v3
mm/ZONE_DEVICE/unaddressable: add support for un-addressable device
memory v3
mm/ZONE_DEVICE/x86: add support for un-addressable device memory
mm/migrate: add new boolean copy flag to migratepage() callback
mm/migrate: new memory migration helper for use with device memory v4
mm/migrate: migrate_vma() unmap page from vma while collecting pages
mm/hmm: heterogeneous memory management (HMM for short)
mm/hmm/mirror: mirror process address space on device with HMM helpers
mm/hmm/mirror: helper to snapshot CPU page table v2
mm/hmm/mirror: device page fault handler
mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
mm/migrate: allow migrate_vma() to alloc new page on empty entry
mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2
MAINTAINERS | 7 +
arch/ia64/mm/init.c | 23 +-
arch/powerpc/mm/mem.c | 23 +-
arch/s390/mm/init.c | 10 +-
arch/sh/mm/init.c | 22 +-
arch/tile/mm/init.c | 10 +-
arch/x86/mm/init_32.c | 23 +-
arch/x86/mm/init_64.c | 41 +-
drivers/staging/lustre/lustre/llite/rw26.c | 8 +-
fs/aio.c | 7 +-
fs/btrfs/disk-io.c | 11 +-
fs/f2fs/data.c | 8 +-
fs/f2fs/f2fs.h | 2 +-
fs/hugetlbfs/inode.c | 9 +-
fs/nfs/internal.h | 5 +-
fs/nfs/write.c | 9 +-
fs/proc/task_mmu.c | 7 +
fs/ubifs/file.c | 8 +-
include/linux/balloon_compaction.h | 3 +-
include/linux/fs.h | 13 +-
include/linux/hmm.h | 468 +++++++++++
include/linux/ioport.h | 1 +
include/linux/memory_hotplug.h | 31 +-
include/linux/memremap.h | 37 +
include/linux/migrate.h | 86 +-
include/linux/mm.h | 8 +-
include/linux/mm_types.h | 5 +
include/linux/swap.h | 18 +-
include/linux/swapops.h | 67 ++
kernel/fork.c | 2 +
kernel/memremap.c | 34 +-
mm/Kconfig | 38 +
mm/Makefile | 1 +
mm/balloon_compaction.c | 2 +-
mm/hmm.c | 1231 ++++++++++++++++++++++++++++
mm/memory.c | 66 +-
mm/memory_hotplug.c | 14 +-
mm/migrate.c | 786 +++++++++++++++++-
mm/mprotect.c | 12 +
mm/page_vma_mapped.c | 10 +
mm/rmap.c | 25 +
mm/zsmalloc.c | 12 +-
42 files changed, 3119 insertions(+), 84 deletions(-)
create mode 100644 include/linux/hmm.h
create mode 100644 mm/hmm.c
--
2.4.11
When hotplugging memory we want more information on the type of memory and
its properties. Replace the device boolean flag with an int and define a set
of flags.
A new property for device memory is an opt-in flag to allow page migration
from and to a ZONE_DEVICE. Existing users of ZONE_DEVICE are not expecting
page migration to work for their pages. New changes to page migration are
changing that, and we now need a flag to explicitly opt in to page migration.
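As a rough illustration (not part of the patch; the function name is made
up, and the hotplug locking and error handling around the call are
omitted), a ZONE_DEVICE user would now hotplug its memory like this,
while regular hotplug simply passes MEMORY_NORMAL:

	#include <linux/memory_hotplug.h>

	static int example_hotplug_device_memory(int nid, u64 start, u64 size)
	{
		/* Device memory that also opts in to struct page migration. */
		return arch_add_memory(nid, start, size,
				       MEMORY_DEVICE | MEMORY_DEVICE_ALLOW_MIGRATE);
	}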
Changes since v2:
- pr_err() in case of hotplug failure
Changes since v1:
- Improved commit message
- Improved define name
- Improved comments
- Typos
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Russell King <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/ia64/mm/init.c | 23 ++++++++++++++++++++---
arch/powerpc/mm/mem.c | 23 +++++++++++++++++++----
arch/s390/mm/init.c | 10 ++++++++--
arch/sh/mm/init.c | 22 +++++++++++++++++++---
arch/tile/mm/init.c | 10 ++++++++--
arch/x86/mm/init_32.c | 23 ++++++++++++++++++++---
arch/x86/mm/init_64.c | 23 ++++++++++++++++++++---
include/linux/memory_hotplug.h | 24 ++++++++++++++++++++++--
include/linux/memremap.h | 13 +++++++++++++
kernel/memremap.c | 5 +++--
mm/memory_hotplug.c | 4 ++--
11 files changed, 154 insertions(+), 26 deletions(-)
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 8f3efa6..1dbe5a5 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -646,18 +646,27 @@ mem_init (void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
pg_data_t *pgdat;
struct zone *zone;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotplug unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
pgdat = NODE_DATA(nid);
zone = pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+ zone_for_memory(nid, start, size, ZONE_NORMAL,
+ flags & MEMORY_DEVICE);
ret = __add_pages(nid, zone, start_pfn, nr_pages);
if (ret)
@@ -668,13 +677,21 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
}
#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
struct zone *zone;
int ret;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotremove unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
zone = page_zone(pfn_to_page(start_pfn));
ret = __remove_pages(zone, start_pfn, nr_pages);
if (ret)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 9ee536e..4669c056 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,16 +126,23 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
return -ENODEV;
}
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
struct pglist_data *pgdata;
struct zone *zone;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;
- resize_hpt_for_hotplug(memblock_phys_mem_size());
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotplug unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+ resize_hpt_for_hotplug(memblock_phys_mem_size());
pgdata = NODE_DATA(nid);
start = (unsigned long)__va(start);
@@ -149,19 +156,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
/* this should work for most non-highmem platforms */
zone = pgdata->node_zones +
- zone_for_memory(nid, start, size, 0, for_device);
+ zone_for_memory(nid, start, size, 0, flags & MEMORY_DEVICE);
return __add_pages(nid, zone, start_pfn, nr_pages);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
struct zone *zone;
int ret;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotremove unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
zone = page_zone(pfn_to_page(start_pfn));
ret = __remove_pages(zone, start_pfn, nr_pages);
if (ret)
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ee506671..b858303 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -161,7 +161,7 @@ unsigned long memory_block_size_bytes(void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
unsigned long start_pfn = PFN_DOWN(start);
@@ -170,6 +170,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
struct zone *zone;
int rc, i;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags) {
+ pr_err("hotplug unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
@@ -204,7 +210,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
}
#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
{
/*
* There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 7549186..30a239f 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,19 +485,27 @@ void free_initrd_mem(unsigned long start, unsigned long end)
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
pg_data_t *pgdat;
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotplug unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
pgdat = NODE_DATA(nid);
/* We only have ZONE_NORMAL, so this is easy.. */
ret = __add_pages(nid, pgdat->node_zones +
zone_for_memory(nid, start, size, ZONE_NORMAL,
- for_device),
+ flags & MEMORY_DEVICE),
start_pfn, nr_pages);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
@@ -516,13 +524,21 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
#endif
#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
struct zone *zone;
int ret;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotremove unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
zone = page_zone(pfn_to_page(start_pfn));
ret = __remove_pages(zone, start_pfn, nr_pages);
if (unlikely(ret))
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index 3a97e4d..eed98e2 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -863,13 +863,19 @@ void __init mem_init(void)
* memory to the highmem for now.
*/
#ifndef CONFIG_NEED_MULTIPLE_NODES
-int arch_add_memory(u64 start, u64 size, bool for_device)
+int arch_add_memory(u64 start, u64 size, int flags)
{
struct pglist_data *pgdata = &contig_page_data;
struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags) {
+ pr_err("hotplug unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
return __add_pages(zone, start_pfn, nr_pages);
}
@@ -879,7 +885,7 @@ int remove_memory(u64 start, u64 size)
}
#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
{
/* TODO */
return -EBUSY;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 2b4b53e..d7d7f9a 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -816,24 +816,41 @@ void __init mem_init(void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
struct pglist_data *pgdata = NODE_DATA(nid);
struct zone *zone = pgdata->node_zones +
- zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
+ zone_for_memory(nid, start, size, ZONE_HIGHMEM,
+ flags & MEMORY_DEVICE);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotplug unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
return __add_pages(nid, zone, start_pfn, nr_pages);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
struct zone *zone;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotremove unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
zone = page_zone(pfn_to_page(start_pfn));
return __remove_pages(zone, start_pfn, nr_pages);
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 15173d3..0098dc9 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -641,15 +641,24 @@ static void update_end_of_memory_vars(u64 start, u64 size)
* Memory is added always to NORMAL zone. This means you will never get
* additional DMA/DMA32 memory.
*/
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+ zone_for_memory(nid, start, size, ZONE_NORMAL,
+ flags & MEMORY_DEVICE);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotplug unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
init_memory_mapping(start, start + size);
ret = __add_pages(nid, zone, start_pfn, nr_pages);
@@ -946,8 +955,10 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
remove_pagetable(start, end, true);
}
-int __ref arch_remove_memory(u64 start, u64 size)
+int __ref arch_remove_memory(u64 start, u64 size, int flags)
{
+ const int supported_flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
struct page *page = pfn_to_page(start_pfn);
@@ -955,6 +966,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
struct zone *zone;
int ret;
+ /* Each flag need special handling so error out on un-supported flag */
+ if (flags & (~supported_flags)) {
+ pr_err("hotremove unsupported memory type 0x%08x\n", flags);
+ return -EINVAL;
+ }
+
/* With altmap the first mapped page is offset from @start */
altmap = to_vmem_altmap((unsigned long) page);
if (altmap)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f6..30253da 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -104,7 +104,7 @@ extern bool memhp_auto_online;
#ifdef CONFIG_MEMORY_HOTREMOVE
extern bool is_pageblock_removable_nolock(struct page *page);
-extern int arch_remove_memory(u64 start, u64 size);
+extern int arch_remove_memory(u64 start, u64 size, int flags);
extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
#endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -276,7 +276,27 @@ extern int add_memory(int nid, u64 start, u64 size);
extern int add_memory_resource(int nid, struct resource *resource, bool online);
extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+
+/*
+ * When hotpluging memory with arch_add_memory() we want more informations on
+ * the type of memory and its properties. The flags parameter allow to provide
+ * more informations on the memory which is being addedd.
+ *
+ * Provide an opt-in flag for struct page migration. Persistent device memory
+ * never relied on struct page migration so far and new user of might also
+ * prefer avoiding struct page migration.
+ *
+ * New non device memory specific flags can be added if ever needed.
+ *
+ * MEMORY_NORMAL: regular system memory
+ * MEMORY_DEVICE: device memory, create a ZONE_DEVICE zone for it
+ * MEMORY_DEVICE_ALLOW_MIGRATE: pages in that device memory can be migrated
+ */
+#define MEMORY_NORMAL 0
+#define MEMORY_DEVICE (1 << 0)
+#define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+
+extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..29d2cca 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,18 +41,26 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
* @res: physical address range covered by @ref
* @ref: reference count that pins the devm_memremap_pages() mapping
* @dev: host device of the mapping for debug
+ * @flags: memory flags see MEMORY_* in memory_hotplug.h
*/
struct dev_pagemap {
struct vmem_altmap *altmap;
const struct resource *res;
struct percpu_ref *ref;
struct device *dev;
+ int flags;
};
#ifdef CONFIG_ZONE_DEVICE
void *devm_memremap_pages(struct device *dev, struct resource *res,
struct percpu_ref *ref, struct vmem_altmap *altmap);
struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+ return ((page_zonenum(page) == ZONE_DEVICE) &&
+ (page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
+}
#else
static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
@@ -71,6 +79,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
{
return NULL;
}
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+ return false;
+}
#endif
/**
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 0612323..40d4af8 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -249,7 +249,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
lock_device_hotplug();
mem_hotplug_begin();
- arch_remove_memory(align_start, align_size);
+ arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
mem_hotplug_done();
unlock_device_hotplug();
@@ -328,6 +328,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
}
pgmap->ref = ref;
pgmap->res = &page_map->res;
+ pgmap->flags = MEMORY_DEVICE;
mutex_lock(&pgmap_lock);
error = 0;
@@ -366,7 +367,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
lock_device_hotplug();
mem_hotplug_begin();
- error = arch_add_memory(nid, align_start, align_size, true);
+ error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
mem_hotplug_done();
unlock_device_hotplug();
if (error)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 295479b..46960b3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1381,7 +1381,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
}
/* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size, false);
+ ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);
if (ret < 0)
goto error;
@@ -2185,7 +2185,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
memblock_free(start, size);
memblock_remove(start, size);
- arch_remove_memory(start, size);
+ arch_remove_memory(start, size, MEMORY_NORMAL);
try_offline_node(nid);
--
2.4.11
Allow unmapping and restoring the special swap entries used for
un-addressable ZONE_DEVICE memory.
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
---
include/linux/migrate.h | 2 +
mm/migrate.c | 141 +++++++++++++++++++++++++++++++++++++-----------
mm/page_vma_mapped.c | 10 ++++
mm/rmap.c | 25 +++++++++
4 files changed, 147 insertions(+), 31 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 6c610ee..c43669b 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,6 +130,8 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
#define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
#define MIGRATE_PFN_LOCKED (1UL << (BITS_PER_LONG_LONG - 4))
#define MIGRATE_PFN_WRITE (1UL << (BITS_PER_LONG_LONG - 5))
+#define MIGRATE_PFN_DEVICE (1UL << (BITS_PER_LONG_LONG - 6))
+#define MIGRATE_PFN_ERROR (1UL << (BITS_PER_LONG_LONG - 7))
#define MIGRATE_PFN_MASK ((1UL << (BITS_PER_LONG_LONG - PAGE_SHIFT)) - 1)
static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5a14b4ec..9950245 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -41,6 +41,7 @@
#include <linux/page_idle.h>
#include <linux/page_owner.h>
#include <linux/sched/mm.h>
+#include <linux/memremap.h>
#include <asm/tlbflush.h>
@@ -230,7 +231,15 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
pte = arch_make_huge_pte(pte, vma, new, 0);
}
#endif
- flush_dcache_page(new);
+
+ if (unlikely(is_zone_device_page(new)) &&
+ !is_addressable_page(new)) {
+ entry = make_device_entry(new, pte_write(pte));
+ pte = swp_entry_to_pte(entry);
+ if (pte_swp_soft_dirty(*pvmw.pte))
+ pte = pte_mksoft_dirty(pte);
+ } else
+ flush_dcache_page(new);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
if (PageHuge(new)) {
@@ -302,6 +311,8 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
*/
if (!get_page_unless_zero(page))
goto out;
+ if (is_zone_device_page(page))
+ get_zone_device_page(page);
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);
@@ -2101,12 +2112,14 @@ static int migrate_vma_collect_hole(unsigned long start,
next = pmd_addr_end(addr, end);
npages = (next - addr) >> PAGE_SHIFT;
if (npages == (PMD_SIZE >> PAGE_SHIFT)) {
+ migrate->dst[migrate->npages] = 0;
migrate->src[migrate->npages++] = MIGRATE_PFN_HUGE;
ret = migrate_vma_array_full(migrate);
if (ret)
return ret;
} else {
for (i = 0; i < npages; ++i) {
+ migrate->dst[migrate->npages] = 0;
migrate->src[migrate->npages++] = 0;
ret = migrate_vma_array_full(migrate);
if (ret)
@@ -2148,17 +2161,44 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pte = *ptep;
pfn = pte_pfn(pte);
- if (!pte_present(pte)) {
+ if (pte_none(pte)) {
flags = pfn = 0;
goto next;
}
+ if (!pte_present(pte)) {
+ flags = pfn = 0;
+
+ /*
+ * Only care about unaddressable device page special
+ * page table entry. Other special swap entry are not
+ * migratable and we ignore regular swapped page.
+ */
+ entry = pte_to_swp_entry(pte);
+ if (!is_device_entry(entry))
+ goto next;
+
+ page = device_entry_to_page(entry);
+ if (!dev_page_allow_migrate(page))
+ goto next;
+
+ flags = MIGRATE_PFN_VALID |
+ MIGRATE_PFN_DEVICE |
+ MIGRATE_PFN_MIGRATE;
+ if (is_write_device_entry(entry))
+ flags |= MIGRATE_PFN_WRITE;
+ } else {
+ page = vm_normal_page(migrate->vma, addr, pte);
+ flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
+ flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+ }
+
/* FIXME support THP */
- page = vm_normal_page(migrate->vma, addr, pte);
if (!page || !page->mapping || PageTransCompound(page)) {
flags = pfn = 0;
goto next;
}
+ pfn = page_to_pfn(page);
/*
* By getting a reference on the page we pin it and that blocks
@@ -2171,8 +2211,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
*/
get_page(page);
migrate->cpages++;
- flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
- flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
/*
* Optimize for the common case where page is only mapped once
@@ -2203,6 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
next:
+ migrate->dst[migrate->npages] = 0;
migrate->src[migrate->npages++] = pfn | flags;
ret = migrate_vma_array_full(migrate);
if (ret) {
@@ -2277,6 +2316,13 @@ static bool migrate_vma_check_page(struct page *page)
if (PageCompound(page))
return false;
+ /* Page from ZONE_DEVICE have one extra reference */
+ if (is_zone_device_page(page)) {
+ if (!dev_page_allow_migrate(page))
+ return false;
+ extra++;
+ }
+
if ((page_count(page) - extra) > page_mapcount(page))
return false;
@@ -2316,28 +2362,31 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
migrate->src[i] |= MIGRATE_PFN_LOCKED;
}
- if (!PageLRU(page) && allow_drain) {
- /* Drain CPU's pagevec */
- lru_add_drain_all();
- allow_drain = false;
- }
+ /* ZONE_DEVICE page are not on LRU */
+ if (!is_zone_device_page(page)) {
+ if (!PageLRU(page) && allow_drain) {
+ /* Drain CPU's pagevec */
+ lru_add_drain_all();
+ allow_drain = false;
+ }
- if (isolate_lru_page(page)) {
- if (remap) {
- migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
- migrate->cpages--;
- restore++;
- } else {
- migrate->src[i] = 0;
- unlock_page(page);
- migrate->cpages--;
- put_page(page);
+ if (isolate_lru_page(page)) {
+ if (remap) {
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ migrate->cpages--;
+ restore++;
+ } else {
+ migrate->src[i] = 0;
+ unlock_page(page);
+ migrate->cpages--;
+ put_page(page);
+ }
+ continue;
}
- continue;
- }
- /* Drop the reference we took in collect */
- put_page(page);
+ /* Drop the reference we took in collect */
+ put_page(page);
+ }
if (!migrate_vma_check_page(page)) {
if (remap) {
@@ -2345,14 +2394,19 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
migrate->cpages--;
restore++;
- get_page(page);
- putback_lru_page(page);
+ if (!is_zone_device_page(page)) {
+ get_page(page);
+ putback_lru_page(page);
+ }
} else {
migrate->src[i] = 0;
unlock_page(page);
migrate->cpages--;
- putback_lru_page(page);
+ if (!is_zone_device_page(page))
+ putback_lru_page(page);
+ else
+ put_page(page);
}
}
}
@@ -2391,7 +2445,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
const unsigned long npages = migrate->npages;
const unsigned long start = migrate->start;
- for (i = 0; i < npages && migrate->cpages; addr += size, i++) {
+ for (addr = start, i = 0; i < npages; addr += size, i++) {
struct page *page = migrate_pfn_to_page(migrate->src[i]);
size = migrate_pfn_size(migrate->src[i]);
@@ -2419,7 +2473,10 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
unlock_page(page);
restore--;
- putback_lru_page(page);
+ if (is_zone_device_page(page))
+ put_page(page);
+ else
+ putback_lru_page(page);
}
}
@@ -2451,6 +2508,22 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
mapping = page_mapping(page);
+ if (is_zone_device_page(newpage)) {
+ if (!dev_page_allow_migrate(newpage)) {
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ continue;
+ }
+
+ /*
+ * For now only support private anonymous when migrating
+ * to un-addressable device memory.
+ */
+ if (mapping && !is_addressable_page(newpage)) {
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ continue;
+ }
+ }
+
r = migrate_page(mapping, newpage, page, MIGRATE_SYNC, false);
if (r != MIGRATEPAGE_SUCCESS)
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
@@ -2492,11 +2565,17 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
unlock_page(page);
migrate->cpages--;
- putback_lru_page(page);
+ if (is_zone_device_page(page))
+ put_page(page);
+ else
+ putback_lru_page(page);
if (newpage != page) {
unlock_page(newpage);
- putback_lru_page(newpage);
+ if (is_zone_device_page(newpage))
+ put_page(newpage);
+ else
+ putback_lru_page(newpage);
}
}
}
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index c4c9def..5730d23 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -48,6 +48,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
if (!is_swap_pte(*pvmw->pte))
return false;
entry = pte_to_swp_entry(*pvmw->pte);
+
if (!is_migration_entry(entry))
return false;
if (migration_entry_to_page(entry) - pvmw->page >=
@@ -60,6 +61,15 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
WARN_ON_ONCE(1);
#endif
} else {
+ if (is_swap_pte(*pvmw->pte)) {
+ swp_entry_t entry;
+
+ entry = pte_to_swp_entry(*pvmw->pte);
+ if (is_device_entry(entry) &&
+ device_entry_to_page(entry) == pvmw->page)
+ return true;
+ }
+
if (!pte_present(*pvmw->pte))
return false;
diff --git a/mm/rmap.c b/mm/rmap.c
index 49ed681..59c34d5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -63,6 +63,7 @@
#include <linux/hugetlb.h>
#include <linux/backing-dev.h>
#include <linux/page_idle.h>
+#include <linux/memremap.h>
#include <asm/tlbflush.h>
@@ -1315,6 +1316,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
return SWAP_AGAIN;
+ if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
+ is_zone_device_page(page) && !dev_page_allow_migrate(page))
+ return SWAP_AGAIN;
+
if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_address(vma, address,
flags & TTU_MIGRATION, page);
@@ -1350,6 +1355,26 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;
+ if (IS_ENABLED(CONFIG_MIGRATION) &&
+ (flags & TTU_MIGRATION) &&
+ is_zone_device_page(page)) {
+ swp_entry_t entry;
+ pte_t swp_pte;
+
+ pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+ /*
+ * Store the pfn of the page in a special migration
+ * pte. do_swap_page() will wait until the migration
+ * pte is removed and then restart fault handling.
+ */
+ entry = make_migration_entry(page, 0);
+ swp_pte = swp_entry_to_pte(entry);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
+ goto discard;
+ }
if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address,
--
2.4.11
This adds support for un-addressable device memory. Such memory is hotplugged
only so we can have struct page for it, but it should never be mapped, as such
memory can not be accessed by the CPU. For that reason it uses a special swap
entry for the CPU page table entry.
This patch implements all the logic, from the special swap type to handling a
CPU page fault through a callback specified in the ZONE_DEVICE pgmap struct.
Architectures that wish to support un-addressable device memory should make
sure to never populate the kernel linear mapping for the physical range.
This feature potentially breaks memory hotplug unless every driver using it
magically predicts the future addresses of where memory will be hotplugged.
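For reference, a condensed sketch of the special entry's life cycle; it
paraphrases what the code in this series does (the local variable names
such as ptep, fault_flags and is_writable are illustrative, not a new
API):

	/* Installing an un-addressable ZONE_DEVICE page into a CPU page
	 * table, as the migration code does once a page has been moved to
	 * device memory:
	 */
	swp_entry_t entry = make_device_entry(page, is_writable);
	pte_t pte = swp_entry_to_pte(entry);
	set_pte_at(vma->vm_mm, addr, ptep, pte);

	/* Later, when the CPU touches that address, do_swap_page() sees the
	 * special entry and asks the device driver (through the pgmap
	 * page_fault callback) to migrate the page back to CPU-accessible
	 * memory:
	 */
	if (is_device_entry(entry))
		ret = device_entry_fault(vma, addr, entry, fault_flags, pmdp);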
Changes since v2:
- Do not change devm_memremap_pages()
Changes since v1:
- Add unaddressable memory resource descriptor enum
- Explain why memory hotplug can fail because of un-addressable memory
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
---
fs/proc/task_mmu.c | 7 +++++
include/linux/ioport.h | 1 +
include/linux/memory_hotplug.h | 7 +++++
include/linux/memremap.h | 18 ++++++++++++
include/linux/swap.h | 18 ++++++++++--
include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++++++++++++++
kernel/memremap.c | 22 ++++++++++++--
mm/Kconfig | 12 ++++++++
mm/memory.c | 66 ++++++++++++++++++++++++++++++++++++++++-
mm/memory_hotplug.c | 10 +++++--
mm/mprotect.c | 12 ++++++++
11 files changed, 232 insertions(+), 8 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f08bd31..d2dea5c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -538,6 +538,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
}
} else if (is_migration_entry(swpent))
page = migration_entry_to_page(swpent);
+ else if (is_device_entry(swpent))
+ page = device_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = find_get_entry(vma->vm_file->f_mapping,
@@ -700,6 +702,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
if (is_migration_entry(swpent))
page = migration_entry_to_page(swpent);
+ else if (is_device_entry(swpent))
+ page = device_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1183,6 +1187,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
flags |= PM_SWAP;
if (is_migration_entry(entry))
page = migration_entry_to_page(entry);
+
+ if (is_device_entry(entry))
+ page = device_entry_to_page(entry);
}
if (page && !PageAnon(page))
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6230064..d154a18 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -130,6 +130,7 @@ enum {
IORES_DESC_ACPI_NV_STORAGE = 3,
IORES_DESC_PERSISTENT_MEMORY = 4,
IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
+ IORES_DESC_UNADDRESSABLE_MEMORY = 6,
};
/* helpers to define resources */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 30253da..69aabab 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -286,15 +286,22 @@ extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
* never relied on struct page migration so far and new user of might also
* prefer avoiding struct page migration.
*
+ * For device memory (which use ZONE_DEVICE) we want differentiate between CPU
+ * accessible memory (persitent memory, device memory on an architecture with a
+ * system bus that allow transparent access to device memory) and unaddressable
+ * memory (device memory that can not be accessed by CPU directly).
+ *
* New non device memory specific flags can be added if ever needed.
*
* MEMORY_NORMAL: regular system memory
* MEMORY_DEVICE: device memory, create a ZONE_DEVICE zone for it
* MEMORY_DEVICE_ALLOW_MIGRATE: pages in that device memory can be migrated
+ * MEMORY_DEVICE_UNADDRESSABLE: un-addressable memory (CPU can not access it)
*/
#define MEMORY_NORMAL 0
#define MEMORY_DEVICE (1 << 0)
#define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+#define MEMORY_DEVICE_UNADDRESSABLE (1 << 2)
extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3e04f58..0ae7548 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,10 +35,16 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
}
#endif
+typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
+ unsigned long addr,
+ struct page *page,
+ unsigned flags,
+ pmd_t *pmdp);
typedef void (*dev_page_free_t)(struct page *page, void *data);
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_fault: callback when CPU fault on an un-addressable device page
* @page_free: free page callback when page refcount reach 1
* @altmap: pre-allocated/reserved memory for vmemmap allocations
* @res: physical address range covered by @ref
@@ -48,6 +54,7 @@ typedef void (*dev_page_free_t)(struct page *page, void *data);
* @flags: memory flags see MEMORY_* in memory_hotplug.h
*/
struct dev_pagemap {
+ dev_page_fault_t page_fault;
dev_page_free_t page_free;
struct vmem_altmap *altmap;
const struct resource *res;
@@ -67,6 +74,12 @@ static inline bool dev_page_allow_migrate(const struct page *page)
return ((page_zonenum(page) == ZONE_DEVICE) &&
(page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
}
+
+static inline bool is_addressable_page(const struct page *page)
+{
+ return ((page_zonenum(page) != ZONE_DEVICE) ||
+ !(page->pgmap->flags & MEMORY_DEVICE_UNADDRESSABLE));
+}
#else
static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
@@ -90,6 +103,11 @@ static inline bool dev_page_allow_migrate(const struct page *page)
{
return false;
}
+
+static inline bool is_addressable_page(const struct page *page)
+{
+ return true;
+}
#endif
/**
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd..ba564bc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -51,6 +51,17 @@ static inline int current_is_kswapd(void)
*/
/*
+ * Un-addressable device memory support
+ */
+#ifdef CONFIG_DEVICE_UNADDRESSABLE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
* NUMA node memory migration support
*/
#ifdef CONFIG_MIGRATION
@@ -72,7 +83,8 @@ static inline int current_is_kswapd(void)
#endif
#define MAX_SWAPFILES \
- ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+ SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
/*
* Magic header for a swap area. The first part of the union is
@@ -435,8 +447,8 @@ static inline void show_swap_cache_info(void)
{
}
-#define free_swap_and_cache(swp) is_migration_entry(swp)
-#define swapcache_prepare(swp) is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
{
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..0e339f0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
}
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+ return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+ int type = swp_type(entry);
+ return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+ *entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+ return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+ return pfn_to_page(swp_offset(entry));
+}
+
+int device_entry_fault(struct vm_area_struct *vma,
+ unsigned long addr,
+ swp_entry_t entry,
+ unsigned flags,
+ pmd_t *pmdp);
+#else /* CONFIG_DEVICE_UNADDRESSABLE */
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+ return swp_entry(0, 0);
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+ return false;
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+ return false;
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+ return NULL;
+}
+
+static inline int device_entry_fault(struct vm_area_struct *vma,
+ unsigned long addr,
+ swp_entry_t entry,
+ unsigned flags,
+ pmd_t *pmdp)
+{
+ return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
#ifdef CONFIG_MIGRATION
static inline swp_entry_t make_migration_entry(struct page *page, int write)
{
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 19df1f5..d42f039f 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
#ifndef ioremap_cache
/* temporary while we convert existing ioremap_cache users to memremap */
@@ -203,6 +205,21 @@ void put_zone_device_page(struct page *page)
}
EXPORT_SYMBOL(put_zone_device_page);
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+int device_entry_fault(struct vm_area_struct *vma,
+ unsigned long addr,
+ swp_entry_t entry,
+ unsigned flags,
+ pmd_t *pmdp)
+{
+ struct page *page = device_entry_to_page(entry);
+
+ BUG_ON(!page->pgmap->page_fault);
+ return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_entry_fault);
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
static void pgmap_radix_release(struct resource *res)
{
resource_size_t key, align_start, align_size, align_end;
@@ -258,7 +275,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
lock_device_hotplug();
mem_hotplug_begin();
- arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
+ arch_remove_memory(align_start, align_size, pgmap->flags);
mem_hotplug_done();
unlock_device_hotplug();
@@ -338,6 +355,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
pgmap->ref = ref;
pgmap->res = &page_map->res;
pgmap->flags = MEMORY_DEVICE;
+ pgmap->page_fault = NULL;
pgmap->page_free = NULL;
pgmap->data = NULL;
@@ -378,7 +396,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
lock_device_hotplug();
mem_hotplug_begin();
- error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
+ error = arch_add_memory(nid, align_start, align_size, pgmap->flags);
mem_hotplug_done();
unlock_device_hotplug();
if (error)
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb..9502315 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,18 @@ config ZONE_DEVICE
If FS_DAX is enabled, then say Y.
+config DEVICE_UNADDRESSABLE
+ bool "Un-addressable device memory (GPU memory, ...)"
+ depends on ZONE_DEVICE
+
+ help
+ Allow to create struct page for un-addressable device memory
+ ie memory that is only accessible by the device (or group of
+ devices).
+
+ Having struct page is necessary for process memory migration
+ to device memory.
+
config FRAME_VECTOR
bool
diff --git a/mm/memory.c b/mm/memory.c
index 235ba51..33aff303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -49,6 +49,7 @@
#include <linux/swap.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/memremap.h>
#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/export.h>
@@ -927,6 +928,25 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = pte_swp_mksoft_dirty(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
+ } else if (is_device_entry(entry)) {
+ page = device_entry_to_page(entry);
+
+ /*
+ * Update rss count even for un-addressable page as
+ * they should be consider just like any other page.
+ */
+ get_page(page);
+ rss[mm_counter(page)]++;
+ page_dup_rmap(page, false);
+
+ if (is_write_device_entry(entry) &&
+ is_cow_mapping(vm_flags)) {
+ make_device_entry_read(&entry);
+ pte = swp_entry_to_pte(entry);
+ if (pte_swp_soft_dirty(*src_pte))
+ pte = pte_swp_mksoft_dirty(pte);
+ set_pte_at(src_mm, addr, src_pte, pte);
+ }
}
goto out_set_pte;
}
@@ -1243,6 +1263,34 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}
continue;
}
+
+ /*
+ * Un-addressable page must always be check that are not like
+ * other swap entries and thus should be check no matter what
+ * details->check_swap_entries value is.
+ */
+ entry = pte_to_swp_entry(ptent);
+ if (non_swap_entry(entry) && is_device_entry(entry)) {
+ struct page *page = device_entry_to_page(entry);
+
+ if (unlikely(details && details->check_mapping)) {
+ /*
+ * unmap_shared_mapping_pages() wants to
+ * invalidate cache without truncating:
+ * unmap shared but keep private pages.
+ */
+ if (details->check_mapping !=
+ page_rmapping(page))
+ continue;
+ }
+
+ pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ rss[mm_counter(page)]--;
+ page_remove_rmap(page, false);
+ put_page(page);
+ continue;
+ }
+
/* If details->check_mapping, we leave swap entries. */
if (unlikely(details))
continue;
@@ -2690,6 +2738,14 @@ int do_swap_page(struct vm_fault *vmf)
if (is_migration_entry(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
vmf->address);
+ } else if (is_device_entry(entry)) {
+ /*
+ * For un-addressable device memory we call the pgmap
+ * fault handler callback. The callback must migrate
+ * the page back to some CPU accessible page.
+ */
+ ret = device_entry_fault(vma, vmf->address, entry,
+ vmf->flags, vmf->pmd);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else {
@@ -3679,6 +3735,7 @@ static int wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
static int handle_pte_fault(struct vm_fault *vmf)
{
pte_t entry;
+ struct page *page;
if (unlikely(pmd_none(*vmf->pmd))) {
/*
@@ -3729,9 +3786,16 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
+ /* Catch mapping of un-addressable memory this should never happen */
+ entry = vmf->orig_pte;
+ page = pfn_to_page(pte_pfn(entry));
+ if (!is_addressable_page(page)) {
+ print_bad_pte(vmf->vma, vmf->address, entry, page);
+ return VM_FAULT_SIGBUS;
+ }
+
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
- entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
if (vmf->flags & FAULT_FLAG_WRITE) {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 46960b3..4dcc003 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -152,7 +152,7 @@ void mem_hotplug_done(void)
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
- struct resource *res;
+ struct resource *res, *conflict;
res = kzalloc(sizeof(struct resource), GFP_KERNEL);
if (!res)
return ERR_PTR(-ENOMEM);
@@ -161,7 +161,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
res->start = start;
res->end = start + size - 1;
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
- if (request_resource(&iomem_resource, res) < 0) {
+ conflict = request_resource_conflict(&iomem_resource, res);
+ if (conflict) {
+ if (conflict->desc == IORES_DESC_UNADDRESSABLE_MEMORY) {
+ pr_debug("Device un-addressable memory block "
+ "memory hotplug at %#010llx !\n",
+ (unsigned long long)start);
+ }
pr_debug("System RAM resource %pR cannot be added\n", res);
kfree(res);
return ERR_PTR(-EEXIST);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d5..50ac297 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -126,6 +126,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
pages++;
}
+
+ if (is_write_device_entry(entry)) {
+ pte_t newpte;
+
+ make_device_entry_read(&entry);
+ newpte = swp_entry_to_pte(entry);
+ if (pte_swp_soft_dirty(oldpte))
+ newpte = pte_swp_mksoft_dirty(newpte);
+ set_pte_at(mm, addr, pte, newpte);
+
+ pages++;
+ }
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
--
2.4.11
This introduces a simple struct and associated helpers for device drivers
to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
finds an unused physical address range and triggers memory hotplug for it,
which allocates and initializes struct page for the device memory.
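A minimal usage sketch, under the assumption that hmm_devmem_add()
returns an ERR_PTR on failure; my_devmem_fault(), my_devmem_free(),
device_memory_size and pdev are driver-side placeholders, and error
handling is abbreviated:

	static const struct hmm_devmem_ops my_devmem_ops = {
		.free	= my_devmem_free,	/* page refcount dropped back to 1 */
		.fault	= my_devmem_fault,	/* CPU faulted on un-addressable page */
	};

	devmem = hmm_devmem_add(&my_devmem_ops, &pdev->dev, device_memory_size);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);
	/* struct pages now back the pfn range devmem->pfn_first..devmem->pfn_last */

	/* Before the device goes away, and before freeing the hmm_devmem: */
	hmm_devmem_remove(devmem);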
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
include/linux/hmm.h | 115 +++++++++++++++
mm/Kconfig | 7 +
mm/hmm.c | 398 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 520 insertions(+)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index c6d2cca..3054ce7 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,6 +79,11 @@
#if IS_ENABLED(CONFIG_HMM)
+#include <linux/migrate.h>
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
struct hmm;
/*
@@ -321,6 +326,116 @@ int hmm_vma_fault(struct vm_area_struct *vma,
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+ unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: call when refcount on page reach 1 and thus is no longer use
+ * @fault: call when there is a page fault to unaddressable memory
+ */
+struct hmm_devmem_ops {
+ void (*free)(struct hmm_devmem *devmem, struct page *page);
+ int (*fault)(struct hmm_devmem *devmem,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ struct page *page,
+ unsigned flags,
+ pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ *
+ * This an helper structure for device drivers that do not wish to implement
+ * the gory details related to hotplugging new memoy and allocating struct
+ * pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+ struct completion completion;
+ unsigned long pfn_first;
+ unsigned long pfn_last;
+ struct resource *resource;
+ struct device *device;
+ struct dev_pagemap pagemap;
+ const struct hmm_devmem_ops *ops;
+ struct percpu_ref ref;
+};
+
+/*
+ * To add (hotplug) device memory, HMM assumes that there is no real resource
+ * that reserves a range in the physical address space (this is intended to be
+ * use by unaddressable device memory). It will reserve a physical range big
+ * enough and allocate struct page for it.
+ *
+ * Device driver can wrap the hmm_devmem struct inside a private device driver
+ * struct. Device driver must call hmm_devmem_remove() before device goes away
+ * and before freeing the hmm_devmem struct memory.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+ struct device *device,
+ unsigned long size);
+void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+ struct vm_area_struct *vma,
+ const struct migrate_vma_ops *ops,
+ unsigned long mentry,
+ unsigned long *src,
+ unsigned long *dst,
+ unsigned long start,
+ unsigned long addr,
+ unsigned long end,
+ void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per-page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because page can not be on lru we have an unsigned long that driver can use
+ * to store a per page field. This just a simple helper to do that.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+ unsigned long data)
+{
+ unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+ drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+ unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+ return drvdata[1];
+}
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+
+
/* Below are for HMM internal use only! Not to be used by device driver! */
void hmm_mm_destroy(struct mm_struct *mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 8ae7600..a430d51 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -308,6 +308,13 @@ config HMM_MIRROR
range of virtual address. This require careful synchronization with
CPU page table update.
+config HMM_DEVMEM
+ bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
+ select HMM
+ help
+ HMM devmem are helpers to leverage new ZONE_DEVICE feature. This is
+ just to avoid device driver to replicate boiler plate code.
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
diff --git a/mm/hmm.c b/mm/hmm.c
index ad5d9b1..019f379 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,15 @@
#include <linux/swap.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
#include <linux/swapops.h>
#include <linux/hugetlb.h>
+#include <linux/memremap.h>
#include <linux/mmu_notifier.h>
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
/*
* struct hmm - HMM per mm struct
@@ -735,3 +740,396 @@ int hmm_vma_fault(struct vm_area_struct *vma,
}
EXPORT_SYMBOL(hmm_vma_fault);
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct page *page;
+
+ page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ if (!page)
+ return NULL;
+ lock_page(page);
+ return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_ref_release(struct percpu_ref *ref)
+{
+ struct hmm_devmem *devmem;
+
+ devmem = container_of(ref, struct hmm_devmem, ref);
+ complete(&devmem->completion);
+}
+
+static void hmm_devmem_ref_exit(void *data)
+{
+ struct percpu_ref *ref = data;
+ struct hmm_devmem *devmem;
+
+ devmem = container_of(ref, struct hmm_devmem, ref);
+ percpu_ref_exit(ref);
+ devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data);
+}
+
+static void hmm_devmem_ref_kill(void *data)
+{
+ struct percpu_ref *ref = data;
+ struct hmm_devmem *devmem;
+
+ devmem = container_of(ref, struct hmm_devmem, ref);
+ percpu_ref_kill(ref);
+ wait_for_completion(&devmem->completion);
+ devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+ unsigned long addr,
+ struct page *page,
+ unsigned flags,
+ pmd_t *pmdp)
+{
+ struct hmm_devmem *devmem = page->pgmap->data;
+
+ return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+ struct hmm_devmem *devmem = data;
+
+ devmem->ops->free(devmem, page);
+}
+
+static DEFINE_MUTEX(hmm_devmem_lock);
+static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL);
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
+static void hmm_devmem_radix_release(struct resource *resource)
+{
+ resource_size_t key, align_start, align_size, align_end;
+
+ align_start = resource->start & ~(SECTION_SIZE - 1);
+ align_size = ALIGN(resource_size(resource), SECTION_SIZE);
+ align_end = align_start + align_size - 1;
+
+ mutex_lock(&hmm_devmem_lock);
+ for (key = resource->start; key <= resource->end; key += SECTION_SIZE)
+ radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT);
+ mutex_unlock(&hmm_devmem_lock);
+}
+
+static void hmm_devmem_release(struct device *dev, void *data)
+{
+ struct hmm_devmem *devmem = data;
+ resource_size_t align_start, align_size;
+ struct resource *resource = devmem->resource;
+
+ if (percpu_ref_tryget_live(&devmem->ref)) {
+ dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+ percpu_ref_put(&devmem->ref);
+ }
+
+ /* pages are dead and unused, undo the arch mapping */
+ align_start = resource->start & ~(SECTION_SIZE - 1);
+ align_size = ALIGN(resource_size(resource), SECTION_SIZE);
+ arch_remove_memory(align_start, align_size, devmem->pagemap.flags);
+ untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+ hmm_devmem_radix_release(resource);
+}
+
+static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
+}
+
+static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
+{
+ resource_size_t key, align_start, align_size, align_end;
+ struct device *device = devmem->device;
+ pgprot_t pgprot = PAGE_KERNEL;
+ int ret, nid, is_ram;
+ unsigned long pfn;
+
+ align_start = devmem->resource->start & ~(SECTION_SIZE - 1);
+ align_size = ALIGN(devmem->resource->start +
+ resource_size(devmem->resource),
+ SECTION_SIZE) - align_start;
+
+ is_ram = region_intersects(align_start, align_size,
+ IORESOURCE_SYSTEM_RAM,
+ IORES_DESC_NONE);
+ if (is_ram == REGION_MIXED) {
+ WARN_ONCE(1, "%s attempted on mixed region %pr\n",
+ __func__, devmem->resource);
+ return -ENXIO;
+ }
+ if (is_ram == REGION_INTERSECTS)
+ return -ENXIO;
+
+ devmem->pagemap.flags = MEMORY_DEVICE |
+ MEMORY_DEVICE_ALLOW_MIGRATE |
+ MEMORY_DEVICE_UNADDRESSABLE;
+ devmem->pagemap.res = devmem->resource;
+ devmem->pagemap.page_fault = hmm_devmem_fault;
+ devmem->pagemap.page_free = hmm_devmem_free;
+ devmem->pagemap.dev = devmem->device;
+ devmem->pagemap.ref = &devmem->ref;
+ devmem->pagemap.data = devmem;
+
+ mutex_lock(&hmm_devmem_lock);
+ align_end = align_start + align_size - 1;
+ for (key = align_start; key <= align_end; key += SECTION_SIZE) {
+ struct hmm_devmem *dup;
+
+ rcu_read_lock();
+ dup = hmm_devmem_find(key);
+ rcu_read_unlock();
+ if (dup) {
+ dev_err(device, "%s: collides with mapping for %s\n",
+ __func__, dev_name(dup->device));
+ mutex_unlock(&hmm_devmem_lock);
+ ret = -EBUSY;
+ goto error;
+ }
+ ret = radix_tree_insert(&hmm_devmem_radix,
+ key >> PA_SECTION_SHIFT,
+ devmem);
+ if (ret) {
+ dev_err(device, "%s: failed: %d\n", __func__, ret);
+ mutex_unlock(&hmm_devmem_lock);
+ goto error_radix;
+ }
+ }
+ mutex_unlock(&hmm_devmem_lock);
+
+ nid = dev_to_node(device);
+ if (nid < 0)
+ nid = numa_mem_id();
+
+ ret = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start),
+ 0, align_size);
+ if (ret)
+ goto error_radix;
+
+ ret = arch_add_memory(nid, align_start, align_size,
+ devmem->pagemap.flags);
+ if (ret)
+ goto error_add_memory;
+
+ for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
+ struct page *page = pfn_to_page(pfn);
+
+ /*
+ * ZONE_DEVICE pages union ->lru with a ->pgmap back
+ * pointer. It is a bug if a ZONE_DEVICE page is ever
+ * freed or placed on a driver-private list. Seed the
+ * storage with LIST_POISON* values.
+ */
+ list_del(&page->lru);
+ page->pgmap = &devmem->pagemap;
+ }
+ return 0;
+
+error_add_memory:
+ untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+error_radix:
+ hmm_devmem_radix_release(devmem->resource);
+error:
+ return ret;
+}
+
+static int hmm_devmem_match(struct device *dev, void *data, void *match_data)
+{
+ struct hmm_devmem *devmem = data;
+
+ return devmem->resource == match_data;
+}
+
+static void hmm_devmem_pages_remove(struct hmm_devmem *devmem)
+{
+ devres_release(devmem->device, &hmm_devmem_release,
+ &hmm_devmem_match, devmem->resource);
+}
+
+/*
+ * hmm_devmem_add() - hotplug fake ZONE_DEVICE memory for device memory
+ *
+ * @ops: memory event device driver callback (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource to
+ * @size: size in bytes of the device memory to add
+ * Returns: pointer to new hmm_devmem struct, ERR_PTR otherwise
+ *
+ * This first finds an empty range of physical addresses big enough for the new
+ * resource and then hotplugs it as ZONE_DEVICE memory, allocating struct pages.
+ * It does not do anything beside that; all events affecting the memory will go
+ * through the various callbacks provided by the hmm_devmem_ops struct.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+ struct device *device,
+ unsigned long size)
+{
+ struct hmm_devmem *devmem;
+ resource_size_t addr;
+ int ret;
+
+ devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+ GFP_KERNEL, dev_to_node(device));
+ if (!devmem)
+ return ERR_PTR(-ENOMEM);
+
+ init_completion(&devmem->completion);
+ devmem->pfn_first = -1UL;
+ devmem->pfn_last = -1UL;
+ devmem->resource = NULL;
+ devmem->device = device;
+ devmem->ops = ops;
+
+ ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+ 0, GFP_KERNEL);
+ if (ret)
+ goto error_percpu_ref;
+
+ ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+ if (ret)
+ goto error_devm_add_action;
+
+ size = ALIGN(size, SECTION_SIZE);
+ addr = (iomem_resource.end + 1ULL) - size;
+
+ /*
+ * FIXME add a new helper to quickly walk resource tree and find free
+ * range
+ *
+ * FIXME what about ioport_resource resource ?
+ */
+ for (; addr > size && addr >= iomem_resource.start; addr -= size) {
+ ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+ if (ret != REGION_DISJOINT)
+ continue;
+
+ devmem->resource = devm_request_mem_region(device, addr, size,
+ dev_name(device));
+ if (!devmem->resource) {
+ ret = -ENOMEM;
+ goto error_no_resource;
+ }
+ devmem->resource->desc = IORES_DESC_UNADDRESSABLE_MEMORY;
+ break;
+ }
+ if (!devmem->resource) {
+ ret = -ERANGE;
+ goto error_no_resource;
+ }
+
+ devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+ devmem->pfn_last = devmem->pfn_first +
+ (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+ ret = hmm_devmem_pages_create(devmem);
+ if (ret)
+ goto error_pages;
+
+ devres_add(device, devmem);
+
+ ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+ if (ret) {
+ hmm_devmem_remove(devmem);
+ return ERR_PTR(ret);
+ }
+
+ return devmem;
+
+error_pages:
+ devm_release_mem_region(device, devmem->resource->start,
+ resource_size(devmem->resource));
+error_no_resource:
+error_devm_add_action:
+ hmm_devmem_ref_kill(&devmem->ref);
+ hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+ devres_free(devmem);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ *
+ * This will hot-unplug memory that was hotplugged by hmm_devmem_add() on behalf
+ * of the device driver. It will free the struct pages and remove the resource
+ * that reserves the physical address range for this device memory.
+ */
+void hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+ resource_size_t start, size;
+ struct device *device;
+
+ if (!devmem)
+ return;
+
+ device = devmem->device;
+ start = devmem->resource->start;
+ size = resource_size(devmem->resource);
+
+ hmm_devmem_ref_kill(&devmem->ref);
+ hmm_devmem_ref_exit(&devmem->ref);
+ hmm_devmem_pages_remove(devmem);
+
+ devm_release_mem_region(device, start, size);
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * hmm_devmem_fault_range() - migrate back a virtual range of memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @vma: virtual memory area containing the range to be migrated
+ * @ops: migration callback for allocating destination memory and copying
+ * @mentry: maximum number of entries in src or dst array
+ * @src: array of unsigned long containing source pfns
+ * @dst: array of unsigned long containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @addr: fault address (must be inside the range)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callbacks
+ * Returns: 0 on success, VM_FAULT_SIGBUS on error
+ *
+ * This is a wrapper around migrate_vma() which checks the migration status
+ * for a given fault address and returns the corresponding page fault handler
+ * status, ie 0 on success or VM_FAULT_SIGBUS if migration failed for the fault
+ * address.
+ *
+ * This is a helper intended to be used by the ZONE_DEVICE fault handler.
+ */
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+ struct vm_area_struct *vma,
+ const struct migrate_vma_ops *ops,
+ unsigned long mentry,
+ unsigned long *src,
+ unsigned long *dst,
+ unsigned long start,
+ unsigned long addr,
+ unsigned long end,
+ void *private)
+{
+ unsigned long i, size, tmp;
+
+ if (migrate_vma(ops, vma, mentry, start, end, src, dst, private))
+ return VM_FAULT_SIGBUS;
+
+ for (i = 0, tmp = start; tmp < addr; i++, tmp += size)
+ size = migrate_pfn_size(src[i]);
+ if (dst[i] & MIGRATE_PFN_ERROR)
+ return VM_FAULT_SIGBUS;
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_devmem_fault_range);
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
--
2.4.11
This handles page faults on behalf of a device driver; unlike handle_mm_fault()
it does not trigger migration of device memory back to system memory.
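As an illustration only, here is a minimal sketch of the expected call pattern
from a device driver fault path. The dummy_* names are hypothetical and the
actual device page table update is device specific:

    static int dummy_drv_fault(struct dummy_mirror *dmirror,
                               struct vm_area_struct *vma,
                               unsigned long start, unsigned long end,
                               hmm_pfn_t *pfns, bool write)
    {
        struct hmm_range range;
        int ret;

        down_read(&vma->vm_mm->mmap_sem);
        /* block == true, so hmm_vma_fault() never drops the mmap_sem */
        ret = hmm_vma_fault(vma, &range, start, end, pfns, write, true);
        if (!ret) {
            dummy_device_page_table_lock(dmirror);
            /* Stop tracking and check the snapshot is still valid */
            if (hmm_vma_range_done(vma, &range))
                dummy_device_update_page_table(dmirror, start, end, pfns);
            dummy_device_page_table_unlock(dmirror);
        }
        up_read(&vma->vm_mm->mmap_sem);
        return ret;
    }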
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
include/linux/hmm.h | 27 ++++++
mm/hmm.c | 269 ++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 268 insertions(+), 28 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 6e89da4..c6d2cca 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -291,6 +291,33 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
unsigned long end,
hmm_pfn_t *pfns);
bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * be updated with the fault result and current snapshot of the CPU page table
+ * for the range.
+ *
+ * The mmap_sem must be taken in read mode before entering and it might be
+ * dropped by the function if block argument is false. In that case, the
+ * function returns -EAGAIN.
+ *
+ * Return value does not reflect if the fault was successful for every single
+ * address or not. Therefore, the caller must inspect the hmm_pfn_t array to
+ * determine fault status for each address.
+ *
+ * Trying to fault inside an invalid vma will result in -EINVAL.
+ *
+ * See function description in mm/hmm.c for further documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+ struct hmm_range *range,
+ unsigned long start,
+ unsigned long end,
+ hmm_pfn_t *pfns,
+ bool write,
+ bool block);
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
diff --git a/mm/hmm.c b/mm/hmm.c
index 9b52d36..ad5d9b1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -288,6 +288,15 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
}
EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_error(hmm_pfn_t *pfns,
+ unsigned long addr,
+ unsigned long end)
+{
+ for (; addr < end; addr += PAGE_SIZE, pfns++)
+ *pfns = HMM_PFN_ERROR;
+}
+
static void hmm_pfns_empty(hmm_pfn_t *pfns,
unsigned long addr,
unsigned long end)
@@ -304,10 +313,43 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
*pfns = HMM_PFN_SPECIAL;
}
-static void hmm_vma_walk(struct vm_area_struct *vma,
- unsigned long start,
- unsigned long end,
- hmm_pfn_t *pfns)
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+ unsigned long addr,
+ unsigned long end)
+{
+ unsigned long npfns = (end - addr) >> PAGE_SHIFT;
+
+ memset(pfns, 0, sizeof(*pfns) * npfns);
+}
+
+static int hmm_vma_do_fault(struct vm_area_struct *vma,
+ const hmm_pfn_t fault,
+ unsigned long addr,
+ hmm_pfn_t *pfn,
+ bool block)
+{
+ unsigned flags = FAULT_FLAG_REMOTE;
+ int r;
+
+ flags |= block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+ flags |= (fault & HMM_PFN_WRITE) ? FAULT_FLAG_WRITE : 0;
+ r = handle_mm_fault(vma, addr, flags);
+ if (r & VM_FAULT_RETRY)
+ return -EAGAIN;
+ if (r & VM_FAULT_ERROR) {
+ *pfn = HMM_PFN_ERROR;
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+static int hmm_vma_walk(struct vm_area_struct *vma,
+ const hmm_pfn_t fault,
+ unsigned long start,
+ unsigned long end,
+ hmm_pfn_t *pfns,
+ bool block)
{
unsigned long addr, next;
hmm_pfn_t flag;
@@ -321,6 +363,7 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
pmd_t *pmdp;
pte_t *ptep;
pmd_t pmd;
+ int ret;
/*
* We are accessing/faulting for a device from an unknown
@@ -331,15 +374,37 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
pgdp = pgd_offset(vma->vm_mm, addr);
if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
- hmm_pfns_empty(&pfns[i], addr, next);
- continue;
+ if (!(vma->vm_flags & VM_READ)) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+ if (!fault) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+ pudp = pud_alloc(vma->vm_mm, pgdp, addr);
+ if (!pudp) {
+ hmm_pfns_error(&pfns[i], addr, next);
+ continue;
+ }
}
next = pud_addr_end(addr, end);
pudp = pud_offset(pgdp, addr);
if (pud_none(*pudp) || pud_bad(*pudp)) {
- hmm_pfns_empty(&pfns[i], addr, next);
- continue;
+ if (!(vma->vm_flags & VM_READ)) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+ if (!fault) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+ pmdp = pmd_alloc(vma->vm_mm, pudp, addr);
+ if (!pmdp) {
+ hmm_pfns_error(&pfns[i], addr, next);
+ continue;
+ }
}
next = pmd_addr_end(addr, end);
@@ -347,8 +412,24 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
pmd = pmd_read_atomic(pmdp);
barrier();
if (pmd_none(pmd) || pmd_bad(pmd)) {
- hmm_pfns_empty(&pfns[i], addr, next);
- continue;
+ if (!(vma->vm_flags & VM_READ)) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+ if (!fault) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+ /*
+ * Use pte_alloc() instead of pte_alloc_map, because we
+ * can't run pte_offset_map on the pmd, if a huge pmd
+ * could materialize from under us.
+ */
+ if (unlikely(pte_alloc(vma->vm_mm, pmdp, addr))) {
+ hmm_pfns_error(&pfns[i], addr, next);
+ continue;
+ }
+ pmd = *pmdp;
}
if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
@@ -356,10 +437,14 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
if (pmd_protnone(pmd)) {
hmm_pfns_clear(&pfns[i], addr, next);
+ if (fault)
+ goto fault;
continue;
}
flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+ if ((flags & fault) != fault)
+ goto fault;
for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
continue;
@@ -370,41 +455,63 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
swp_entry_t entry;
pte_t pte = *ptep;
- pfns[i] = 0;
-
if (pte_none(pte)) {
+ if (fault) {
+ pte_unmap(ptep);
+ goto fault;
+ }
pfns[i] = HMM_PFN_EMPTY;
continue;
}
entry = pte_to_swp_entry(pte);
if (!pte_present(pte) && !non_swap_entry(entry)) {
+ if (fault) {
+ pte_unmap(ptep);
+ goto fault;
+ }
+ pfns[i] = 0;
continue;
}
if (pte_present(pte)) {
pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
- continue;
- }
-
- /*
- * This is a special swap entry, ignore migration, use
- * device and report anything else as error.
- */
- if (is_device_entry(entry)) {
+ } else if (is_device_entry(entry)) {
+ /* Do not fault device entry */
pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
if (is_write_device_entry(entry))
pfns[i] |= HMM_PFN_WRITE;
pfns[i] |= HMM_PFN_DEVICE;
pfns[i] |= HMM_PFN_UNADDRESSABLE;
pfns[i] |= flag;
- } else if (!is_migration_entry(entry)) {
+ } else if (is_migration_entry(entry) && fault) {
+ migration_entry_wait(vma->vm_mm, pmdp, addr);
+ /* Start again for current address */
+ next = addr;
+ ptep++;
+ break;
+ } else {
+ /* Report error for everything else */
pfns[i] = HMM_PFN_ERROR;
}
+ if ((fault & pfns[i]) != fault) {
+ pte_unmap(ptep);
+ goto fault;
+ }
}
pte_unmap(ptep - 1);
+ continue;
+
+fault:
+ ret = hmm_vma_do_fault(vma, fault, addr, &pfns[i], block);
+ if (ret)
+ return ret;
+ /* Start again for current address */
+ next = addr;
}
+
+ return 0;
}
/*
@@ -463,7 +570,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
list_add_rcu(&range->list, &hmm->ranges);
spin_unlock(&hmm->lock);
- hmm_vma_walk(vma, start, end, pfns);
+ hmm_vma_walk(vma, 0, start, end, pfns, false);
return 0;
}
EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -474,14 +581,22 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
* @range: range being track
* Returns: false if range data have been invalidated, true otherwise
*
- * Range struct is use to track update to CPU page table after call to
- * hmm_vma_get_pfns(). Once device driver is done using or want to lock update
- * to data it gots from this function it calls hmm_vma_range_done() which stop
- * the tracking.
+ * The range struct is used to track CPU page table updates after a call to
+ * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
+ * using, or wants to lock updates to, the data it got from those functions it
+ * must call hmm_vma_range_done() which stops tracking CPU page table updates.
+ *
+ * Note that the device driver must still implement general CPU page table
+ * update tracking either by using hmm_mirror (see hmm_mirror_register()) or by
+ * using the mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * is to be used while trying to duplicate CPU page table content for a range
+ * of virtual addresses.
*
* There is 2 way to use this :
* again:
- * hmm_vma_get_pfns(vma, range, start, end, pfns);
+ * hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
* trans = device_build_page_table_update_transaction(pfns);
* device_page_table_lock();
* if (!hmm_vma_range_done(vma, range)) {
@@ -492,7 +607,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
* device_page_table_unlock();
*
* Or:
- * hmm_vma_get_pfns(vma, range, start, end, pfns);
+ * hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
* device_page_table_lock();
* hmm_vma_range_done(vma, range);
* device_update_page_table(pfns);
@@ -521,4 +636,102 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
return range->valid;
}
EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track pfns array content validity
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 on success, error otherwise (-EAGAIN means mmap_sem has been dropped)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * On error, for one virtual address in the range, the function will set the
+ * hmm_pfn_t error flag for the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ * down_read(&mm->mmap_sem);
+ * // Find vma and address device wants to fault, initialize hmm_pfn_t
+ * // array accordingly
+ * ret = hmm_vma_fault(vma, range, start, end, pfns, write, block);
+ * switch (ret) {
+ * case -EAGAIN:
+ * hmm_vma_range_done(vma, range);
+ * // You might want to rate limit or yield to play nicely, you may
+ * // also commit any valid pfn in the array assuming that you are
+ * // getting true from hmm_vma_range_done()
+ * goto retry;
+ * case 0:
+ * break;
+ * default:
+ * // Handle error !
+ * up_read(&mm->mmap_sem)
+ * return;
+ * }
+ * // Take device driver lock that serialize device page table update
+ * driver_lock_device_page_table_update();
+ * hmm_vma_range_done(vma, range);
+ * // Commit pfns we got from hmm_vma_fault()
+ * driver_unlock_device_page_table_update();
+ * up_read(&mm->mmap_sem)
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURNS SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+ struct hmm_range *range,
+ unsigned long start,
+ unsigned long end,
+ hmm_pfn_t *pfns,
+ bool write,
+ bool block)
+{
+ hmm_pfn_t fault = HMM_PFN_READ | (write ? HMM_PFN_WRITE : 0);
+ struct hmm *hmm;
+ int ret;
+
+ /* Sanity check, this really should not happen ! */
+ if (start < vma->vm_start || start >= vma->vm_end)
+ return -EINVAL;
+ if (end < vma->vm_start || end > vma->vm_end)
+ return -EINVAL;
+
+ hmm = hmm_register(vma->vm_mm);
+ if (!hmm) {
+ hmm_pfns_clear(pfns, start, end);
+ return -ENOMEM;
+ }
+ /* Caller must have registered a mirror using hmm_mirror_register() */
+ if (!hmm->mmu_notifier.ops)
+ return -EINVAL;
+
+ /* Initialize range to track CPU page table update */
+ range->start = start;
+ range->pfns = pfns;
+ range->end = end;
+ spin_lock(&hmm->lock);
+ range->valid = true;
+ list_add_rcu(&range->list, &hmm->ranges);
+ spin_unlock(&hmm->lock);
+
+ /* FIXME support hugetlb fs */
+ if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+ hmm_pfns_special(pfns, start, end);
+ return 0;
+ }
+
+ ret = hmm_vma_walk(vma, fault, start, end, pfns, block);
+ if (ret)
+ hmm_vma_range_done(vma, range);
+ return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
--
2.4.11
This is heterogeneous memory management (HMM) process address space
mirroring. In a nutshell it provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page tables
synchronized (we assume that both device and CPU are cache coherent, as
PCIe devices can be).
This patch provides a simple API for device drivers to achieve address
space mirroring, thus avoiding each device driver having to grow its own CPU
page table walker and its own CPU page table synchronization mechanism.
This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.
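For illustration, a device driver's update() callback registered through
hmm_mirror_ops might look like the sketch below; the dummy_* names are
hypothetical and the invalidation mechanism is entirely device specific:

    static void dummy_mirror_update(struct hmm_mirror *mirror,
                                    enum hmm_update action,
                                    unsigned long start,
                                    unsigned long end)
    {
        struct dummy_mirror *dmirror;

        dmirror = container_of(mirror, struct dummy_mirror, mirror);
        switch (action) {
        case HMM_UPDATE_INVALIDATE:
            /* Must not return before the device stopped using the range */
            dummy_device_invalidate_range(dmirror->ddev, start, end);
            break;
        }
    }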
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
include/linux/hmm.h | 101 ++++++++++++++++++++++++++++
mm/Kconfig | 15 +++++
mm/hmm.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 302 insertions(+)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 9fb6767..e64f92c 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,6 +79,7 @@
#if IS_ENABLED(CONFIG_HMM)
+struct hmm;
/*
* hmm_pfn_t - HMM use its own pfn type to keep several flags per page
@@ -141,6 +142,106 @@ static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
}
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize the device page table with the CPU page table?
+ *
+ * A device driver must always synchronize with CPU page table updates; for this
+ * it can either directly use the mmu_notifier API or it can use the hmm_mirror
+ * API. A device driver can decide to register one mirror per device per process
+ * or just one mirror per process for a group of devices. The pattern is:
+ *
+ * int device_bind_address_space(..., struct mm_struct *mm, ...)
+ * {
+ * struct device_address_space *das;
+ * int ret;
+ * // Device driver specific initialization, and allocation of das
+ * // which contain an hmm_mirror struct as one of its field.
+ * ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
+ * if (ret) {
+ * // Cleanup on error
+ * return ret;
+ * }
+ * // Other device driver specific initialization
+ * }
+ *
+ * The device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(); the expected usage is to do that when
+ * the device driver is unbinding from an address space.
+ *
+ * void device_unbind_address_space(struct device_address_space *das)
+ * {
+ * // Device driver specific cleanup
+ * hmm_mirror_unregister(&das->mirror);
+ * // Other device driver specific cleanup and now das can be free
+ * }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver will
+ * get callbacks through the update() operation (see the hmm_mirror_ops struct).
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update {
+ HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+ /* update() - update virtual address range of memory
+ *
+ * @mirror: pointer to struct hmm_mirror
+ * @update: update's type (turn read only, unmap, ...)
+ * @start: virtual start address of the range to update
+ * @end: virtual end address of the range to update
+ *
+ * This callback is called when the CPU page table is updated; the device
+ * driver must update its device page table according to the update's action.
+ *
+ * The device driver callback must wait until the device has fully updated
+ * its view of the range. Note we plan to make this asynchronous in
+ * later patches, so that multiple devices can schedule updates to their
+ * page tables, and once all devices have scheduled the update we then
+ * wait for them to propagate.
+ */
+ void (*update)(struct hmm_mirror *mirror,
+ enum hmm_update action,
+ unsigned long start,
+ unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * hmm_mirror struct with HMM. HMM will track the list of all mirrors for each
+ * mm_struct (or each process).
+ */
+struct hmm_mirror {
+ struct hmm *hmm;
+ const struct hmm_mirror_ops *ops;
+ struct list_head list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+ struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
/* Below are for HMM internal use only! Not to be used by device driver! */
void hmm_mm_destroy(struct mm_struct *mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index fe8ad24..8ae7600 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -293,6 +293,21 @@ config HMM
bool
depends on MMU
+config HMM_MIRROR
+ bool "HMM mirror CPU page table into a device page table"
+ select HMM
+ select MMU_NOTIFIER
+ help
+ HMM mirror is a set of helpers to mirror a CPU page table into a device
+ page table. There are two sides. The first is keeping both page tables
+ synchronized so that no virtual address can point to different pages
+ (though one page table might lag, ie one might still point to a page
+ while the other already points to nothing).
+
+ The second side of the equation is replicating CPU page table content for
+ a range of virtual addresses. This requires careful synchronization with
+ CPU page table updates.
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
diff --git a/mm/hmm.c b/mm/hmm.c
index ed3a847..6a2d299 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,14 +21,27 @@
#include <linux/hmm.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/mmu_notifier.h>
/*
* struct hmm - HMM per mm struct
*
* @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting mirrors list
+ * @mirrors: list of mirrors for this mm
+ * @wait_queue: wait queue
+ * @sequence: we track updates to the CPU page table with a sequence number
+ * @mmu_notifier: mmu notifier to track updates to CPU page table
+ * @notifier_count: number of currently active notifiers
*/
struct hmm {
struct mm_struct *mm;
+ spinlock_t lock;
+ struct list_head mirrors;
+ atomic_t sequence;
+ wait_queue_head_t wait_queue;
+ struct mmu_notifier mmu_notifier;
+ atomic_t notifier_count;
};
/*
@@ -47,6 +60,12 @@ static struct hmm *hmm_register(struct mm_struct *mm)
hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
if (!hmm)
return NULL;
+ init_waitqueue_head(&hmm->wait_queue);
+ atomic_set(&hmm->notifier_count, 0);
+ INIT_LIST_HEAD(&hmm->mirrors);
+ atomic_set(&hmm->sequence, 0);
+ hmm->mmu_notifier.ops = NULL;
+ spin_lock_init(&hmm->lock);
hmm->mm = mm;
spin_lock(&mm->page_table_lock);
@@ -79,3 +98,170 @@ void hmm_mm_destroy(struct mm_struct *mm)
spin_unlock(&mm->page_table_lock);
kfree(hmm);
}
+
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+ enum hmm_update action,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm_mirror *mirror;
+
+ /*
+ * A mirror being added or removed is a rare event, so list traversal isn't
+ * protected by a lock; we rely on simple rules. All list modifications
+ * are done using list_add_rcu() and list_del_rcu() under a spinlock to
+ * protect from concurrent addition or removal but not traversal.
+ *
+ * Because hmm_mirror_unregister() waits for all running invalidations to
+ * complete (and thus all list traversals to finish), none of the mirror
+ * structs can be freed from under us while traversing the list and thus
+ * it is safe to dereference their list pointer even if they were just
+ * removed.
+ */
+ list_for_each_entry (mirror, &hmm->mirrors, list)
+ mirror->ops->update(mirror, action, start, end);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long addr)
+{
+ unsigned long start = addr & PAGE_MASK;
+ unsigned long end = start + PAGE_SIZE;
+ struct hmm *hmm = mm->hmm;
+
+ VM_BUG_ON(!hmm);
+
+ atomic_inc(&hmm->notifier_count);
+ atomic_inc(&hmm->sequence);
+ hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+ atomic_dec(&hmm->notifier_count);
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm *hmm = mm->hmm;
+
+ VM_BUG_ON(!hmm);
+
+ atomic_inc(&hmm->notifier_count);
+ atomic_inc(&hmm->sequence);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm *hmm = mm->hmm;
+
+ VM_BUG_ON(!hmm);
+
+ hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+
+ /* Reverse order here because we are getting out of invalidation */
+ atomic_dec(&hmm->notifier_count);
+ wake_up(&hmm->wait_queue);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+ .invalidate_page = hmm_invalidate_page,
+ .invalidate_range_start = hmm_invalidate_range_start,
+ .invalidate_range_end = hmm_invalidate_range_end,
+};
+
+static int hmm_mirror_do_register(struct hmm_mirror *mirror,
+ struct mm_struct *mm,
+ const bool locked)
+{
+ /* Sanity check */
+ if (!mm || !mirror || !mirror->ops)
+ return -EINVAL;
+
+ mirror->hmm = hmm_register(mm);
+ if (!mirror->hmm)
+ return -ENOMEM;
+
+ /* Register mmu_notifier if not already, use mmap_sem for locking */
+ if (!mirror->hmm->mmu_notifier.ops) {
+ struct hmm *hmm = mirror->hmm;
+
+ if (!locked)
+ down_write(&mm->mmap_sem);
+ if (!hmm->mmu_notifier.ops) {
+ hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+ if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+ hmm->mmu_notifier.ops = NULL;
+ up_write(&mm->mmap_sem);
+ return -ENOMEM;
+ }
+ }
+ if (!locked)
+ up_write(&mm->mmap_sem);
+ }
+
+ spin_lock(&mirror->hmm->lock);
+ list_add_rcu(&mirror->list, &mirror->hmm->mirrors);
+ spin_unlock(&mirror->hmm->lock);
+
+ return 0;
+}
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space, the device driver must register
+ * an HMM mirror struct.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+ return hmm_mirror_do_register(mirror, mm, false);
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_register_locked() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * Same as hmm_mirror_register() except that mmap_sem must be held for writing.
+ */
+int hmm_mirror_register_locked(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+ return hmm_mirror_do_register(mirror, mm, true);
+}
+EXPORT_SYMBOL(hmm_mirror_register_locked);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space, and cleanup.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+ struct hmm *hmm = mirror->hmm;
+
+ spin_lock(&hmm->lock);
+ list_del_rcu(&mirror->list);
+ spin_unlock(&hmm->lock);
+
+ /*
+ * Wait for all active notifiers so that it is safe to traverse the
+ * mirror list without holding any locks.
+ */
+ wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
--
2.4.11
This introduces a dummy HMM device class so a device driver can use it to
create an hmm_device for the sole purpose of registering device memory.
It is useful to device drivers that want to manage multiple physical device
memories under the same struct device umbrella.
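For illustration, a driver using the fake device might do something along the
lines of the sketch below; dummy_devmem_ops and the surrounding error handling
are hypothetical:

    struct hmm_device *hmm_device;
    struct hmm_devmem *devmem;

    hmm_device = hmm_device_new(drvdata);
    if (IS_ERR(hmm_device))
        return PTR_ERR(hmm_device);

    /* Register one chunk of device memory against the fake device */
    devmem = hmm_devmem_add(&dummy_devmem_ops, &hmm_device->device, size);
    if (IS_ERR(devmem)) {
        hmm_device_put(hmm_device);
        return PTR_ERR(devmem);
    }

    /* ... and at teardown time ... */
    hmm_devmem_remove(devmem);
    hmm_device_put(hmm_device);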
Changed since v1:
- Improve commit message
- Add drvdata parameter to set on struct device
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
include/linux/hmm.h | 22 +++++++++++-
mm/hmm.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 117 insertions(+), 1 deletion(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 3054ce7..e4e6b36 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,11 +79,11 @@
#if IS_ENABLED(CONFIG_HMM)
+#include <linux/device.h>
#include <linux/migrate.h>
#include <linux/memremap.h>
#include <linux/completion.h>
-
struct hmm;
/*
@@ -433,6 +433,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
return drvdata[1];
}
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+ struct device device;
+ unsigned minor;
+};
+
+/*
+ * A device driver that wants to handle multiple devices' memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper and
+ * it is not needed to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void *drvdata);
+void hmm_device_put(struct hmm_device *hmm_device);
#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
diff --git a/mm/hmm.c b/mm/hmm.c
index 019f379..c477bd1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -24,6 +24,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/mmzone.h>
+#include <linux/module.h>
#include <linux/pagemap.h>
#include <linux/swapops.h>
#include <linux/hugetlb.h>
@@ -1132,4 +1133,99 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
return 0;
}
EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * A device driver that wants to handle multiple devices' memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper
+ * and it is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+ struct hmm_device *hmm_device;
+
+ hmm_device = container_of(device, struct hmm_device, device);
+ spin_lock(&hmm_device_lock);
+ clear_bit(hmm_device->minor, hmm_device_mask);
+ spin_unlock(&hmm_device_lock);
+
+ kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void *drvdata)
+{
+ struct hmm_device *hmm_device;
+ int ret;
+
+ hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+ if (!hmm_device)
+ return ERR_PTR(-ENOMEM);
+
+ ret = alloc_chrdev_region(&hmm_device->device.devt, 0, 1, "hmm_device");
+ if (ret < 0) {
+ kfree(hmm_device);
+ return ERR_PTR(ret);
+ }
+
+ spin_lock(&hmm_device_lock);
+ hmm_device->minor = find_first_zero_bit(hmm_device_mask, HMM_DEVICE_MAX);
+ if (hmm_device->minor >= HMM_DEVICE_MAX) {
+ spin_unlock(&hmm_device_lock);
+ kfree(hmm_device);
+ return ERR_PTR(-EBUSY);
+ }
+ set_bit(hmm_device->minor, hmm_device_mask);
+ spin_unlock(&hmm_device_lock);
+
+ dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+ hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+ hmm_device->minor);
+ hmm_device->device.release = hmm_device_release;
+ dev_set_drvdata(&hmm_device->device, drvdata);
+ hmm_device->device.class = hmm_device_class;
+ device_initialize(&hmm_device->device);
+
+ return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+ put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+ int ret;
+
+ ret = alloc_chrdev_region(&hmm_device_devt, 0,
+ HMM_DEVICE_MAX,
+ "hmm_device");
+ if (ret)
+ return ret;
+
+ hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+ if (IS_ERR(hmm_device_class)) {
+ unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+ return PTR_ERR(hmm_device_class);
+ }
+ return 0;
+}
+
+static void __exit hmm_exit(void)
+{
+ unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+ class_destroy(hmm_device_class);
+}
+
+module_init(hmm_init);
+module_exit(hmm_exit);
+MODULE_LICENSE("GPL");
#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
--
2.4.11
This allows the caller of migrate_vma() to allocate a new page for an empty
CPU page table entry. It only supports anonymous memory and it won't allow
a new page to be instantiated if userfaultfd is armed.
This is useful to device drivers that want to migrate a range of virtual
addresses and would rather allocate new memory than have to fault later on.
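As an illustration only, the interesting part of an alloc_and_copy() callback
that also populates holes might look like the sketch below. The callback
arguments, the dummy_* helpers and the helper used to encode the destination
pfn are assumptions made for the sake of the example, not taken from this
patch:

    for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
        struct page *dpage;

        /*
         * src[i] == 0 means the CPU page table entry was empty; with this
         * patch the driver may still provide a destination page and the
         * CPU page table will be updated to point to it.
         */
        dpage = dummy_alloc_device_page(devmem, private);
        if (!dpage) {
            dst[i] = 0;     /* skip this address */
            continue;
        }
        lock_page(dpage);
        dst[i] = dummy_encode_migrate_pfn(dpage) | MIGRATE_PFN_LOCKED;
    }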
Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/migrate.h | 6 +-
mm/migrate.c | 156 +++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 138 insertions(+), 24 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index c43669b..01f4945 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -157,7 +157,11 @@ static inline unsigned long migrate_pfn_size(unsigned long mpfn)
* allocator for destination memory.
*
* Note that in alloc_and_copy device driver can decide not to migrate some of
- * the entry by simply setting corresponding dst entry 0.
+ * the entries by simply setting the corresponding dst entry to 0. The driver
+ * can also try to allocate memory for an empty source entry by setting a valid
+ * dst entry. If the CPU page table is not populated while the alloc_and_copy()
+ * callback is taking place then the CPU page table will be updated to point to
+ * the newly allocated memory.
*
* Destination page must locked and MIGRATE_PFN_LOCKED set in the corresponding
* entry of dstarray. It is expected that page allocated will have an elevated
diff --git a/mm/migrate.c b/mm/migrate.c
index 9950245..b03158c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -42,6 +42,7 @@
#include <linux/page_owner.h>
#include <linux/sched/mm.h>
#include <linux/memremap.h>
+#include <linux/userfaultfd_k.h>
#include <asm/tlbflush.h>
@@ -2103,29 +2104,17 @@ static int migrate_vma_collect_hole(unsigned long start,
struct mm_walk *walk)
{
struct migrate_vma *migrate = walk->private;
- unsigned long addr, next;
+ unsigned long addr;
- for (addr = start & PAGE_MASK; addr < end; addr = next) {
- unsigned long npages, i;
+ for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
int ret;
- next = pmd_addr_end(addr, end);
- npages = (next - addr) >> PAGE_SHIFT;
- if (npages == (PMD_SIZE >> PAGE_SHIFT)) {
- migrate->dst[migrate->npages] = 0;
- migrate->src[migrate->npages++] = MIGRATE_PFN_HUGE;
- ret = migrate_vma_array_full(migrate);
- if (ret)
- return ret;
- } else {
- for (i = 0; i < npages; ++i) {
- migrate->dst[migrate->npages] = 0;
- migrate->src[migrate->npages++] = 0;
- ret = migrate_vma_array_full(migrate);
- if (ret)
- return ret;
- }
- }
+ migrate->cpages++;
+ migrate->dst[migrate->npages] = 0;
+ migrate->src[migrate->npages++] = 0;
+ ret = migrate_vma_array_full(migrate);
+ if (ret)
+ return ret;
}
return 0;
@@ -2162,6 +2151,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pfn = pte_pfn(pte);
if (pte_none(pte)) {
+ migrate->cpages++;
flags = pfn = 0;
goto next;
}
@@ -2480,6 +2470,114 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
}
}
+static void migrate_vma_insert_page(struct migrate_vma *migrate,
+ unsigned long addr,
+ struct page *page,
+ unsigned long *src,
+ unsigned long *dst)
+{
+ struct vm_area_struct *vma = migrate->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mem_cgroup *memcg;
+ spinlock_t *ptl;
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ptep;
+ pte_t entry;
+
+ if ((*dst & MIGRATE_PFN_HUGE) || (*src & MIGRATE_PFN_HUGE))
+ goto abort;
+
+ /* Only allow to populate anonymous memory */
+ if (!vma_is_anonymous(vma))
+ goto abort;
+
+ pgdp = pgd_offset(mm, addr);
+ pudp = pud_alloc(mm, pgdp, addr);
+ if (!pudp)
+ goto abort;
+ pmdp = pmd_alloc(mm, pudp, addr);
+ if (!pmdp)
+ goto abort;
+
+ if (pmd_trans_unstable(pmdp) || pmd_devmap(*pmdp))
+ goto abort;
+
+ /*
+ * Use pte_alloc() instead of pte_alloc_map(). We can't run
+ * pte_offset_map() on pmds where a huge pmd might be created
+ * from a different thread.
+ *
+ * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+ * parallel threads are excluded by other means.
+ *
+ * Here we only have down_read(mmap_sem).
+ */
+ if (pte_alloc(mm, pmdp, addr))
+ goto abort;
+
+ /* See the comment in pte_alloc_one_map() */
+ if (unlikely(pmd_trans_unstable(pmdp)))
+ goto abort;
+
+ if (unlikely(anon_vma_prepare(vma)))
+ goto abort;
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+ goto abort;
+
+ /*
+ * The memory barrier inside __SetPageUptodate makes sure that
+ * preceding stores to the page contents become visible before
+ * the set_pte_at() write.
+ */
+ __SetPageUptodate(page);
+
+ if (is_zone_device_page(page) && !is_addressable_page(page)) {
+ swp_entry_t swp_entry;
+
+ swp_entry = make_device_entry(page, vma->vm_flags & VM_WRITE);
+ entry = swp_entry_to_pte(swp_entry);
+ } else {
+ entry = mk_pte(page, vma->vm_page_prot);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
+ }
+
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!pte_none(*ptep)) {
+ pte_unmap_unlock(ptep, ptl);
+ mem_cgroup_cancel_charge(page, memcg, false);
+ goto abort;
+ }
+
+ /* Check for userfaultfd but do not deliver the fault, just back off */
+ if (userfaultfd_missing(vma)) {
+ pte_unmap_unlock(ptep, ptl);
+ mem_cgroup_cancel_charge(page, memcg, false);
+ goto abort;
+ }
+
+ inc_mm_counter(mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, addr, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
+ if (!is_zone_device_page(page))
+ lru_cache_add_active_or_unevictable(page, vma);
+ set_pte_at(mm, addr, ptep, entry);
+
+ /* Take a reference on the page */
+ get_page(page);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, ptep);
+ pte_unmap_unlock(ptep, ptl);
+ *src = MIGRATE_PFN_MIGRATE;
+ return;
+
+abort:
+ *src &= ~MIGRATE_PFN_MIGRATE;
+}
+
/*
* migrate_vma_pages() - migrate meta-data from src page to dst page
* @migrate: migrate struct containing all migration information
@@ -2501,10 +2599,16 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
size = migrate_pfn_size(migrate->src[i]);
- if (!page || !newpage)
+ if (!newpage) {
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
continue;
- if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+ } else if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) {
+ if (!page)
+ migrate_vma_insert_page(migrate, addr, newpage,
+ &migrate->src[i],
+ &migrate->dst[i]);
continue;
+ }
mapping = page_mapping(page);
@@ -2551,8 +2655,14 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
struct page *page = migrate_pfn_to_page(migrate->src[i]);
size = migrate_pfn_size(migrate->src[i]);
- if (!page)
+ if (!page) {
+ if (newpage) {
+ unlock_page(newpage);
+ put_page(newpage);
+ }
continue;
+ }
+
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
if (newpage) {
unlock_page(newpage);
--
2.4.11
HMM provides 3 separate types of functionality:
- Mirroring: synchronize CPU page table and device page table
- Device memory: allocating struct page for device memory
- Migration: migrating regular memory to device memory
This patch introduces some common helpers and definitions shared by all
three of those functionalities.
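As a quick illustration of the hmm_pfn_t helpers introduced below (the
surrounding driver code, ddev and addr, is imaginary):

    hmm_pfn_t entry;
    struct page *page;

    /* Encode a pfn together with its write permission flag */
    entry = hmm_pfn_from_pfn(pfn) | HMM_PFN_WRITE;

    /* Decode it back; hmm_pfn_to_page() returns NULL if HMM_PFN_VALID is unset */
    page = hmm_pfn_to_page(entry);
    if (page && (entry & HMM_PFN_WRITE))
        dummy_device_map_page_writable(ddev, addr, page);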
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
MAINTAINERS | 7 +++
include/linux/hmm.h | 153 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 5 ++
kernel/fork.c | 2 +
mm/Kconfig | 4 ++
mm/Makefile | 1 +
mm/hmm.c | 81 +++++++++++++++++++++++++
7 files changed, 253 insertions(+)
create mode 100644 include/linux/hmm.h
create mode 100644 mm/hmm.c
diff --git a/MAINTAINERS b/MAINTAINERS
index c776906..af37f7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5964,6 +5964,13 @@ S: Supported
F: drivers/scsi/hisi_sas/
F: Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
+HMM - Heterogeneous Memory Management
+M: Jérôme Glisse <[email protected]>
+L: [email protected]
+S: Maintained
+F: mm/hmm*
+F: include/linux/hmm*
+
HOST AP DRIVER
M: Jouni Malinen <[email protected]>
L: [email protected]
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..9fb6767
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,153 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/*
+ * HMM provides 3 separate types of functionality:
+ * - Mirroring: synchronize CPU page table and device page table
+ * - Device memory: allocating struct pages for device memory
+ * - Migration: migrating regular memory to device memory
+ *
+ * Each can be used independently from the others.
+ *
+ *
+ * Mirroring:
+ *
+ * HMM provides helpers to mirror a process address space on a device. For this,
+ * it provides several helpers to order device page table updates with respect
+ * to CPU page table updates. The requirement is that for any given virtual
+ * address the CPU and device page table cannot point to different physical
+ * pages. It uses the mmu_notifier API behind the scenes.
+ *
+ * Device memory:
+ *
+ * HMM provides helpers to help leverage device memory. Device memory is, at any
+ * given time, either CPU-addressable like regular memory, or completely
+ * unaddressable. In both cases the device memory is associated with dedicated
+ * struct pages (which are allocated as if for hotplugged memory). Device memory
+ * management is under the responsibility of the device driver. HMM only
+ * allocates and initializes the struct pages associated with the device memory,
+ * by hotplugging a ZONE_DEVICE memory range.
+ *
+ * Allocating struct pages for device memory allows us to use device memory
+ * almost like regular CPU memory. Unlike regular memory, however, it cannot be
+ * added to the lru, nor can any memory allocation use device memory
+ * directly. Device memory will only end up in use by a process if the device
+ * driver migrates some of the process memory from regular memory to device
+ * memory.
+ *
+ * Migration:
+ *
+ * The existing memory migration mechanism (mm/migrate.c) does not allow using
+ * anything other than the CPU to copy from source to destination memory.
+ * Moreover, existing code does not provide a way to migrate based on a virtual
+ * address range. Existing code only supports struct-page-based migration. Also,
+ * the migration flow does not allow for graceful failure at intermediate stages
+ * of the migration process.
+ *
+ * HMM solves all of the above, by providing a simple API:
+ *
+ * hmm_vma_migrate(ops, vma, src_pfns, dst_pfns, start, end, private);
+ *
+ * This API relies on two device driver callbacks: alloc_and_copy() and
+ * finalize_and_map(). The first, alloc_and_copy(), allocates the destination
+ * memory and initializes it using source memory. Migration can fail at this
+ * point, and the device driver then has a place to abort the migration. The
+ * finalize_and_map() callback allows the device driver to know which pages
+ * were successfully migrated and which were not.
+ *
+ * This can easily be used outside of the original HMM use case.
+ *
+ *
+ * This header file contains all the APIs related to hmm_vma_migrate. Additional
+ * detailed documentation may be found below.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table has the write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+/*
+ * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: hmm_pfn_t to convert to struct page
+ * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ *
+ * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ */
+static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
+{
+ if (!(pfn & HMM_PFN_VALID))
+ return NULL;
+ return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_to_pfn() - return pfn value stored in a hmm_pfn_t
+ * @pfn: hmm_pfn_t to extract pfn from
+ * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ */
+static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
+{
+ if (!(pfn & HMM_PFN_VALID))
+ return -1UL;
+ return (pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_from_page() - create a valid hmm_pfn_t value from struct page
+ * @page: struct page pointer for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the page
+ */
+static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
+{
+ return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+/*
+ * hmm_pfn_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * @pfn: pfn value for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the pfn
+ */
+static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
+{
+ return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f60f45f..81068ce 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
struct address_space;
struct mem_cgroup;
+struct hmm;
/*
* Each physical page in the system has a struct page associated with
@@ -495,6 +496,10 @@ struct mm_struct {
atomic_long_t hugetlb_usage;
#endif
struct work_struct async_put_work;
+#if IS_ENABLED(CONFIG_HMM)
+ /* HMM needs to track a few things per mm */
+ struct hmm *hmm;
+#endif
};
extern struct mm_struct init_mm;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80..1f8d612 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -37,6 +37,7 @@
#include <linux/binfmts.h>
#include <linux/mman.h>
#include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmacache.h>
@@ -863,6 +864,7 @@ void __mmdrop(struct mm_struct *mm)
BUG_ON(mm == &init_mm);
mm_free_pgd(mm);
destroy_context(mm);
+ hmm_mm_destroy(mm);
mmu_notifier_mm_destroy(mm);
check_mm(mm);
put_user_ns(mm->user_ns);
diff --git a/mm/Kconfig b/mm/Kconfig
index 9502315..fe8ad24 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,10 @@ config MIGRATION
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
+config HMM
+ bool
+ depends on MMU
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..9eb4121 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_HMM) += hmm.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..ed3a847
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,81 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/*
+ * Refer to include/linux/hmm.h for information about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+ struct mm_struct *mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be used directly by device drivers. It allocates an
+ * HMM struct if mm does not have one, and initializes it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+ if (!mm->hmm) {
+ struct hmm *hmm = NULL;
+
+ hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+ if (!hmm)
+ return NULL;
+ hmm->mm = mm;
+
+ spin_lock(&mm->page_table_lock);
+ if (!mm->hmm)
+ mm->hmm = hmm;
+ else
+ kfree(hmm);
+ spin_unlock(&mm->page_table_lock);
+ }
+
+ /*
+ * The hmm struct can only be freed once the mm_struct goes away,
+ * hence we should always have pre-allocated a new hmm struct
+ * above.
+ */
+ return mm->hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+ struct hmm *hmm;
+
+ /*
+ * We should not need to lock here as no one should be able to register
+ * a new HMM while an mm is being destroyed. But just to be safe ...
+ */
+ spin_lock(&mm->page_table_lock);
+ hmm = mm->hmm;
+ mm->hmm = NULL;
+ spin_unlock(&mm->page_table_lock);
+ kfree(hmm);
+}
--
2.4.11
When a ZONE_DEVICE page refcount reaches 1 it means the page is free and
nobody is holding a reference on it (only the device to which the memory
belongs does). Add a callback and call it when that happens so device
drivers can implement their own free page management.
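For illustration, a driver callback could look roughly like the sketch
below. The mydev_* names are hypothetical; only the dev_page_free_t
signature and the struct dev_pagemap fields come from this patch, and how
a driver gets at its pgmap to set page_free/data is left to later patches
in the series.

  /* Hypothetical driver-side free callback. */
  static void mydev_page_free(struct page *page, void *data)
  {
      struct mydev_memory *mem = data;    /* hypothetical driver state */

      /* Refcount is back to 1: the page is free, return it to the
       * driver's own allocator. */
      mydev_recycle_page(mem, page);      /* hypothetical */
  }

      /* Somewhere in driver setup, once the pgmap is available: */
      pgmap->page_free = mydev_page_free;
      pgmap->data = mem;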
Changes since v2:
- Move page refcount in put_zone_device_page()
Changes since v1:
- Do not update devm_memremap_pages() to take extra argument
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
---
include/linux/memremap.h | 6 ++++++
kernel/memremap.c | 11 ++++++++++-
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 29d2cca..3e04f58 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,19 +35,25 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
}
#endif
+typedef void (*dev_page_free_t)(struct page *page, void *data);
+
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_free: free page callback when page refcount reaches 1
* @altmap: pre-allocated/reserved memory for vmemmap allocations
* @res: physical address range covered by @ref
* @ref: reference count that pins the devm_memremap_pages() mapping
* @dev: host device of the mapping for debug
+ * @data: private data pointer passed to page_free
* @flags: memory flags see MEMORY_* in memory_hotplug.h
*/
struct dev_pagemap {
+ dev_page_free_t page_free;
struct vmem_altmap *altmap;
const struct resource *res;
struct percpu_ref *ref;
struct device *dev;
+ void *data;
int flags;
};
diff --git a/kernel/memremap.c b/kernel/memremap.c
index c821946..19df1f5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,7 +190,14 @@ EXPORT_SYMBOL(get_zone_device_page);
void put_zone_device_page(struct page *page)
{
- page_ref_dec(page);
+ int count = page_ref_dec_return(page);
+
+ /*
+ * If the refcount is 1 then the page is free and the refcount is stable
+ * as nobody else holds a reference on the page.
+ */
+ if (page->pgmap->page_free && count == 1)
+ page->pgmap->page_free(page, page->pgmap->data);
put_dev_pagemap(page->pgmap);
}
@@ -331,6 +338,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
pgmap->ref = ref;
pgmap->res = &page_map->res;
pgmap->flags = MEMORY_DEVICE;
+ pgmap->page_free = NULL;
+ pgmap->data = NULL;
mutex_lock(&pgmap_lock);
error = 0;
--
2.4.11
This does not use the existing page table walker because we want to share
the same code with our page fault handler.
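Driver-side usage, expanding the pseudocode documented in the new header
comment below, would look roughly like this (the device_* helpers and the
locking scheme are placeholders for whatever the driver actually uses):

      struct hmm_range range;
      hmm_pfn_t *pfns;    /* sized for (end - start) >> PAGE_SHIFT entries */
      int ret;

  again:
      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
      if (ret)
          return ret;
      /* Build the device page table update outside the device lock. */
      device_page_table_lock(dev);            /* hypothetical driver lock */
      if (!hmm_vma_range_done(vma, &range)) {
          /* CPU page table changed under us, snapshot again. */
          device_page_table_unlock(dev);
          goto again;
      }
      device_update_page_table(dev, start, end, pfns);    /* hypothetical */
      device_page_table_unlock(dev);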
Changes since v1:
- Use spinlock instead of rcu synchronized list traversal
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
include/linux/hmm.h | 56 +++++++++++-
mm/hmm.c | 257 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 311 insertions(+), 2 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e64f92c..6e89da4 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -86,13 +86,28 @@ struct hmm;
*
* Flags:
* HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ: read permission set
* HMM_PFN_WRITE: CPU page table have the write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
+ * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special, i.e. the
+ * result of vm_insert_pfn() or vm_insert_page(), and thus should not be
+ * mirrored by a device (the entry will never have HMM_PFN_VALID set and
+ * the pfn value is undefined)
+ * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
*/
typedef unsigned long hmm_pfn_t;
#define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_DEVICE (1 << 5)
+#define HMM_PFN_SPECIAL (1 << 6)
+#define HMM_PFN_UNADDRESSABLE (1 << 7)
+#define HMM_PFN_SHIFT 8
/*
* hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -239,6 +254,43 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
int hmm_mirror_register_locked(struct hmm_mirror *mirror,
struct mm_struct *mm);
void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are kept on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it was filled by an HMM function
+ */
+struct hmm_range {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ hmm_pfn_t *pfns;
+ bool valid;
+};
+
+/*
+ * To snapshot the CPU page table call hmm_vma_get_pfns(), then take the
+ * device driver lock that serializes device page table updates and call
+ * hmm_vma_range_done() to check if the snapshot is still valid. The device
+ * driver page table update lock must also be used in the HMM mirror update()
+ * callback so that CPU page table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR!
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID!
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+ struct hmm_range *range,
+ unsigned long start,
+ unsigned long end,
+ hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
diff --git a/mm/hmm.c b/mm/hmm.c
index 6a2d299..9b52d36 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,10 +19,15 @@
*/
#include <linux/mm.h>
#include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
#include <linux/mmu_notifier.h>
+
/*
* struct hmm - HMM per mm struct
*
@@ -37,6 +42,7 @@
struct hmm {
struct mm_struct *mm;
spinlock_t lock;
+ struct list_head ranges;
struct list_head mirrors;
atomic_t sequence;
wait_queue_head_t wait_queue;
@@ -65,6 +71,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
INIT_LIST_HEAD(&hmm->mirrors);
atomic_set(&hmm->sequence, 0);
hmm->mmu_notifier.ops = NULL;
+ INIT_LIST_HEAD(&hmm->ranges);
spin_lock_init(&hmm->lock);
hmm->mm = mm;
@@ -107,6 +114,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
unsigned long end)
{
struct hmm_mirror *mirror;
+ struct hmm_range *range;
+
+ spin_lock(&hmm->lock);
+ list_for_each_entry(range, &hmm->ranges, list) {
+ unsigned long addr, idx, npages;
+
+ if (end < range->start || start >= range->end)
+ continue;
+
+ range->valid = false;
+ addr = max(start, range->start);
+ idx = (addr - range->start) >> PAGE_SHIFT;
+ npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+ memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+ }
+ spin_unlock(&hmm->lock);
/*
* Mirror being added or removed is a rare event so list traversal isn't
@@ -264,4 +287,238 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
}
EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_empty(hmm_pfn_t *pfns,
+ unsigned long addr,
+ unsigned long end)
+{
+ for (; addr < end; addr += PAGE_SIZE, pfns++)
+ *pfns = HMM_PFN_EMPTY;
+}
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+ unsigned long addr,
+ unsigned long end)
+{
+ for (; addr < end; addr += PAGE_SIZE, pfns++)
+ *pfns = HMM_PFN_SPECIAL;
+}
+
+static void hmm_vma_walk(struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end,
+ hmm_pfn_t *pfns)
+{
+ unsigned long addr, next;
+ hmm_pfn_t flag;
+
+ flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+ for (addr = start; addr < end; addr = next) {
+ unsigned long i = (addr - start) >> PAGE_SHIFT;
+ pgd_t *pgdp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ptep;
+ pmd_t pmd;
+
+ /*
+ * We are accessing/faulting for a device from an unknown
+ * thread that might be foreign to the mm we are faulting
+ * against so do not call arch_vma_access_permitted() !
+ */
+
+ next = pgd_addr_end(addr, end);
+ pgdp = pgd_offset(vma->vm_mm, addr);
+ if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+
+ next = pud_addr_end(addr, end);
+ pudp = pud_offset(pgdp, addr);
+ if (pud_none(*pudp) || pud_bad(*pudp)) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+
+ next = pmd_addr_end(addr, end);
+ pmdp = pmd_offset(pudp, addr);
+ pmd = pmd_read_atomic(pmdp);
+ barrier();
+ if (pmd_none(pmd) || pmd_bad(pmd)) {
+ hmm_pfns_empty(&pfns[i], addr, next);
+ continue;
+ }
+ if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+ unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+ hmm_pfn_t flags = flag;
+
+ if (pmd_protnone(pmd)) {
+ hmm_pfns_clear(&pfns[i], addr, next);
+ continue;
+ }
+ flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
+ flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+ for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
+ pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
+ continue;
+ }
+
+ ptep = pte_offset_map(pmdp, addr);
+ for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+ swp_entry_t entry;
+ pte_t pte = *ptep;
+
+ pfns[i] = 0;
+
+ if (pte_none(pte)) {
+ pfns[i] = HMM_PFN_EMPTY;
+ continue;
+ }
+
+ entry = pte_to_swp_entry(pte);
+ if (!pte_present(pte) && !non_swap_entry(entry)) {
+ continue;
+ }
+
+ if (pte_present(pte)) {
+ pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
+ pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+ continue;
+ }
+
+ /*
+ * This is a special swap entry, ignore migration, use
+ * device and report anything else as error.
+ */
+ if (is_device_entry(entry)) {
+ pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
+ if (is_write_device_entry(entry))
+ pfns[i] |= HMM_PFN_WRITE;
+ pfns[i] |= HMM_PFN_DEVICE;
+ pfns[i] |= HMM_PFN_UNADDRESSABLE;
+ pfns[i] |= flag;
+ } else if (!is_migration_entry(entry)) {
+ pfns[i] = HMM_PFN_ERROR;
+ }
+ }
+ pte_unmap(ptep - 1);
+ }
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot the CPU page table for a range of virtual addresses
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t provided by the caller, filled by this function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by the range struct, see hmm_vma_range_done() for
+ * further information.
+ *
+ * The range struct is initialized and tracks CPU page table updates only if
+ * the function returns success (0); you must then call hmm_vma_range_done()
+ * to stop tracking CPU page table updates for the range.
+ *
+ * NOT CALLING hmm_vma_range_done() IF THIS FUNCTION RETURNS 0 WILL LEAD TO
+ * SERIOUS MEMORY CORRUPTION! YOU HAVE BEEN WARNED!
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+ struct hmm_range *range,
+ unsigned long start,
+ unsigned long end,
+ hmm_pfn_t *pfns)
+{
+ struct hmm *hmm;
+
+ /* FIXME support hugetlb fs */
+ if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+ hmm_pfns_special(pfns, start, end);
+ return -EINVAL;
+ }
+
+ /* Sanity check, this really should not happen ! */
+ if (start < vma->vm_start || start >= vma->vm_end)
+ return -EINVAL;
+ if (end < vma->vm_start || end > vma->vm_end)
+ return -EINVAL;
+
+ hmm = hmm_register(vma->vm_mm);
+ if (!hmm)
+ return -ENOMEM;
+ /* Caller must have registered a mirror (with hmm_mirror_register())! */
+ if (!hmm->mmu_notifier.ops)
+ return -EINVAL;
+
+ /* Initialize range to track CPU page table update */
+ range->start = start;
+ range->pfns = pfns;
+ range->end = end;
+ spin_lock(&hmm->lock);
+ range->valid = true;
+ list_add_rcu(&range->list, &hmm->ranges);
+ spin_unlock(&hmm->lock);
+
+ hmm_vma_walk(vma, start, end, pfns);
+ return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking changes to the CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being tracked
+ * Returns: false if the range data has been invalidated, true otherwise
+ *
+ * The range struct is used to track CPU page table updates after a call to
+ * hmm_vma_get_pfns(). Once the device driver is done using, or wants to lock
+ * updates to, the data it got from that function, it calls
+ * hmm_vma_range_done(), which stops the tracking.
+ *
+ * There are two ways to use this:
+ * again:
+ * hmm_vma_get_pfns(vma, range, start, end, pfns);
+ * trans = device_build_page_table_update_transaction(pfns);
+ * device_page_table_lock();
+ * if (!hmm_vma_range_done(vma, range)) {
+ * device_page_table_unlock();
+ * goto again;
+ * }
+ * device_commit_transaction(trans);
+ * device_page_table_unlock();
+ *
+ * Or:
+ * hmm_vma_get_pfns(vma, range, start, end, pfns);
+ * device_page_table_lock();
+ * hmm_vma_range_done(vma, range);
+ * device_update_page_table(pfns);
+ * device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+ unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+ struct hmm *hmm;
+
+ if (range->end <= range->start) {
+ BUG();
+ return false;
+ }
+
+ hmm = hmm_register(vma->vm_mm);
+ if (!hmm) {
+ memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+ return false;
+ }
+
+ spin_lock(&hmm->lock);
+ list_del_rcu(&range->list);
+ spin_unlock(&hmm->lock);
+
+ return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
--
2.4.11
The common case for migration of a virtual address range is that pages are
mapped only once inside the vma in which migration is taking place. Because
we already walk the CPU page table for that range we can directly do the
unmap there and set up the special migration swap entry.
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
mm/migrate.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 95 insertions(+), 16 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index e37d796..5a14b4ec 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2125,9 +2125,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
{
struct migrate_vma *migrate = walk->private;
struct mm_struct *mm = walk->vma->vm_mm;
- unsigned long addr = start;
+ unsigned long addr = start, unmapped = 0;
spinlock_t *ptl;
pte_t *ptep;
+ int ret = 0;
if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
/* FIXME support THP */
@@ -2135,9 +2136,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+
for (; addr < end; addr += PAGE_SIZE, ptep++) {
unsigned long flags, pfn;
struct page *page;
+ swp_entry_t entry;
pte_t pte;
int ret;
@@ -2170,17 +2174,50 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+ /*
+ * Optimize for the common case where page is only mapped once
+ * in one process. If we can lock the page, then we can safely
+ * set up a special migration page table entry now.
+ */
+ if (trylock_page(page)) {
+ pte_t swp_pte;
+
+ flags |= MIGRATE_PFN_LOCKED;
+ ptep_get_and_clear(mm, addr, ptep);
+
+ /* Setup special migration page table entry */
+ entry = make_migration_entry(page, pte_write(pte));
+ swp_pte = swp_entry_to_pte(entry);
+ if (pte_soft_dirty(pte))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ set_pte_at(mm, addr, ptep, swp_pte);
+
+ /*
+ * This is like regular unmap: we remove the rmap and
+ * drop page refcount. Page won't be freed, as we took
+ * a reference just above.
+ */
+ page_remove_rmap(page, false);
+ put_page(page);
+ unmapped++;
+ }
+
next:
migrate->src[migrate->npages++] = pfn | flags;
ret = migrate_vma_array_full(migrate);
if (ret) {
- pte_unmap_unlock(ptep, ptl);
- return ret;
+ ptep++;
+ break;
}
}
+ arch_leave_lazy_mmu_mode();
pte_unmap_unlock(ptep - 1, ptl);
- return 0;
+ /* Only flush the TLB if we actually modified any entries */
+ if (unmapped)
+ flush_tlb_range(walk->vma, start, end);
+
+ return ret;
}
/*
@@ -2204,7 +2241,13 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
mm_walk.mm = migrate->vma->vm_mm;
mm_walk.private = migrate;
+ mmu_notifier_invalidate_range_start(mm_walk.mm,
+ migrate->start,
+ migrate->end);
walk_page_range(migrate->start, migrate->end, &mm_walk);
+ mmu_notifier_invalidate_range_end(mm_walk.mm,
+ migrate->start,
+ migrate->end);
migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
}
@@ -2251,21 +2294,27 @@ static bool migrate_vma_check_page(struct page *page)
*/
static void migrate_vma_prepare(struct migrate_vma *migrate)
{
- unsigned long addr = migrate->start, i, size;
+ unsigned long addr = migrate->start, i, size, restore = 0;
const unsigned long npages = migrate->npages;
+ const unsigned long start = migrate->start;
bool allow_drain = true;
lru_add_drain();
- for (i = 0; i < npages && migrate->cpages; i++, addr += size) {
+ for (addr = start, i = 0; i < npages; i++, addr += size) {
struct page *page = migrate_pfn_to_page(migrate->src[i]);
+ bool remap = true;
+
size = migrate_pfn_size(migrate->src[i]);
if (!page)
continue;
- lock_page(page);
- migrate->src[i] |= MIGRATE_PFN_LOCKED;
+ if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
+ remap = false;
+ lock_page(page);
+ migrate->src[i] |= MIGRATE_PFN_LOCKED;
+ }
if (!PageLRU(page) && allow_drain) {
/* Drain CPU's pagevec */
@@ -2274,10 +2323,16 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
}
if (isolate_lru_page(page)) {
- migrate->src[i] = 0;
- unlock_page(page);
- migrate->cpages--;
- put_page(page);
+ if (remap) {
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ migrate->cpages--;
+ restore++;
+ } else {
+ migrate->src[i] = 0;
+ unlock_page(page);
+ migrate->cpages--;
+ put_page(page);
+ }
continue;
}
@@ -2285,13 +2340,37 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
put_page(page);
if (!migrate_vma_check_page(page)) {
- migrate->src[i] = 0;
- unlock_page(page);
- migrate->cpages--;
+ if (remap) {
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ migrate->cpages--;
+ restore++;
- putback_lru_page(page);
+ get_page(page);
+ putback_lru_page(page);
+ } else {
+ migrate->src[i] = 0;
+ unlock_page(page);
+ migrate->cpages--;
+
+ putback_lru_page(page);
+ }
}
}
+
+ for (i = 0, addr = start; i < npages && restore; i++, addr += size) {
+ struct page *page = migrate_pfn_to_page(migrate->src[i]);
+ size = migrate_pfn_size(migrate->src[i]);
+
+ if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ remove_migration_pte(page, migrate->vma, addr, page);
+
+ migrate->src[i] = 0;
+ unlock_page(page);
+ put_page(page);
+ restore--;
+ }
}
/*
--
2.4.11
This patch adds a new memory migration helper, which migrates memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses and thus by
doing migration in chunks that can be large enough to use a DMA engine
or a special copy offloading engine.
Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, IBM platforms with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high-performance memory not managed as a
cache but presented as regular memory (while being faster and with
lower latency than DDR) will also be prime users of this patch.
Migration to private device memory will be useful for devices that
have a large pool of such memory, like GPUs; NVidia plans to use HMM
for that.
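For illustration, a device driver would be expected to drive the helper
roughly as in the sketch below. The mydev_* names and the DMA helper are
hypothetical; only migrate_vma(), struct migrate_vma_ops, the MIGRATE_PFN_*
flags and migrate_pfn_to_page() come from this patch.

  static void mydev_alloc_and_copy(struct vm_area_struct *vma,
                                   const unsigned long *src, unsigned long *dst,
                                   unsigned long start, unsigned long end,
                                   void *private)
  {
      unsigned long i, npages = (end - start) >> PAGE_SHIFT;

      for (i = 0; i < npages; i++) {
          struct page *spage = migrate_pfn_to_page(src[i]);
          struct page *dpage;

          if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
              continue;
          dpage = mydev_alloc_page(private);      /* hypothetical */
          if (!dpage) {
              dst[i] = 0;     /* skip migration of this entry */
              continue;
          }
          lock_page(dpage);
          mydev_dma_copy(private, dpage, spage);  /* hypothetical */
          dst[i] = page_to_pfn(dpage) | MIGRATE_PFN_VALID |
                   MIGRATE_PFN_LOCKED;
      }
  }

  static const struct migrate_vma_ops mydev_migrate_ops = {
      .alloc_and_copy   = mydev_alloc_and_copy,
      .finalize_and_map = mydev_finalize_and_map,     /* hypothetical */
  };

      /* With mmap_sem held for the vma covering [start, end): */
      ret = migrate_vma(&mydev_migrate_ops, vma, npages,
                        start, end, src, dst, mydev);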
Changes since v3:
- Rebase
Changes since v2:
- dropped HMM prefix and HMM specific code
Changes since v1:
- typos fix
- split early unmap optimization for page with single mapping
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
---
include/linux/migrate.h | 73 ++++++++
mm/migrate.c | 460 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 533 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0a66ddd..6c610ee 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -124,4 +124,77 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
}
#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
+
+#define MIGRATE_PFN_VALID (1UL << (BITS_PER_LONG_LONG - 1))
+#define MIGRATE_PFN_MIGRATE (1UL << (BITS_PER_LONG_LONG - 2))
+#define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
+#define MIGRATE_PFN_LOCKED (1UL << (BITS_PER_LONG_LONG - 4))
+#define MIGRATE_PFN_WRITE (1UL << (BITS_PER_LONG_LONG - 5))
+#define MIGRATE_PFN_MASK ((1UL << (BITS_PER_LONG_LONG - PAGE_SHIFT)) - 1)
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+ if (!(mpfn & MIGRATE_PFN_VALID))
+ return NULL;
+ return pfn_to_page(mpfn & MIGRATE_PFN_MASK);
+}
+
+static inline unsigned long migrate_pfn_size(unsigned long mpfn)
+{
+ return mpfn & MIGRATE_PFN_HUGE ? PMD_SIZE : PAGE_SIZE;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: allocate destination memory and copy source to it
+ * @finalize_and_map: allow caller to inspect successfully migrated pages
+ *
+ * migrate_vma() allows memory migration to use a DMA engine to perform the
+ * copy from source to destination memory; it also allows the caller to use
+ * its own memory allocator for destination memory.
+ *
+ * Note that in alloc_and_copy the device driver can decide not to migrate
+ * some of the entries by simply setting the corresponding dst entry to 0.
+ *
+ * The destination page must be locked and MIGRATE_PFN_LOCKED set in the
+ * corresponding entry of the dst array. It is expected that the allocated
+ * page will have an elevated refcount and that a put_page() will free the
+ * page.
+ *
+ * Device drivers might want to allocate with an extra refcount if they want
+ * to control deallocation of failed migrations inside the finalize_and_map()
+ * callback.
+ *
+ * The finalize_and_map() callback must use the MIGRATE_PFN_MIGRATE flag to
+ * determine which pages have been successfully migrated (it is set in the
+ * src array for each entry that has been successfully migrated).
+ *
+ * For migration from device memory to system memory the device driver must
+ * set the dst entry to MIGRATE_PFN_ERROR for any entry it cannot migrate
+ * back due to a fatal hardware failure that cannot be recovered. Such a
+ * failure will trigger a SIGBUS for the process trying to access that memory.
+ */
+struct migrate_vma_ops {
+ void (*alloc_and_copy)(struct vm_area_struct *vma,
+ const unsigned long *src,
+ unsigned long *dst,
+ unsigned long start,
+ unsigned long end,
+ void *private);
+ void (*finalize_and_map)(struct vm_area_struct *vma,
+ const unsigned long *src,
+ const unsigned long *dst,
+ unsigned long start,
+ unsigned long end,
+ void *private);
+};
+
+int migrate_vma(const struct migrate_vma_ops *ops,
+ struct vm_area_struct *vma,
+ unsigned long mentries,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *src,
+ unsigned long *dst,
+ void *private);
+
#endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index cb911ce..e37d796 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -393,6 +393,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
int expected_count = 1 + extra_count;
void **pslot;
+ /*
+ * ZONE_DEVICE pages have 1 refcount always held by their device
+ *
+ * Note that DAX memory will never reach that point as it does not have
+ * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+ */
+ expected_count += is_zone_device_page(page);
+
if (!mapping) {
/* Anonymous page without mapping */
if (page_count(page) != expected_count)
@@ -2061,3 +2069,455 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
+
+
+struct migrate_vma {
+ struct vm_area_struct *vma;
+ unsigned long *dst;
+ unsigned long *src;
+ unsigned long cpages;
+ unsigned long npages;
+ unsigned long mpages;
+ unsigned long start;
+ unsigned long end;
+};
+
+static inline int migrate_vma_array_full(struct migrate_vma *migrate)
+{
+ return migrate->npages >= migrate->mpages ? -ENOSPC : 0;
+}
+
+static int migrate_vma_collect_hole(unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct migrate_vma *migrate = walk->private;
+ unsigned long addr, next;
+
+ for (addr = start & PAGE_MASK; addr < end; addr = next) {
+ unsigned long npages, i;
+ int ret;
+
+ next = pmd_addr_end(addr, end);
+ npages = (next - addr) >> PAGE_SHIFT;
+ if (npages == (PMD_SIZE >> PAGE_SHIFT)) {
+ migrate->src[migrate->npages++] = MIGRATE_PFN_HUGE;
+ ret = migrate_vma_array_full(migrate);
+ if (ret)
+ return ret;
+ } else {
+ for (i = 0; i < npages; ++i) {
+ migrate->src[migrate->npages++] = 0;
+ ret = migrate_vma_array_full(migrate);
+ if (ret)
+ return ret;
+ }
+ }
+ }
+
+ return 0;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct migrate_vma *migrate = walk->private;
+ struct mm_struct *mm = walk->vma->vm_mm;
+ unsigned long addr = start;
+ spinlock_t *ptl;
+ pte_t *ptep;
+
+ if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+ /* FIXME support THP */
+ return migrate_vma_collect_hole(start, end, walk);
+ }
+
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ for (; addr < end; addr += PAGE_SIZE, ptep++) {
+ unsigned long flags, pfn;
+ struct page *page;
+ pte_t pte;
+ int ret;
+
+ pte = *ptep;
+ pfn = pte_pfn(pte);
+
+ if (!pte_present(pte)) {
+ flags = pfn = 0;
+ goto next;
+ }
+
+ /* FIXME support THP */
+ page = vm_normal_page(migrate->vma, addr, pte);
+ if (!page || !page->mapping || PageTransCompound(page)) {
+ flags = pfn = 0;
+ goto next;
+ }
+
+ /*
+ * By getting a reference on the page we pin it and that blocks
+ * any kind of migration. Side effect is that it "freezes" the
+ * pte.
+ *
+ * We drop this reference after isolating the page from the lru
+ * for non device page (device page are not on the lru and thus
+ * can't be dropped from it).
+ */
+ get_page(page);
+ migrate->cpages++;
+ flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
+ flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+
+next:
+ migrate->src[migrate->npages++] = pfn | flags;
+ ret = migrate_vma_array_full(migrate);
+ if (ret) {
+ pte_unmap_unlock(ptep, ptl);
+ return ret;
+ }
+ }
+ pte_unmap_unlock(ptep - 1, ptl);
+
+ return 0;
+}
+
+/*
+ * migrate_vma_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This will walk the CPU page table. For each virtual address backed by a
+ * valid page, it updates the src array and takes a reference on the page, in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void migrate_vma_collect(struct migrate_vma *migrate)
+{
+ struct mm_walk mm_walk;
+
+ mm_walk.pmd_entry = migrate_vma_collect_pmd;
+ mm_walk.pte_entry = NULL;
+ mm_walk.pte_hole = migrate_vma_collect_hole;
+ mm_walk.hugetlb_entry = NULL;
+ mm_walk.test_walk = NULL;
+ mm_walk.vma = migrate->vma;
+ mm_walk.mm = migrate->vma->vm_mm;
+ mm_walk.private = migrate;
+
+ walk_page_range(migrate->start, migrate->end, &mm_walk);
+
+ migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
+}
+
+/*
+ * migrate_vma_check_page() - check if page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages cannot be migrated. This is the same test as in
+ * migrate_page_move_mapping(), except that here we allow migration of a
+ * ZONE_DEVICE page.
+ */
+static bool migrate_vma_check_page(struct page *page)
+{
+ /*
+ * One extra ref because caller holds an extra reference, either from
+ * isolate_lru_page() for a regular page, or migrate_vma_collect() for
+ * a device page.
+ */
+ int extra = 1;
+
+ /*
+ * FIXME support THP (transparent huge page), it is a bit more complex to
+ * check them than regular pages, because they can be mapped with a pmd
+ * or with a pte (split pte mapping).
+ */
+ if (PageCompound(page))
+ return false;
+
+ if ((page_count(page) - extra) > page_mapcount(page))
+ return false;
+
+ return true;
+}
+
+/*
+ * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by migrate_vma_collect(). Once each
+ * page is locked it is isolated from the lru (for non-device pages). Finally,
+ * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
+ * migrated by concurrent kernel threads.
+ */
+static void migrate_vma_prepare(struct migrate_vma *migrate)
+{
+ unsigned long addr = migrate->start, i, size;
+ const unsigned long npages = migrate->npages;
+ bool allow_drain = true;
+
+ lru_add_drain();
+
+ for (i = 0; i < npages && migrate->cpages; i++, addr += size) {
+ struct page *page = migrate_pfn_to_page(migrate->src[i]);
+ size = migrate_pfn_size(migrate->src[i]);
+
+ if (!page)
+ continue;
+
+ lock_page(page);
+ migrate->src[i] |= MIGRATE_PFN_LOCKED;
+
+ if (!PageLRU(page) && allow_drain) {
+ /* Drain CPU's pagevec */
+ lru_add_drain_all();
+ allow_drain = false;
+ }
+
+ if (isolate_lru_page(page)) {
+ migrate->src[i] = 0;
+ unlock_page(page);
+ migrate->cpages--;
+ put_page(page);
+ continue;
+ }
+
+ /* Drop the reference we took in collect */
+ put_page(page);
+
+ if (!migrate_vma_check_page(page)) {
+ migrate->src[i] = 0;
+ unlock_page(page);
+ migrate->cpages--;
+
+ putback_lru_page(page);
+ }
+ }
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace page mapping (CPU page table pte) with special migration pte entry
+ * and check again if it has been pinned. Pinned pages are restored because we
+ * cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+ int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+ unsigned long addr = migrate->start, i, restore = 0, size;
+ const unsigned long npages = migrate->npages;
+ const unsigned long start = migrate->start;
+
+ for (i = 0; i < npages && migrate->cpages; addr += size, i++) {
+ struct page *page = migrate_pfn_to_page(migrate->src[i]);
+ size = migrate_pfn_size(migrate->src[i]);
+
+ if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ try_to_unmap(page, flags);
+ if (page_mapped(page) || !migrate_vma_check_page(page)) {
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ migrate->cpages--;
+ restore++;
+ }
+ }
+
+ for (addr = start, i = 0; i < npages && restore; addr += size, i++) {
+ struct page *page = migrate_pfn_to_page(migrate->src[i]);
+ size = migrate_pfn_size(migrate->src[i]);
+
+ if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ remove_migration_ptes(page, page, false);
+
+ migrate->src[i] = 0;
+ unlock_page(page);
+ restore--;
+
+ putback_lru_page(page);
+ }
+}
+
+/*
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
+ */
+static void migrate_vma_pages(struct migrate_vma *migrate)
+{
+ const unsigned long npages = migrate->npages;
+ unsigned long addr, i, size;
+
+ for (i = 0, addr = migrate->start; i < npages; addr += size, i++) {
+ struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+ struct page *page = migrate_pfn_to_page(migrate->src[i]);
+ struct address_space *mapping;
+ int r;
+
+ size = migrate_pfn_size(migrate->src[i]);
+
+ if (!page || !newpage)
+ continue;
+ if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ mapping = page_mapping(page);
+
+ r = migrate_page(mapping, newpage, page, MIGRATE_SYNC, false);
+ if (r != MIGRATEPAGE_SUCCESS)
+ migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ }
+}
+
+/*
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the extra
+ * refcount, for device pages.
+ */
+static void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+ const unsigned long npages = migrate->npages;
+ unsigned long addr, i, size;
+
+ for (i = 0, addr = migrate->start; i < npages; addr += size, i++) {
+ struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+ struct page *page = migrate_pfn_to_page(migrate->src[i]);
+ size = migrate_pfn_size(migrate->src[i]);
+
+ if (!page)
+ continue;
+ if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+ if (newpage) {
+ unlock_page(newpage);
+ put_page(newpage);
+ }
+ newpage = page;
+ }
+
+ remove_migration_ptes(page, newpage, false);
+ unlock_page(page);
+ migrate->cpages--;
+
+ putback_lru_page(page);
+
+ if (newpage != page) {
+ unlock_page(newpage);
+ putback_lru_page(newpage);
+ }
+ }
+}
+
+/*
+ * migrate_vma() - migrate a range of memory inside vma using accelerated copy
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @mentries: maximum number of entries in the src or dst pfns arrays
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @src: array containing the source pfns
+ * @dst: array containing the destination pfns
+ * @private: pointer passed back to each of the callbacks
+ * Returns: 0 on success, error code otherwise
+ *
+ * This will try to migrate a range of memory using callbacks to allocate and
+ * copy memory from source to destination. The function will first collect,
+ * lock and unmap pages in the range and then call the alloc_and_copy()
+ * callback for the device driver to allocate destination memory and copy from
+ * source.
+ *
+ * Then it will proceed and try to effectively migrate the pages (struct page
+ * metadata), a step that can fail for various reasons. Before updating the
+ * CPU page table it will call the finalize_and_map() callback so that the
+ * device driver can inspect what has been successfully migrated and update
+ * its own page tables (this latter aspect is not mandatory and only makes
+ * sense for some users of this API).
+ *
+ * Finally the function updates the CPU page table and unlocks the pages
+ * before returning 0.
+ *
+ * It will return an error code only if one of the arguments is invalid.
+ */
+int migrate_vma(const struct migrate_vma_ops *ops,
+ struct vm_area_struct *vma,
+ unsigned long mentries,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *src,
+ unsigned long *dst,
+ void *private)
+{
+ struct migrate_vma migrate;
+
+ /* Sanity check the arguments */
+ start &= PAGE_MASK;
+ end &= PAGE_MASK;
+ if (!vma || !ops || !src || !dst || start >= end)
+ return -EINVAL;
+ if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+ return -EINVAL;
+ if (start < vma->vm_start || start >= vma->vm_end)
+ return -EINVAL;
+ if (end <= vma->vm_start || end > vma->vm_end)
+ return -EINVAL;
+
+ memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
+ migrate.src = src;
+ migrate.dst = dst;
+ migrate.start = start;
+ migrate.npages = 0;
+ migrate.cpages = 0;
+ migrate.mpages = mentries;
+ migrate.end = end;
+ migrate.vma = vma;
+
+ /* Collect, and try to unmap source pages */
+ migrate_vma_collect(&migrate);
+ if (!migrate.cpages)
+ return 0;
+
+ /* Lock and isolate page */
+ migrate_vma_prepare(&migrate);
+ if (!migrate.cpages)
+ return 0;
+
+ /* Unmap pages */
+ migrate_vma_unmap(&migrate);
+ if (!migrate.cpages)
+ return 0;
+
+ /*
+ * At this point pages are locked and unmapped, and thus they have
+ * stable content and can safely be copied to destination memory that
+ * is allocated by the callback.
+ *
+ * Note that migration can fail in migrate_vma_pages() for each
+ * individual page.
+ */
+ ops->alloc_and_copy(vma, src, dst, start, end, private);
+
+ /* This does the real migration of struct page */
+ migrate_vma_pages(&migrate);
+
+ ops->finalize_and_map(vma, src, dst, start, end, private);
+
+ /* Unlock and remap pages */
+ migrate_vma_finalize(&migrate);
+
+ return 0;
+}
+EXPORT_SYMBOL(migrate_vma);
--
2.4.11
It does not need much, just skip populating the kernel linear mapping
for the range of un-addressable device memory (the range is picked so
that no physical memory resource overlaps it). All the logic is in
shared mm code.
Only x86-64 is supported, as this feature doesn't make much sense with
the constrained virtual address space of 32-bit architectures.
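As a rough illustration of what ends up being passed down the hotplug path
for such memory (in practice this is expected to come from the HMM device
memory helpers rather than a direct call, and the resource handling here is
only a placeholder):

      /* Hypothetical hotplug of un-addressable device memory. */
      ret = arch_add_memory(nid, res->start, resource_size(res),
                            MEMORY_DEVICE |
                            MEMORY_DEVICE_ALLOW_MIGRATE |
                            MEMORY_DEVICE_UNADDRESSABLE);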
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/x86/mm/init_64.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0098dc9..7c8c91c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -644,7 +644,8 @@ static void update_end_of_memory_vars(u64 start, u64 size)
int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
const int supported_flags = MEMORY_DEVICE |
- MEMORY_DEVICE_ALLOW_MIGRATE;
+ MEMORY_DEVICE_ALLOW_MIGRATE |
+ MEMORY_DEVICE_UNADDRESSABLE;
struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones +
zone_for_memory(nid, start, size, ZONE_NORMAL,
@@ -659,7 +660,17 @@ int arch_add_memory(int nid, u64 start, u64 size, int flags)
return -EINVAL;
}
- init_memory_mapping(start, start + size);
+ /*
+ * We get un-addressable memory when someone adds a ZONE_DEVICE to get
+ * struct pages for device memory which is not accessible by the CPU,
+ * so it is pointless to have a kernel linear mapping of such memory.
+ *
+ * Core mm should make sure it never sets a pte pointing to such a fake
+ * physical range.
+ */
+ if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+ init_memory_mapping(start, start + size);
ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
@@ -958,7 +969,8 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
int __ref arch_remove_memory(u64 start, u64 size, int flags)
{
const int supported_flags = MEMORY_DEVICE |
- MEMORY_DEVICE_ALLOW_MIGRATE;
+ MEMORY_DEVICE_ALLOW_MIGRATE |
+ MEMORY_DEVICE_UNADDRESSABLE;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
struct page *page = pfn_to_page(start_pfn);
@@ -979,7 +991,9 @@ int __ref arch_remove_memory(u64 start, u64 size, int flags)
zone = page_zone(page);
ret = __remove_pages(zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
- kernel_physical_mapping_remove(start, start + size);
+
+ if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+ kernel_physical_mapping_remove(start, start + size);
return ret;
}
--
2.4.11
This does not affect non-ZONE_DEVICE pages. In order to allow
ZONE_DEVICE pages to be tracked we need to detect when the refcount
of a ZONE_DEVICE page reaches 1 (not 0 as for non-ZONE_DEVICE pages).
This patch just moves the page refcount decrement for ZONE_DEVICE
pages out of put_page() and into put_zone_device_page(). It does not
add any overhead compared to the existing code.
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
---
include/linux/mm.h | 8 +++++---
kernel/memremap.c | 2 ++
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f01c88..28e8b28 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -793,11 +793,13 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);
+ if (unlikely(is_zone_device_page(page))) {
+ put_zone_device_page(page);
+ return;
+ }
+
if (put_page_testzero(page))
__put_page(page);
-
- if (unlikely(is_zone_device_page(page)))
- put_zone_device_page(page);
}
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 40d4af8..c821946 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,6 +190,8 @@ EXPORT_SYMBOL(get_zone_device_page);
void put_zone_device_page(struct page *page)
{
+ page_ref_dec(page);
+
put_dev_pagemap(page->pgmap);
}
EXPORT_SYMBOL(put_zone_device_page);
--
2.4.11
Allow migration without copy in case the destination page already has
the source page content. This is useful for the new DMA-capable
migration, where a device DMA engine is used to copy pages.
This feature needs a careful audit of filesystem code to make sure
that no one can write to the source page while it is unmapped and
locked. It should be safe for most filesystems, but as a precaution
return an error until support for device migration is added to them.
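The per-filesystem changes below all follow the same shape; roughly, with a
generic foofs_migratepage() standing in for each filesystem's callback:

  static int foofs_migratepage(struct address_space *mapping,
                               struct page *newpage, struct page *page,
                               enum migrate_mode mode, bool copy)
  {
      /* Un-addressable (device) destination pages are not supported yet. */
      if (!is_addressable_page(newpage))
          return -EINVAL;

      /* Pass copy down so a DMA-capable caller can skip the CPU copy when
       * the destination already holds the source page content. */
      return migrate_page(mapping, newpage, page, mode, copy);
  }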
Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/staging/lustre/lustre/llite/rw26.c | 8 +++--
fs/aio.c | 7 +++-
fs/btrfs/disk-io.c | 11 ++++--
fs/f2fs/data.c | 8 ++++-
fs/f2fs/f2fs.h | 2 +-
fs/hugetlbfs/inode.c | 9 +++--
fs/nfs/internal.h | 5 +--
fs/nfs/write.c | 9 +++--
fs/ubifs/file.c | 8 ++++-
include/linux/balloon_compaction.h | 3 +-
include/linux/fs.h | 13 ++++---
include/linux/migrate.h | 7 ++--
mm/balloon_compaction.c | 2 +-
mm/migrate.c | 56 +++++++++++++++++++-----------
mm/zsmalloc.c | 12 ++++++-
15 files changed, 114 insertions(+), 46 deletions(-)
diff --git a/drivers/staging/lustre/lustre/llite/rw26.c b/drivers/staging/lustre/lustre/llite/rw26.c
index d89e795..29a59bf 100644
--- a/drivers/staging/lustre/lustre/llite/rw26.c
+++ b/drivers/staging/lustre/lustre/llite/rw26.c
@@ -43,6 +43,7 @@
#include <linux/uaccess.h>
#include <linux/migrate.h>
+#include <linux/memremap.h>
#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/mpage.h>
@@ -642,9 +643,12 @@ static int ll_write_end(struct file *file, struct address_space *mapping,
#ifdef CONFIG_MIGRATION
static int ll_migratepage(struct address_space *mapping,
struct page *newpage, struct page *page,
- enum migrate_mode mode
- )
+ enum migrate_mode mode, bool copy)
{
+ /* Can only migrate addressable memory for now */
+ if (!is_addressable_page(newpage))
+ return -EINVAL;
+
/* Always fail page migration until we have a proper implementation */
return -EIO;
}
diff --git a/fs/aio.c b/fs/aio.c
index f52d925..fa6bb92 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -37,6 +37,7 @@
#include <linux/blkdev.h>
#include <linux/compat.h>
#include <linux/migrate.h>
+#include <linux/memremap.h>
#include <linux/ramfs.h>
#include <linux/percpu-refcount.h>
#include <linux/mount.h>
@@ -366,13 +367,17 @@ static const struct file_operations aio_ring_fops = {
#if IS_ENABLED(CONFIG_MIGRATION)
static int aio_migratepage(struct address_space *mapping, struct page *new,
- struct page *old, enum migrate_mode mode)
+ struct page *old, enum migrate_mode mode, bool copy)
{
struct kioctx *ctx;
unsigned long flags;
pgoff_t idx;
int rc;
+ /* Can only migrate addressable memory for now */
+ if (!is_addressable_page(new))
+ return -EINVAL;
+
rc = 0;
/* mapping->private_lock here protects against the kioctx teardown. */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74da..a2b75d6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -27,6 +27,7 @@
#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/migrate.h>
+#include <linux/memremap.h>
#include <linux/ratelimit.h>
#include <linux/uuid.h>
#include <linux/semaphore.h>
@@ -1061,9 +1062,13 @@ static int btree_submit_bio_hook(struct inode *inode, struct bio *bio,
#ifdef CONFIG_MIGRATION
static int btree_migratepage(struct address_space *mapping,
- struct page *newpage, struct page *page,
- enum migrate_mode mode)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode, bool copy)
{
+ /* Can only migrate addressable memory for now */
+ if (!is_addressable_page(newpage))
+ return -EINVAL;
+
/*
* we can't safely write a btree page from here,
* we haven't done the locking hook
@@ -1077,7 +1082,7 @@ static int btree_migratepage(struct address_space *mapping,
if (page_has_private(page) &&
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page, mode);
+ return migrate_page(mapping, newpage, page, mode, copy);
}
#endif
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 1602b4b..14208a5 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -23,6 +23,7 @@
#include <linux/memcontrol.h>
#include <linux/cleancache.h>
#include <linux/sched/signal.h>
+#include <linux/memremap.h>
#include "f2fs.h"
#include "node.h"
@@ -2049,7 +2050,8 @@ static sector_t f2fs_bmap(struct address_space *mapping, sector_t block)
#include <linux/migrate.h>
int f2fs_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, enum migrate_mode mode)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode, bool copy)
{
int rc, extra_count;
struct f2fs_inode_info *fi = F2FS_I(mapping->host);
@@ -2057,6 +2059,10 @@ int f2fs_migrate_page(struct address_space *mapping,
BUG_ON(PageWriteback(page));
+ /* Can only migrate addressable memory for now */
+ if (!is_addressable_page(newpage))
+ return -EINVAL;
+
/* migrating an atomic written page is safe with the inmem_lock hold */
if (atomic_written && !mutex_trylock(&fi->inmem_lock))
return -EAGAIN;
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index e849f83..ffa5333 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2299,7 +2299,7 @@ void f2fs_invalidate_page(struct page *page, unsigned int offset,
int f2fs_release_page(struct page *page, gfp_t wait);
#ifdef CONFIG_MIGRATION
int f2fs_migrate_page(struct address_space *mapping, struct page *newpage,
- struct page *page, enum migrate_mode mode);
+ struct page *page, enum migrate_mode mode, bool copy);
#endif
/*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8f96461..13f74d6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -35,6 +35,7 @@
#include <linux/security.h>
#include <linux/magic.h>
#include <linux/migrate.h>
+#include <linux/memremap.h>
#include <linux/uio.h>
#include <linux/uaccess.h>
@@ -842,11 +843,15 @@ static int hugetlbfs_set_page_dirty(struct page *page)
}
static int hugetlbfs_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page,
- enum migrate_mode mode)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode, bool copy)
{
int rc;
+ /* Can only migrate addressable memory for now */
+ if (!is_addressable_page(newpage))
+ return -EINVAL;
+
rc = migrate_huge_page_move_mapping(mapping, newpage, page);
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 09ca509..2e23275 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -535,8 +535,9 @@ void nfs_clear_pnfs_ds_commit_verifiers(struct pnfs_ds_commit_info *cinfo)
#endif
#ifdef CONFIG_MIGRATION
-extern int nfs_migrate_page(struct address_space *,
- struct page *, struct page *, enum migrate_mode);
+extern int nfs_migrate_page(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode, bool copy);
#endif
static inline int
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index e75b056..1bc4354 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -14,6 +14,7 @@
#include <linux/writeback.h>
#include <linux/swap.h>
#include <linux/migrate.h>
+#include <linux/memremap.h>
#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
@@ -2020,8 +2021,12 @@ int nfs_wb_single_page(struct inode *inode, struct page *page, bool launder)
#ifdef CONFIG_MIGRATION
int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
- struct page *page, enum migrate_mode mode)
+ struct page *page, enum migrate_mode mode, bool copy)
{
+ /* Can only migrate addressable memory for now */
+ if (!is_addressable_page(newpage))
+ return -EINVAL;
+
/*
* If PagePrivate is set, then the page is currently associated with
* an in-progress read or write request. Don't try to migrate it.
@@ -2036,7 +2041,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
if (!nfs_fscache_release_page(page, GFP_KERNEL))
return -EBUSY;
- return migrate_page(mapping, newpage, page, mode);
+ return migrate_page(mapping, newpage, page, mode, copy);
}
#endif
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index d9ae86f..298fbae 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -53,6 +53,7 @@
#include <linux/mount.h>
#include <linux/slab.h>
#include <linux/migrate.h>
+#include <linux/memremap.h>
static int read_block(struct inode *inode, void *addr, unsigned int block,
struct ubifs_data_node *dn)
@@ -1469,10 +1470,15 @@ static int ubifs_set_page_dirty(struct page *page)
#ifdef CONFIG_MIGRATION
static int ubifs_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, enum migrate_mode mode)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode, bool copy)
{
int rc;
+ /* Can only migrate addressable memory for now */
+ if (!is_addressable_page(newpage))
+ return -EINVAL;
+
rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 79542b2..27cf3e3 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -85,7 +85,8 @@ extern bool balloon_page_isolate(struct page *page,
extern void balloon_page_putback(struct page *page);
extern int balloon_page_migrate(struct address_space *mapping,
struct page *newpage,
- struct page *page, enum migrate_mode mode);
+ struct page *page, enum migrate_mode mode,
+ bool copy);
/*
* balloon_page_insert - insert a page into the balloon's page list and make
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7b..706a9a9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -346,8 +346,9 @@ struct address_space_operations {
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
*/
- int (*migratepage) (struct address_space *,
- struct page *, struct page *, enum migrate_mode);
+ int (*migratepage)(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode, bool copy);
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
int (*launder_page) (struct page *);
@@ -3013,9 +3014,11 @@ extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
extern int generic_check_addressable(unsigned, u64);
#ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct address_space *,
- struct page *, struct page *,
- enum migrate_mode);
+extern int buffer_migrate_page(struct address_space *mapping,
+ struct page *newpage,
+ struct page *page,
+ enum migrate_mode,
+ bool copy);
#else
#define buffer_migrate_page NULL
#endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index fa76b51..0a66ddd 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -33,8 +33,11 @@ extern char *migrate_reason_names[MR_TYPES];
#ifdef CONFIG_MIGRATION
extern void putback_movable_pages(struct list_head *l);
-extern int migrate_page(struct address_space *,
- struct page *, struct page *, enum migrate_mode);
+extern int migrate_page(struct address_space *mapping,
+ struct page *newpage,
+ struct page *page,
+ enum migrate_mode,
+ bool copy);
extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
unsigned long private, enum migrate_mode mode, int reason);
extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index da91df5..ed5cacb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -135,7 +135,7 @@ void balloon_page_putback(struct page *page)
/* move_to_new_page() counterpart for a ballooned page */
int balloon_page_migrate(struct address_space *mapping,
struct page *newpage, struct page *page,
- enum migrate_mode mode)
+ enum migrate_mode mode, bool copy)
{
struct balloon_dev_info *balloon = balloon_page_device(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9a0897a..cb911ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -596,18 +596,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
}
}
-/*
- * Copy the page to its new location
- */
-void migrate_page_copy(struct page *newpage, struct page *page)
+static void migrate_page_states(struct page *newpage, struct page *page)
{
int cpupid;
- if (PageHuge(page) || PageTransHuge(page))
- copy_huge_page(newpage, page);
- else
- copy_highpage(newpage, page);
-
if (PageError(page))
SetPageError(newpage);
if (PageReferenced(page))
@@ -661,6 +653,19 @@ void migrate_page_copy(struct page *newpage, struct page *page)
mem_cgroup_migrate(page, newpage);
}
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+ if (PageHuge(page) || PageTransHuge(page))
+ copy_huge_page(newpage, page);
+ else
+ copy_highpage(newpage, page);
+
+ migrate_page_states(newpage, page);
+}
EXPORT_SYMBOL(migrate_page_copy);
/************************************************************
@@ -674,8 +679,8 @@ EXPORT_SYMBOL(migrate_page_copy);
* Pages are locked upon entry and exit.
*/
int migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page,
- enum migrate_mode mode)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode, bool copy)
{
int rc;
@@ -686,7 +691,11 @@ int migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
- migrate_page_copy(newpage, page);
+ if (copy)
+ migrate_page_copy(newpage, page);
+ else
+ migrate_page_states(newpage, page);
+
return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page);
@@ -698,13 +707,14 @@ EXPORT_SYMBOL(migrate_page);
* exist.
*/
int buffer_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, enum migrate_mode mode)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode, bool copy)
{
struct buffer_head *bh, *head;
int rc;
if (!page_has_buffers(page))
- return migrate_page(mapping, newpage, page, mode);
+ return migrate_page(mapping, newpage, page, mode, copy);
head = page_buffers(page);
@@ -736,12 +746,15 @@ int buffer_migrate_page(struct address_space *mapping,
SetPagePrivate(newpage);
- migrate_page_copy(newpage, page);
+ if (copy)
+ migrate_page_copy(newpage, page);
+ else
+ migrate_page_states(newpage, page);
bh = head;
do {
unlock_buffer(bh);
- put_bh(bh);
+ put_bh(bh);
bh = bh->b_this_page;
} while (bh != head);
@@ -796,7 +809,8 @@ static int writeout(struct address_space *mapping, struct page *page)
* Default handling if a filesystem does not provide a migration function.
*/
static int fallback_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, enum migrate_mode mode)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
{
if (PageDirty(page)) {
/* Only writeback pages in full synchronous migration */
@@ -813,7 +827,7 @@ static int fallback_migrate_page(struct address_space *mapping,
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page, mode);
+ return migrate_page(mapping, newpage, page, mode, true);
}
/*
@@ -841,7 +855,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
if (likely(is_lru)) {
if (!mapping)
- rc = migrate_page(mapping, newpage, page, mode);
+ rc = migrate_page(mapping, newpage, page, mode, true);
else if (mapping->a_ops->migratepage)
/*
* Most pages have a mapping and most filesystems
@@ -851,7 +865,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
* for page migration.
*/
rc = mapping->a_ops->migratepage(mapping, newpage,
- page, mode);
+ page, mode, true);
else
rc = fallback_migrate_page(mapping, newpage,
page, mode);
@@ -868,7 +882,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
}
rc = mapping->a_ops->migratepage(mapping, newpage,
- page, mode);
+ page, mode, true);
WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
!PageIsolated(page));
}
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b7ee9c3..334ff64 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -52,6 +52,7 @@
#include <linux/zpool.h>
#include <linux/mount.h>
#include <linux/migrate.h>
+#include <linux/memremap.h>
#include <linux/pagemap.h>
#define ZSPAGE_MAGIC 0x58
@@ -1968,7 +1969,7 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
}
int zs_page_migrate(struct address_space *mapping, struct page *newpage,
- struct page *page, enum migrate_mode mode)
+ struct page *page, enum migrate_mode mode, bool copy)
{
struct zs_pool *pool;
struct size_class *class;
@@ -1986,6 +1987,15 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
VM_BUG_ON_PAGE(!PageMovable(page), page);
VM_BUG_ON_PAGE(!PageIsolated(page), page);
+ /*
+ * Offloading the copy operation for zspage requires special considerations
+ * due to locking, so for now we only support regular migration. I do
+ * not expect we will ever want to support offloading the copy. See hmm.h
+ * for more information on hmm_vma_migrate() and offload copy.
+ */
+ if (!copy || !is_addressable_page(newpage))
+ return -EINVAL;
+
zspage = get_zspage(page);
/* Concurrent compactor cannot migrate any subpage in zspage */
--
2.4.11
On Thu, Mar 16, 2017 at 12:05:26PM -0400, Jérôme Glisse wrote:
>This patch add a new memory migration helpers, which migrate memory
>backing a range of virtual address of a process to different memory
>(which can be allocated through special allocator). It differs from
>numa migration by working on a range of virtual address and thus by
>doing migration in chunk that can be large enough to use DMA engine or
>special copy offloading engine.
Reviewed-by: Reza Arbab <[email protected]>
Tested-by: Reza Arbab <[email protected]>
--
Reza Arbab
On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse <[email protected]> wrote:
> Cliff note:
"Cliff's notes" isn't appropriate for a large feature such as this.
Where's the long-form description? One which permits readers to fully
understand the requirements, design, alternative designs, the
implementation, the interface(s), etc?
Have you ever spoken about HMM at a conference? If so, the supporting
presentation documents might help here. That's the level of detail
which should be presented here.
> HMM offers 2 things (each standing on its own). First
> it allows to use device memory transparently inside any process
> without any modifications to process program code.
Well. What is "device memory"? That's very vague. What are the
characteristics of this memory? Why is it a requirement that
userspace code be unaltered? What are the security implications - does
the process need particular permissions to access this memory? What is
the proposed interface to set up this access?
> Second it allows to mirror process address space on a device.
Why? Why is this a requirement, how will it be used, what are the
use cases, etc?
I spent a bit of time trying to locate a decent writeup of this feature
but wasn't able to locate one. I'm not seeing a Documentation/ update
in this patchset. Perhaps if you were to sit down and write a detailed
Documentation/vm/hmm.txt then that would be a good starting point.
This stuff is important - it's not really feasible to perform a decent
review of this proposal unless the reviewer has access to this
high-level conceptual stuff.
So I'll take a look at merging this code as-is for testing purposes but
I won't be attempting to review it at this stage.
On Fri, Mar 17, 2017 at 3:24 AM, Reza Arbab <[email protected]> wrote:
> On Thu, Mar 16, 2017 at 12:05:26PM -0400, Jérôme Glisse wrote:
>>
>> This patch add a new memory migration helpers, which migrate memory
>> backing a range of virtual address of a process to different memory (which
>> can be allocated through special allocator). It differs from numa migration
>> by working on a range of virtual address and thus by doing migration in
>> chunk that can be large enough to use DMA engine or special copy offloading
>> engine.
>
>
> Reviewed-by: Reza Arbab <[email protected]>
> Tested-by: Reza Arbab <[email protected]>
>
Acked-by: Balbir Singh <[email protected]>
On Thu, 16 Mar 2017 12:05:26 -0400 Jérôme Glisse <[email protected]> wrote:
> +static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> +{
> + if (!(mpfn & MIGRATE_PFN_VALID))
> + return NULL;
> + return pfn_to_page(mpfn & MIGRATE_PFN_MASK);
> +}
i386 allnoconfig:
In file included from mm/page_alloc.c:61:
./include/linux/migrate.h: In function 'migrate_pfn_to_page':
./include/linux/migrate.h:139: warning: left shift count >= width of type
./include/linux/migrate.h:141: warning: left shift count >= width of type
./include/linux/migrate.h: In function 'migrate_pfn_size':
./include/linux/migrate.h:146: warning: left shift count >= width of type
On Thu, Mar 16, 2017 at 01:43:21PM -0700, Andrew Morton wrote:
> On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse <[email protected]> wrote:
>
> > Cliff note:
>
> "Cliff's notes" isn't appropriate for a large feature such as this.
> Where's the long-form description? One which permits readers to fully
> understand the requirements, design, alternative designs, the
> implementation, the interface(s), etc?
>
> Have you ever spoken about HMM at a conference? If so, the supporting
> presentation documents might help here. That's the level of detail
> which should be presented here.
A longer description of the patchset rationale, motivation and design choices
was given in the first few postings of the patchset, to which I included
a link in my cover letter. Also, given that I have presented this for the last
3 or 4 years at the mm summit and kernel summit, I thought that by now people
were familiar with the topic and wanted to spare them the long version.
My bad.
I attach a patch that is a first stab at a Documentation/hmm.txt that
explains the motivation and rationale behind HMM. I can probably add a
section about how to use HMM from the device driver point of view.
> > HMM offers 2 things (each standing on its own). First
> > it allows to use device memory transparently inside any process
> > without any modifications to process program code.
>
> Well. What is "device memory"? That's very vague. What are the
> characteristics of this memory? Why is it a requirement that
> userspace code be unaltered? What are the security implications - does
> the process need particular permissions to access this memory? What is
> the proposed interface to set up this access?
Things like GPU memory: think 16GB or 32GB with 1TB/s of bandwidth, so
something that is in a completely different category than DDR3/DDR4 or
PCIe bandwidth.
To allow a GPU/FPGA/... to be used transparently by a program we need to
avoid any requirement to modify any code. Advances in high-level language
constructs (in C++ but others too) give the compiler opportunities to
leverage the GPU transparently without the programmer's knowledge. But for
this to happen we need a shared address space, i.e. any pointer in the
program must be accessible by the device, and we must also be able to
migrate memory to device memory to benefit from the device memory bandwidth.
Moreover, if you think about complex software that uses a plethora of
libraries, you want to allow some of those libraries to leverage a GPU
or DSP transparently, without forcing the library to copy/duplicate its
input data, which can be highly complex if you think of trees, lists, ...
Making all this transparent from the program/library point of view eases
their development. Quite frankly, without it, it is borderline
impossible to efficiently use a GPU or other device in many cases.
The device memory is treated like regular memory from the kernel point of
view (except that the CPU can not access it), but everything else about the
page holds (read, write, execution protections ...). So there are no security
implications. The devices under consideration have page tables and work like
a CPU from the process isolation point of view (modulo hardware bugs, but the
CPU or main memory has those same issues).
There is no proposed interface here, nor do I see a need for one. When the
device starts accessing a range of the process address space, the device
driver can decide to migrate that range to device memory in order to
speed up computations. Only the device driver has enough information on
whether or not this is a good idea, and this changes continuously during
run time (it depends on what other processes are doing ...).
So for now, as was discussed in some CDM threads and in some previous
HMM threads, I believe it is better to let the device driver decide and
keep HMM out of any policy choices. Later down the road, once we get more
devices and more real-world usage, we can try to figure out if there is
a good way to expose a generic memory placement hint to userspace to
allow programs to improve performance by helping the device driver make
better decisions.
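To make that concrete, the driver-side call would look roughly like the sketch
below. This is hypothetical driver code, not something from the patchset: the
migrate_vma_ops callbacks (where the driver would allocate device pages and
program its copy engine) are left empty, and mentries is assumed to be the
number of page entries covered by the src/dst arrays; only the migrate_vma()
prototype itself is taken from this series.

#include <linux/mm.h>
#include <linux/migrate.h>

/* Callbacks for allocating device pages and driving the DMA copy would be
 * filled in here; they are omitted from this sketch. */
static const struct migrate_vma_ops dummy_migrate_ops = {
};

/* Hypothetical helper: migrate the pages backing [start, end) of @vma to
 * device memory in one call. src/dst hold one MIGRATE_PFN_* encoded entry
 * per page in the range. */
static int dummy_migrate_range(struct vm_area_struct *vma,
                               unsigned long start, unsigned long end,
                               unsigned long *src, unsigned long *dst,
                               void *driver_private)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;

        return migrate_vma(&dummy_migrate_ops, vma, npages,
                           start, end, src, dst, driver_private);
}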
> > Second it allows to mirror process address space on a device.
>
> Why? Why is this a requirement, how will it be used, what are the
> use cases, etc?
From the above, the requirement is that any address the CPU can access can
also be accessed by the device with the same restrictions (like read/write
protection). This greatly simplifies the use of such devices, either
transparently by the compiler without the programmer's knowledge, or through
some library, again without the main program developer's knowledge. The whole
point is to make it easier to use things like GPUs without having to ask
developers to use a special memory allocator and to duplicate their datasets.
>
> I spent a bit of time trying to locate a decent writeup of this feature
> but wasn't able to locate one. I'm not seeing a Documentation/ update
> in this patchset. Perhaps if you were to sit down and write a detailed
> Documentation/vm/hmm.txt then that would be a good starting point.
Attached is hmm.txt. Like I said, I thought that all the previous at-length
descriptions that I have given in the numerous postings of the patchset
were enough and that I only needed to refresh people's memory.
>
> This stuff is important - it's not really feasible to perform a decent
> review of this proposal unless the reviewer has access to this
> high-level conceptual stuff.
Do the above and the attached documentation answer your questions? Are
there things I should describe more thoroughly, or aspects you feel are
missing?
Cheers,
Jérôme
On 03/16/2017 04:05 PM, Andrew Morton wrote:
> On Thu, 16 Mar 2017 12:05:26 -0400 Jérôme Glisse <[email protected]> wrote:
>
>> +static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>> +{
>> + if (!(mpfn & MIGRATE_PFN_VALID))
>> + return NULL;
>> + return pfn_to_page(mpfn & MIGRATE_PFN_MASK);
>> +}
>
> i386 allnoconfig:
>
> In file included from mm/page_alloc.c:61:
> ./include/linux/migrate.h: In function 'migrate_pfn_to_page':
> ./include/linux/migrate.h:139: warning: left shift count >= width of type
> ./include/linux/migrate.h:141: warning: left shift count >= width of type
> ./include/linux/migrate.h: In function 'migrate_pfn_size':
> ./include/linux/migrate.h:146: warning: left shift count >= width of type
>
It seems clear that this was never meant to work with < 64-bit pfns:
// migrate.h excerpt:
#define MIGRATE_PFN_VALID (1UL << (BITS_PER_LONG_LONG - 1))
#define MIGRATE_PFN_MIGRATE (1UL << (BITS_PER_LONG_LONG - 2))
#define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
#define MIGRATE_PFN_LOCKED (1UL << (BITS_PER_LONG_LONG - 4))
#define MIGRATE_PFN_WRITE (1UL << (BITS_PER_LONG_LONG - 5))
#define MIGRATE_PFN_DEVICE (1UL << (BITS_PER_LONG_LONG - 6))
#define MIGRATE_PFN_ERROR (1UL << (BITS_PER_LONG_LONG - 7))
#define MIGRATE_PFN_MASK ((1UL << (BITS_PER_LONG_LONG - PAGE_SHIFT)) - 1)
...obviously, there is not enough room for these flags, in a 32-bit pfn.
So, given the current HMM design, I think we are going to have to provide a 32-bit version of these
routines (migrate_pfn_to_page, and related) that is a no-op, right?
thanks
John Hubbard
NVIDIA
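As a side note, the warning itself is easy to reproduce outside the kernel.
The tiny standalone program below (with BITS_PER_LONG_LONG hard-coded for
illustration, so this is not kernel code) shows one way to keep the constants
well defined on every architecture, namely using 1ULL so the constant is
64 bits wide regardless of sizeof(unsigned long):

#include <stdio.h>

#define BITS_PER_LONG_LONG 64

/* With 1UL instead of 1ULL, a 32-bit build (e.g. gcc -m32) warns with
 * "left shift count >= width of type", because unsigned long is only
 * 32 bits wide there. 1ULL keeps the constant 64 bits wide everywhere. */
#define MIGRATE_PFN_VALID   (1ULL << (BITS_PER_LONG_LONG - 1))
#define MIGRATE_PFN_MIGRATE (1ULL << (BITS_PER_LONG_LONG - 2))

int main(void)
{
        printf("MIGRATE_PFN_VALID   = 0x%llx\n",
               (unsigned long long)MIGRATE_PFN_VALID);
        printf("MIGRATE_PFN_MIGRATE = 0x%llx\n",
               (unsigned long long)MIGRATE_PFN_MIGRATE);
        return 0;
}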
On Fri, Mar 17, 2017 at 11:22 AM, John Hubbard <[email protected]> wrote:
> On 03/16/2017 04:05 PM, Andrew Morton wrote:
>>
>> On Thu, 16 Mar 2017 12:05:26 -0400 Jérôme Glisse <[email protected]>
>> wrote:
>>
>>> +static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>> +{
>>> + if (!(mpfn & MIGRATE_PFN_VALID))
>>> + return NULL;
>>> + return pfn_to_page(mpfn & MIGRATE_PFN_MASK);
>>> +}
>>
>>
>> i386 allnoconfig:
>>
>> In file included from mm/page_alloc.c:61:
>> ./include/linux/migrate.h: In function 'migrate_pfn_to_page':
>> ./include/linux/migrate.h:139: warning: left shift count >= width of type
>> ./include/linux/migrate.h:141: warning: left shift count >= width of type
>> ./include/linux/migrate.h: In function 'migrate_pfn_size':
>> ./include/linux/migrate.h:146: warning: left shift count >= width of type
>>
>
> It seems clear that this was never meant to work with < 64-bit pfns:
>
> // migrate.h excerpt:
> #define MIGRATE_PFN_VALID (1UL << (BITS_PER_LONG_LONG - 1))
> #define MIGRATE_PFN_MIGRATE (1UL << (BITS_PER_LONG_LONG - 2))
> #define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
> #define MIGRATE_PFN_LOCKED (1UL << (BITS_PER_LONG_LONG - 4))
> #define MIGRATE_PFN_WRITE (1UL << (BITS_PER_LONG_LONG - 5))
> #define MIGRATE_PFN_DEVICE (1UL << (BITS_PER_LONG_LONG - 6))
> #define MIGRATE_PFN_ERROR (1UL << (BITS_PER_LONG_LONG - 7))
> #define MIGRATE_PFN_MASK ((1UL << (BITS_PER_LONG_LONG - PAGE_SHIFT))
> - 1)
>
> ...obviously, there is not enough room for these flags, in a 32-bit pfn.
>
> So, given the current HMM design, I think we are going to have to provide a
> 32-bit version of these routines (migrate_pfn_to_page, and related) that is
> a no-op, right?
Or make the HMM Kconfig feature 64BIT only by making it depend on 64BIT?
Balbir Singh
On 03/16/2017 05:45 PM, Balbir Singh wrote:
> On Fri, Mar 17, 2017 at 11:22 AM, John Hubbard <[email protected]> wrote:
>> On 03/16/2017 04:05 PM, Andrew Morton wrote:
>>>
>>> On Thu, 16 Mar 2017 12:05:26 -0400 Jérôme Glisse <[email protected]>
>>> wrote:
>>>
>>>> +static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>>> +{
>>>> + if (!(mpfn & MIGRATE_PFN_VALID))
>>>> + return NULL;
>>>> + return pfn_to_page(mpfn & MIGRATE_PFN_MASK);
>>>> +}
>>>
>>>
>>> i386 allnoconfig:
>>>
>>> In file included from mm/page_alloc.c:61:
>>> ./include/linux/migrate.h: In function 'migrate_pfn_to_page':
>>> ./include/linux/migrate.h:139: warning: left shift count >= width of type
>>> ./include/linux/migrate.h:141: warning: left shift count >= width of type
>>> ./include/linux/migrate.h: In function 'migrate_pfn_size':
>>> ./include/linux/migrate.h:146: warning: left shift count >= width of type
>>>
>>
>> It seems clear that this was never meant to work with < 64-bit pfns:
>>
>> // migrate.h excerpt:
>> #define MIGRATE_PFN_VALID (1UL << (BITS_PER_LONG_LONG - 1))
>> #define MIGRATE_PFN_MIGRATE (1UL << (BITS_PER_LONG_LONG - 2))
>> #define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
>> #define MIGRATE_PFN_LOCKED (1UL << (BITS_PER_LONG_LONG - 4))
>> #define MIGRATE_PFN_WRITE (1UL << (BITS_PER_LONG_LONG - 5))
>> #define MIGRATE_PFN_DEVICE (1UL << (BITS_PER_LONG_LONG - 6))
>> #define MIGRATE_PFN_ERROR (1UL << (BITS_PER_LONG_LONG - 7))
>> #define MIGRATE_PFN_MASK ((1UL << (BITS_PER_LONG_LONG - PAGE_SHIFT))
>> - 1)
>>
>> ...obviously, there is not enough room for these flags, in a 32-bit pfn.
>>
>> So, given the current HMM design, I think we are going to have to provide a
>> 32-bit version of these routines (migrate_pfn_to_page, and related) that is
>> a no-op, right?
>
> Or make the HMM Kconfig feature 64BIT only by making it depend on 64BIT?
>
Yes, that was my first reaction too, but these particular routines are aspiring to be generic
routines--in fact, you have had an influence there, because these might possibly help with NUMA
migrations. :)
So it would look odd to see this:
#ifdef CONFIG_HMM
int migrate_vma(const struct migrate_vma_ops *ops,
struct vm_area_struct *vma,
unsigned long mentries,
unsigned long start,
unsigned long end,
unsigned long *src,
unsigned long *dst,
void *private)
{
//...implementation
#endif
...because migrate_vma() does not sound HMM-specific, and it is, after all, in migrate.h and
migrate.c. We probably want a more generic approach (not sure if I've picked exactly the right
token to #ifdef on, but it's close):
#ifdef CONFIG_64BIT
int migrate_vma(const struct migrate_vma_ops *ops,
struct vm_area_struct *vma,
unsigned long mentries,
unsigned long start,
unsigned long end,
unsigned long *src,
unsigned long *dst,
void *private)
{
/* ... full implementation */
}
#else
int migrate_vma(const struct migrate_vma_ops *ops,
struct vm_area_struct *vma,
unsigned long mentries,
unsigned long start,
unsigned long end,
unsigned long *src,
unsigned long *dst,
void *private)
{
return -EINVAL; /* or something more appropriate */
}
#endif
thanks
John Hubbard
NVIDIA
>
> Balbir Singh
>
> On 03/16/2017 05:45 PM, Balbir Singh wrote:
> > On Fri, Mar 17, 2017 at 11:22 AM, John Hubbard <[email protected]> wrote:
> >> On 03/16/2017 04:05 PM, Andrew Morton wrote:
> >>>
> >>> On Thu, 16 Mar 2017 12:05:26 -0400 Jérôme Glisse <[email protected]>
> >>> wrote:
> >>>
> >>>> +static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> >>>> +{
> >>>> + if (!(mpfn & MIGRATE_PFN_VALID))
> >>>> + return NULL;
> >>>> + return pfn_to_page(mpfn & MIGRATE_PFN_MASK);
> >>>> +}
> >>>
> >>>
> >>> i386 allnoconfig:
> >>>
> >>> In file included from mm/page_alloc.c:61:
> >>> ./include/linux/migrate.h: In function 'migrate_pfn_to_page':
> >>> ./include/linux/migrate.h:139: warning: left shift count >= width of type
> >>> ./include/linux/migrate.h:141: warning: left shift count >= width of type
> >>> ./include/linux/migrate.h: In function 'migrate_pfn_size':
> >>> ./include/linux/migrate.h:146: warning: left shift count >= width of type
> >>>
> >>
> >> It seems clear that this was never meant to work with < 64-bit pfns:
> >>
> >> // migrate.h excerpt:
> >> #define MIGRATE_PFN_VALID (1UL << (BITS_PER_LONG_LONG - 1))
> >> #define MIGRATE_PFN_MIGRATE (1UL << (BITS_PER_LONG_LONG - 2))
> >> #define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
> >> #define MIGRATE_PFN_LOCKED (1UL << (BITS_PER_LONG_LONG - 4))
> >> #define MIGRATE_PFN_WRITE (1UL << (BITS_PER_LONG_LONG - 5))
> >> #define MIGRATE_PFN_DEVICE (1UL << (BITS_PER_LONG_LONG - 6))
> >> #define MIGRATE_PFN_ERROR (1UL << (BITS_PER_LONG_LONG - 7))
> >> #define MIGRATE_PFN_MASK ((1UL << (BITS_PER_LONG_LONG -
> >> PAGE_SHIFT))
> >> - 1)
> >>
> >> ...obviously, there is not enough room for these flags, in a 32-bit pfn.
> >>
> >> So, given the current HMM design, I think we are going to have to provide
> >> a
> >> 32-bit version of these routines (migrate_pfn_to_page, and related) that
> >> is
> >> a no-op, right?
> >
> > Or make the HMM Kconfig feature 64BIT only by making it depend on 64BIT?
> >
>
> Yes, that was my first reaction too, but these particular routines are
> aspiring to be generic
> routines--in fact, you have had an influence there, because these might
> possibly help with NUMA
> migrations. :)
>
> So it would look odd to see this:
>
> #ifdef CONFIG_HMM
> int migrate_vma(const struct migrate_vma_ops *ops,
> struct vm_area_struct *vma,
> unsigned long mentries,
> unsigned long start,
> unsigned long end,
> unsigned long *src,
> unsigned long *dst,
> void *private)
> {
> //...implementation
> #endif
>
> ...because migrate_vma() does not sound HMM-specific, and it is, after all,
> in migrate.h and
> migrate.c. We probably want this a more generic approach (not sure if I've
> picked exactly the right
> token to #ifdef on, but it's close):
>
> #ifdef CONFIG_64BIT
> int migrate_vma(const struct migrate_vma_ops *ops,
> struct vm_area_struct *vma,
> unsigned long mentries,
> unsigned long start,
> unsigned long end,
> unsigned long *src,
> unsigned long *dst,
> void *private)
> {
> /* ... full implementation */
> }
>
> #else
> int migrate_vma(const struct migrate_vma_ops *ops,
> struct vm_area_struct *vma,
> unsigned long mentries,
> unsigned long start,
> unsigned long end,
> unsigned long *src,
> unsigned long *dst,
> void *private)
> {
> return -EINVAL; /* or something more appropriate */
> }
> #endif
>
> thanks
> John Hubbard
> NVIDIA
The original intention was for it to be 64-bit only; 32-bit is a dying
species, and before the hmm_ prefix was split out of this code and it was
made generic, it was behind a 64-bit flag.
If later on someone really cares about 32-bit, the only option is to move to u64.
Cheers,
Jérôme
On Thu, 16 Mar 2017 21:52:23 -0400 (EDT) Jerome Glisse <[email protected]> wrote:
> The original intention was for it to be 64bit only, 32bit is a dying
> species and before splitting out hmm_ prefix from this code and moving
> it to be generic it was behind a 64bit flag.
>
> If latter one someone really care about 32bit we can only move to u64
I think that's the best compromise. If someone wants this on 32-bit
then they're free to get it working. That "someone" will actually be
able to test it, which you clearly won't be doing!
However, please do check that the impact of this patchset on 32-bit's
`size vmlinux' is minimal. Preferably zero.
>> Or make the HMM Kconfig feature 64BIT only by making it depend on 64BIT?
>>
>
> Yes, that was my first reaction too, but these particular routines are
> aspiring to be generic routines--in fact, you have had an influence there,
> because these might possibly help with NUMA migrations. :)
>
Yes, I still stick to them being generic, but I'd be OK if they worked
just for 64-bit systems.
Having said that, even the 64-bit version works only for physical
sizes up to 64 - PAGE_SHIFT, which is a little limiting I think.
One option is to make pfns unsigned long long and do the 32-bit and
64-bit computations separately.
Option 2 could be something like you said:
a. Define a __weak migrate_vma to return -EINVAL
b. In a 64BIT-only file define migrate_vma
Option 3
Something totally different.
If we care to support 32-bit we go with option 1, else option 2 is a good
starting point. There might be other ways of doing option 2, like you've
suggested.
Balbir
On 17/03/17 14:42, Balbir Singh wrote:
>>> Or make the HMM Kconfig feature 64BIT only by making it depend on 64BIT?
>>>
>>
>> Yes, that was my first reaction too, but these particular routines are
>> aspiring to be generic routines--in fact, you have had an influence there,
>> because these might possibly help with NUMA migrations. :)
>>
>
> Yes, I still stick to them being generic, but I'd be OK if they worked
> just for 64 bit systems.
> Having said that even the 64 bit works version work for upto physical
> sizes of 64 - PAGE_SHIFT
> which is a little limiting I think.
>
> One option is to make pfn's unsigned long long and do 32 and 64 bit computations
> separately
>
> Option 2, could be something like you said
>
> a. Define a __weak migrate_vma to return -EINVAL
> b. In a 64BIT only file define migrate_vma
>
> Option 3
>
> Something totally different
>
> If we care to support 32 bit we go with 1, else option 2 is a good
> starting point. There might
> be other ways of doing option 2, like you've suggested
So this is what I ended up with, a quick fix for the 32 bit
build failures
Date: Fri, 17 Mar 2017 15:42:52 +1100
Subject: [PATCH] mm/hmm: Fix build on 32 bit systems
Fix build breakage of hmm-v18 in the current mmotm by
making the migrate_vma() and related functions 64
bit only. The 32 bit variant will return -EINVAL.
There are other approaches to solving this problem,
but we can enable 32 bit systems as we need them.
This patch tries to limit the impact on 32 bit systems
by turning HMM off on them and not enabling the migrate
functions.
I've built this on ppc64/i386 and x86_64
Signed-off-by: Balbir Singh <[email protected]>
---
include/linux/migrate.h | 18 +++++++++++++++++-
mm/Kconfig | 4 +++-
mm/migrate.c | 3 ++-
3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 01f4945..1888a70 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -124,7 +124,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
}
#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
-
+#ifdef CONFIG_64BIT
#define MIGRATE_PFN_VALID (1UL << (BITS_PER_LONG_LONG - 1))
#define MIGRATE_PFN_MIGRATE (1UL << (BITS_PER_LONG_LONG - 2))
#define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
@@ -145,6 +145,7 @@ static inline unsigned long migrate_pfn_size(unsigned long mpfn)
{
return mpfn & MIGRATE_PFN_HUGE ? PMD_SIZE : PAGE_SIZE;
}
+#endif
/*
* struct migrate_vma_ops - migrate operation callback
@@ -194,6 +195,7 @@ struct migrate_vma_ops {
void *private);
};
+#ifdef CONFIG_64BIT
int migrate_vma(const struct migrate_vma_ops *ops,
struct vm_area_struct *vma,
unsigned long mentries,
@@ -202,5 +204,19 @@ int migrate_vma(const struct migrate_vma_ops *ops,
unsigned long *src,
unsigned long *dst,
void *private);
+#else
+static inline int migrate_vma(const struct migrate_vma_ops *ops,
+ struct vm_area_struct *vma,
+ unsigned long mentries,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *src,
+ unsigned long *dst,
+ void *private)
+{
+ return -EINVAL;
+}
+#endif
+
#endif /* _LINUX_MIGRATE_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index a430d51..c13677f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -291,7 +291,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
config HMM
bool
- depends on MMU
+ depends on MMU && 64BIT
config HMM_MIRROR
bool "HMM mirror CPU page table into a device page table"
@@ -307,6 +307,7 @@ config HMM_MIRROR
Second side of the equation is replicating CPU page table content for
range of virtual address. This require careful synchronization with
CPU page table update.
+ depends on 64BIT
config HMM_DEVMEM
bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
@@ -314,6 +315,7 @@ config HMM_DEVMEM
help
HMM devmem are helpers to leverage new ZONE_DEVICE feature. This is
just to avoid device driver to replicate boiler plate code.
+ depends on 64BIT
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
diff --git a/mm/migrate.c b/mm/migrate.c
index b9d25d1..15f2972 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2080,7 +2080,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
#endif /* CONFIG_NUMA */
-
+#ifdef CONFIG_64BIT
struct migrate_vma {
struct vm_area_struct *vma;
unsigned long *dst;
@@ -2787,3 +2787,4 @@ int migrate_vma(const struct migrate_vma_ops *ops,
return 0;
}
EXPORT_SYMBOL(migrate_vma);
+#endif
--
2.10.2
Hi Jérôme,
On 2017/3/17 0:05, Jérôme Glisse wrote:
> This introduce a dummy HMM device class so device driver can use it to
> create hmm_device for the sole purpose of registering device memory.
May I ask where the latest dummy HMM device driver is?
I can only get this one: https://patchwork.kernel.org/patch/4352061/
Thanks,
Bob
> It is usefull to device driver that want to manage multiple physical
> device memory under same struct device umbrella.
>
> Changed since v1:
> - Improve commit message
> - Add drvdata parameter to set on struct device
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Signed-off-by: Evgeny Baskakov <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> Signed-off-by: Mark Hairgrove <[email protected]>
> Signed-off-by: Sherry Cheung <[email protected]>
> Signed-off-by: Subhash Gutti <[email protected]>
> ---
> include/linux/hmm.h | 22 +++++++++++-
> mm/hmm.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 117 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 3054ce7..e4e6b36 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -79,11 +79,11 @@
>
> #if IS_ENABLED(CONFIG_HMM)
>
> +#include <linux/device.h>
> #include <linux/migrate.h>
> #include <linux/memremap.h>
> #include <linux/completion.h>
>
> -
> struct hmm;
>
> /*
> @@ -433,6 +433,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
>
> return drvdata[1];
> }
> +
> +
> +/*
> + * struct hmm_device - fake device to hang device memory onto
> + *
> + * @device: device struct
> + * @minor: device minor number
> + */
> +struct hmm_device {
> + struct device device;
> + unsigned minor;
> +};
> +
> +/*
> + * Device driver that wants to handle multiple devices memory through a single
> + * fake device can use hmm_device to do so. This is purely a helper and it
> + * is not needed to make use of any HMM functionality.
> + */
> +struct hmm_device *hmm_device_new(void *drvdata);
> +void hmm_device_put(struct hmm_device *hmm_device);
> #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
>
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 019f379..c477bd1 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -24,6 +24,7 @@
> #include <linux/slab.h>
> #include <linux/sched.h>
> #include <linux/mmzone.h>
> +#include <linux/module.h>
> #include <linux/pagemap.h>
> #include <linux/swapops.h>
> #include <linux/hugetlb.h>
> @@ -1132,4 +1133,99 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
> return 0;
> }
> EXPORT_SYMBOL(hmm_devmem_fault_range);
> +
> +/*
> + * A device driver that wants to handle multiple devices memory through a
> + * single fake device can use hmm_device to do so. This is purely a helper
> + * and it is not needed to make use of any HMM functionality.
> + */
> +#define HMM_DEVICE_MAX 256
> +
> +static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
> +static DEFINE_SPINLOCK(hmm_device_lock);
> +static struct class *hmm_device_class;
> +static dev_t hmm_device_devt;
> +
> +static void hmm_device_release(struct device *device)
> +{
> + struct hmm_device *hmm_device;
> +
> + hmm_device = container_of(device, struct hmm_device, device);
> + spin_lock(&hmm_device_lock);
> + clear_bit(hmm_device->minor, hmm_device_mask);
> + spin_unlock(&hmm_device_lock);
> +
> + kfree(hmm_device);
> +}
> +
> +struct hmm_device *hmm_device_new(void *drvdata)
> +{
> + struct hmm_device *hmm_device;
> + int ret;
> +
> + hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
> + if (!hmm_device)
> + return ERR_PTR(-ENOMEM);
> +
> + ret = alloc_chrdev_region(&hmm_device->device.devt,0,1,"hmm_device");
> + if (ret < 0) {
> + kfree(hmm_device);
> + return NULL;
> + }
> +
> + spin_lock(&hmm_device_lock);
> + hmm_device->minor=find_first_zero_bit(hmm_device_mask,HMM_DEVICE_MAX);
> + if (hmm_device->minor >= HMM_DEVICE_MAX) {
> + spin_unlock(&hmm_device_lock);
> + kfree(hmm_device);
> + return NULL;
> + }
> + set_bit(hmm_device->minor, hmm_device_mask);
> + spin_unlock(&hmm_device_lock);
> +
> + dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
> + hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
> + hmm_device->minor);
> + hmm_device->device.release = hmm_device_release;
> + dev_set_drvdata(&hmm_device->device, drvdata);
> + hmm_device->device.class = hmm_device_class;
> + device_initialize(&hmm_device->device);
> +
> + return hmm_device;
> +}
> +EXPORT_SYMBOL(hmm_device_new);
> +
> +void hmm_device_put(struct hmm_device *hmm_device)
> +{
> + put_device(&hmm_device->device);
> +}
> +EXPORT_SYMBOL(hmm_device_put);
> +
> +static int __init hmm_init(void)
> +{
> + int ret;
> +
> + ret = alloc_chrdev_region(&hmm_device_devt, 0,
> + HMM_DEVICE_MAX,
> + "hmm_device");
> + if (ret)
> + return ret;
> +
> + hmm_device_class = class_create(THIS_MODULE, "hmm_device");
> + if (IS_ERR(hmm_device_class)) {
> + unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
> + return PTR_ERR(hmm_device_class);
> + }
> + return 0;
> +}
> +
> +static void __exit hmm_exit(void)
> +{
> + unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
> + class_destroy(hmm_device_class);
> +}
> +
> +module_init(hmm_init);
> +module_exit(hmm_exit);
> +MODULE_LICENSE("GPL");
> #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
>
On 03/16/2017 09:51 PM, Balbir Singh wrote:
[...]
> So this is what I ended up with, a quick fix for the 32 bit
> build failures
>
> Date: Fri, 17 Mar 2017 15:42:52 +1100
> Subject: [PATCH] mm/hmm: Fix build on 32 bit systems
>
> Fix build breakage of hmm-v18 in the current mmotm by
> making the migrate_vma() and related functions 64
> bit only. The 32 bit variant will return -EINVAL.
> There are other approaches to solving this problem,
> but we can enable 32 bit systems as we need them.
>
> This patch tries to limit the impact on 32 bit systems
> by turning HMM off on them and not enabling the migrate
> functions.
>
> I've built this on ppc64/i386 and x86_64
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
> include/linux/migrate.h | 18 +++++++++++++++++-
> mm/Kconfig | 4 +++-
> mm/migrate.c | 3 ++-
> 3 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 01f4945..1888a70 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -124,7 +124,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> }
> #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
>
> -
> +#ifdef CONFIG_64BIT
> #define MIGRATE_PFN_VALID (1UL << (BITS_PER_LONG_LONG - 1))
> #define MIGRATE_PFN_MIGRATE (1UL << (BITS_PER_LONG_LONG - 2))
> #define MIGRATE_PFN_HUGE (1UL << (BITS_PER_LONG_LONG - 3))
As long as we're getting this accurate, should we make that 1ULL, in all of the
MIGRATE_PFN_* defines? The 1ULL is what determines the type of the resulting number,
so it's one more tiny piece of type correctness that is good to have.
The rest of this fix looks good, and the above is not technically necessary (the
code that uses it will force its own type anyway), so:
Reviewed-by: John Hubbard <[email protected]>
thanks
John Hubbard
NVIDIA
> @@ -145,6 +145,7 @@ static inline unsigned long migrate_pfn_size(unsigned long mpfn)
> {
> return mpfn & MIGRATE_PFN_HUGE ? PMD_SIZE : PAGE_SIZE;
> }
> +#endif
>
> /*
> * struct migrate_vma_ops - migrate operation callback
> @@ -194,6 +195,7 @@ struct migrate_vma_ops {
> void *private);
> };
>
> +#ifdef CONFIG_64BIT
> int migrate_vma(const struct migrate_vma_ops *ops,
> struct vm_area_struct *vma,
> unsigned long mentries,
> @@ -202,5 +204,19 @@ int migrate_vma(const struct migrate_vma_ops *ops,
> unsigned long *src,
> unsigned long *dst,
> void *private);
> +#else
> +static inline int migrate_vma(const struct migrate_vma_ops *ops,
> + struct vm_area_struct *vma,
> + unsigned long mentries,
> + unsigned long start,
> + unsigned long end,
> + unsigned long *src,
> + unsigned long *dst,
> + void *private)
> +{
> + return -EINVAL;
> +}
> +#endif
> +
>
> #endif /* _LINUX_MIGRATE_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a430d51..c13677f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -291,7 +291,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>
> config HMM
> bool
> - depends on MMU
> + depends on MMU && 64BIT
>
> config HMM_MIRROR
> bool "HMM mirror CPU page table into a device page table"
> @@ -307,6 +307,7 @@ config HMM_MIRROR
> Second side of the equation is replicating CPU page table content for
> range of virtual address. This require careful synchronization with
> CPU page table update.
> + depends on 64BIT
>
> config HMM_DEVMEM
> bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
> @@ -314,6 +315,7 @@ config HMM_DEVMEM
> help
> HMM devmem are helpers to leverage new ZONE_DEVICE feature. This is
> just to avoid device driver to replicate boiler plate code.
> + depends on 64BIT
>
> config PHYS_ADDR_T_64BIT
> def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b9d25d1..15f2972 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2080,7 +2080,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>
> #endif /* CONFIG_NUMA */
>
> -
> +#ifdef CONFIG_64BIT
> struct migrate_vma {
> struct vm_area_struct *vma;
> unsigned long *dst;
> @@ -2787,3 +2787,4 @@ int migrate_vma(const struct migrate_vma_ops *ops,
> return 0;
> }
> EXPORT_SYMBOL(migrate_vma);
> +#endif
>
On 2017/3/17 7:49, Jerome Glisse wrote:
> On Thu, Mar 16, 2017 at 01:43:21PM -0700, Andrew Morton wrote:
> On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse <[email protected]> wrote:
>>
>>> Cliff note:
>>
>> "Cliff's notes" isn't appropriate for a large feature such as this.
>> Where's the long-form description? One which permits readers to fully
>> understand the requirements, design, alternative designs, the
>> implementation, the interface(s), etc?
>>
>> Have you ever spoken about HMM at a conference? If so, the supporting
>> presentation documents might help here. That's the level of detail
>> which should be presented here.
>
> Longer description of patchset rational, motivation and design choices
> were given in the first few posting of the patchset to which i included
> a link in my cover letter. Also given that i presented that for last 3
> or 4 years to mm summit and kernel summit i thought that by now peoples
> were familiar about the topic and wanted to spare them the long version.
> My bad.
>
> I attach a patch that is a first stab at a Documentation/hmm.txt that
> explain the motivation and rational behind HMM. I can probably add a
> section about how to use HMM from device driver point of view.
>
Please, that would be very helpful!
> +3) Share address space and migration
> +
> +HMM intends to provide two main features. First one is to share the address
> +space by duplication the CPU page table into the device page table so same
> +address point to same memory and this for any valid main memory address in
> +the process address space.
Is this an optional feature?
I mean the device doesn't have to duplicate the CPU page table,
but can make use of only the second (migration) feature.
> +The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
> +allow to allocate a struct page for each page of the device memory. Those page
> +are special because the CPU can not map them. They however allow to migrate
> +main memory to device memory using exhisting migration mechanism and everything
> +looks like if page was swap out to disk from CPU point of view. Using a struct
> +page gives the easiest and cleanest integration with existing mm mechanisms.
> +Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
> +for the device memory and second to perform migration. Policy decision of what
> +and when to migrate things is left to the device driver.
> +
> +Note that any CPU acess to a device page trigger a page fault which initiate a
> +migration back to system memory so that CPU can access it.
A bit confused here: do you mean the CPU accesses a main memory page, but that page has been migrated to device memory?
Then a page fault will be triggered and initiate a migration back.
Thanks,
Bob
On 2017/3/17 7:49, Jerome Glisse wrote:
> On Thu, Mar 16, 2017 at 01:43:21PM -0700, Andrew Morton wrote:
> On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse <[email protected]> wrote:
>>
>>> Cliff note:
>>
>> "Cliff's notes" isn't appropriate for a large feature such as this.
>> Where's the long-form description? One which permits readers to fully
>> understand the requirements, design, alternative designs, the
>> implementation, the interface(s), etc?
>>
>> Have you ever spoken about HMM at a conference? If so, the supporting
>> presentation documents might help here. That's the level of detail
>> which should be presented here.
>
> Longer description of patchset rational, motivation and design choices
> were given in the first few posting of the patchset to which i included
> a link in my cover letter. Also given that i presented that for last 3
> or 4 years to mm summit and kernel summit i thought that by now peoples
> were familiar about the topic and wanted to spare them the long version.
> My bad.
>
> I attach a patch that is a first stab at a Documentation/hmm.txt that
> explain the motivation and rational behind HMM. I can probably add a
> section about how to use HMM from device driver point of view.
>
And a simple example program/pseudo-code making use of the device memory
would also be very useful for people who don't have GPU programming experience :)
Regards,
Bob
On Fri, Mar 17, 2017 at 04:29:10PM +0800, Bob Liu wrote:
> On 2017/3/17 7:49, Jerome Glisse wrote:
> > On Thu, Mar 16, 2017 at 01:43:21PM -0700, Andrew Morton wrote:
> >> On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse <[email protected]> wrote:
> >>
> >>> Cliff note:
> >>
> >> "Cliff's notes" isn't appropriate for a large feature such as this.
> >> Where's the long-form description? One which permits readers to fully
> >> understand the requirements, design, alternative designs, the
> >> implementation, the interface(s), etc?
> >>
> >> Have you ever spoken about HMM at a conference? If so, the supporting
> >> presentation documents might help here. That's the level of detail
> >> which should be presented here.
> >
> > Longer description of patchset rational, motivation and design choices
> > were given in the first few posting of the patchset to which i included
> > a link in my cover letter. Also given that i presented that for last 3
> > or 4 years to mm summit and kernel summit i thought that by now peoples
> > were familiar about the topic and wanted to spare them the long version.
> > My bad.
> >
> > I attach a patch that is a first stab at a Documentation/hmm.txt that
> > explain the motivation and rational behind HMM. I can probably add a
> > section about how to use HMM from device driver point of view.
> >
>
> Please, that would be very helpful!
>
> > +3) Share address space and migration
> > +
> > +HMM intends to provide two main features. First one is to share the address
> > +space by duplication the CPU page table into the device page table so same
> > +address point to same memory and this for any valid main memory address in
> > +the process address space.
>
> Is this an optional feature?
> I mean the device don't have to duplicate the CPU page table.
> But only make use of the second(migration) feature.
Correct, each feature can be used on its own without the other.
> > +The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
> > +allow to allocate a struct page for each page of the device memory. Those page
> > +are special because the CPU can not map them. They however allow to migrate
> > +main memory to device memory using exhisting migration mechanism and everything
> > +looks like if page was swap out to disk from CPU point of view. Using a struct
> > +page gives the easiest and cleanest integration with existing mm mechanisms.
> > +Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
> > +for the device memory and second to perform migration. Policy decision of what
> > +and when to migrate things is left to the device driver.
> > +
> > +Note that any CPU acess to a device page trigger a page fault which initiate a
> > +migration back to system memory so that CPU can access it.
>
> A bit confused here, do you mean CPU access to a main memory page but that page has
> been migrated to device memory?
> Then a page fault will be triggered and initiate a migration back.
If you migrate the page backing address A from a main memory page to a device page,
and then the CPU tries to access address A, you get a page fault because device memory
is not accessible by the CPU. From the kernel point of view the page fault is exactly
as if the page had been swapped out to disk.
At any point in time there is one and only one page backing an address, either a
regular main memory page or a device page. There is no change here to this fundamental
fact with respect to mm. The only difference is that device pages are not accessible
by the CPU.
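Schematically, condensing the patches quoted elsewhere in this thread (a
simplified sketch, not the exact code), the CPU side of that fault is:

/* In the CPU page fault path: the PTE for address A now holds a special
 * device swap entry instead of a present mapping, so the fault is handed
 * to the driver callback registered in the ZONE_DEVICE pgmap, which is
 * expected to migrate the data back to system memory before the access
 * is retried. */
entry = pte_to_swp_entry(vmf->orig_pte);
if (is_device_entry(entry))
        ret = device_entry_fault(vma, vmf->address, entry,
                                 vmf->flags, vmf->pmd);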
Cheers,
Jérôme
On Fri, Mar 17, 2017 at 04:39:28PM +0800, Bob Liu wrote:
> On 2017/3/17 7:49, Jerome Glisse wrote:
> > On Thu, Mar 16, 2017 at 01:43:21PM -0700, Andrew Morton wrote:
> >> On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse <[email protected]> wrote:
> >>
> >>> Cliff note:
> >>
> >> "Cliff's notes" isn't appropriate for a large feature such as this.
> >> Where's the long-form description? One which permits readers to fully
> >> understand the requirements, design, alternative designs, the
> >> implementation, the interface(s), etc?
> >>
> >> Have you ever spoken about HMM at a conference? If so, the supporting
> >> presentation documents might help here. That's the level of detail
> >> which should be presented here.
> >
> > Longer description of patchset rational, motivation and design choices
> > were given in the first few posting of the patchset to which i included
> > a link in my cover letter. Also given that i presented that for last 3
> > or 4 years to mm summit and kernel summit i thought that by now peoples
> > were familiar about the topic and wanted to spare them the long version.
> > My bad.
> >
> > I attach a patch that is a first stab at a Documentation/hmm.txt that
> > explain the motivation and rational behind HMM. I can probably add a
> > section about how to use HMM from device driver point of view.
> >
>
> And a simple example program/pseudo-code make use of the device memory
> would also very useful for person don't have GPU programming experience :)
Like I said, there is no userspace API to this. Right now it is under
driver control what and when to migrate. So this is specific to each
driver, and without a driver which uses this feature nothing happens.
Each driver will expose its own API that probably won't be exposed to
the end user but to the userspace driver (OpenCL, CUDA, C++, OpenMP,
...). We are not sure what kind of API we will expose in the nouveau
driver; this still needs to be discussed. Same for the AMD driver.
Cheers,
Jérôme
On Fri, Mar 17, 2017 at 02:55:57PM +0800, Bob Liu wrote:
> Hi Jérôme,
>
> > On 2017/3/17 0:05, Jérôme Glisse wrote:
> > This introduce a dummy HMM device class so device driver can use it to
> > create hmm_device for the sole purpose of registering device memory.
>
> May I ask where is the latest dummy HMM device driver?
> I can only get this one: https://patchwork.kernel.org/patch/4352061/
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-next
This is a 4.10 tree, but the dummy driver there applies on top of v18:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v18
This is really an example driver; it doesn't do anything useful besides
helping with testing and debugging.
Cheers,
Jérôme
On Thu, Mar 16, 2017 at 12:05:20PM -0400, Jérôme Glisse wrote:
> When hotpluging memory we want more informations on the type of memory and
> its properties. Replace the device boolean flag by an int and define a set
> of flags.
>
> New property for device memory is an opt-in flag to allow page migration
> from and to a ZONE_DEVICE. Existing user of ZONE_DEVICE are not expecting
> page migration to work for their pages. New changes to page migration i
> changing that and we now need a flag to explicitly opt-in page migration.
>
> Changes since v2:
> - pr_err() in case of hotplug failure
>
> Changes since v1:
> - Improved commit message
> - Improved define name
> - Improved comments
> - Typos
>
> Signed-off-by: Jérôme Glisse <[email protected]>
Fairly minor but it's standard for flags to be unsigned due to
uncertainty about what happens when a signed type is bit shifted.
May not apply to your case but fairly trivial to address.
On Thu, Mar 16, 2017 at 12:05:21PM -0400, Jérôme Glisse wrote:
> This does not affect non ZONE_DEVICE page. In order to allow
> ZONE_DEVICE page to be tracked we need to detect when refcount
> of a ZONE_DEVICE page reach 1 (not 0 as non ZONE_DEVICE page).
>
> This patch just move put_page_testzero() from put_page() to
> put_zone_device_page() and only for ZONE_DEVICE. It does not
> add any overhead compare to existing code.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> ---
> include/linux/mm.h | 8 +++++---
> kernel/memremap.c | 2 ++
> 2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5f01c88..28e8b28 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -793,11 +793,13 @@ static inline void put_page(struct page *page)
> {
> page = compound_head(page);
>
> + if (unlikely(is_zone_device_page(page))) {
> + put_zone_device_page(page);
> + return;
> + }
> +
> if (put_page_testzero(page))
> __put_page(page);
> -
> - if (unlikely(is_zone_device_page(page)))
> - put_zone_device_page(page);
> }
>
> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 40d4af8..c821946 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -190,6 +190,8 @@ EXPORT_SYMBOL(get_zone_device_page);
>
> void put_zone_device_page(struct page *page)
> {
> + page_ref_dec(page);
> +
> put_dev_pagemap(page->pgmap);
> }
> EXPORT_SYMBOL(put_zone_device_page);
So the page refcount goes to zero but where did the __put_page call go? I
haven't read the full series yet but I do note the next patch introduces
a callback. Maybe callbacks free the page but it looks optional. Maybe
it gets fixed later in the series, but the changelog should at least say
this is not bisect-safe, as this looks like a memory leak.
On Thu, Mar 16, 2017 at 12:05:22PM -0400, Jérôme Glisse wrote:
> When a ZONE_DEVICE page refcount reach 1 it means it is free and nobody
> is holding a reference on it (only device to which the memory belong do).
> Add a callback and call it when that happen so device driver can implement
> their own free page management.
>
If it does not implement its own management then it still needs to be
freed to the main allocator.
Nits mainly
On Thu, Mar 16, 2017 at 12:05:23PM -0400, Jérôme Glisse wrote:
> This add support for un-addressable device memory. Such memory is hotpluged
hotplugged
> only so we can have struct page but we should never map them as such memory
> can not be accessed by CPU. For that reason it uses a special swap entry for
> CPU page table entry.
>
> This patch implement all the logic from special swap type to handling CPU
> page fault through a callback specified in the ZONE_DEVICE pgmap struct.
>
> Architecture that wish to support un-addressable device memory should make
> sure to never populate the kernel linar mapping for the physical range.
>
> This feature potentially breaks memory hotplug unless every driver using it
> magically predicts the future addresses of where memory will be hotplugged.
>
Note in the changelog that enabling this option reduces the maximum
number of swapfiles that can be activated.
Also, you have to read quite a lot of the patch before you learn that
the struct pages are required for migration. It'd be nice to lead with
why struct pages are required for memory that can never be
CPU-accessible.
> <SNIP>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 45e91dd..ba564bc 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -51,6 +51,17 @@ static inline int current_is_kswapd(void)
> */
>
> /*
> + * Un-addressable device memory support
> + */
> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> +#define SWP_DEVICE_NUM 2
> +#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
> +#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
> +#else
> +#define SWP_DEVICE_NUM 0
> +#endif
> +
> +/*
> * NUMA node memory migration support
> */
> #ifdef CONFIG_MIGRATION
> @@ -72,7 +83,8 @@ static inline int current_is_kswapd(void)
> #endif
>
> #define MAX_SWAPFILES \
> - ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> + ((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> + SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>
> /*
> * Magic header for a swap area. The first part of the union is
The max swap count is reduced here and it looks fine, other than that the
limitation should be made clear.
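For reference: assuming the usual MAX_SWAPFILES_SHIFT of 5 with both the
migration (2) and hwpoison (1) entries configured, reserving the two device
entries takes MAX_SWAPFILES from (1 << 5) - 3 = 29 down to 27; that is the
kind of number worth spelling out in the changelog.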
> @@ -435,8 +447,8 @@ static inline void show_swap_cache_info(void)
> {
> }
>
> -#define free_swap_and_cache(swp) is_migration_entry(swp)
> -#define swapcache_prepare(swp) is_migration_entry(swp)
> +#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
> +#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
>
> static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
> {
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 5c3a5f3..0e339f0 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
> return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
> }
>
> +#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> +{
> + return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
> +}
> +
Minor naming nit. Migration has READ and WRITE migration types but this
has SWP_DEVICE and SWP_DEVICE_WRITE. This was the first time it was
clear there are READ/WRITE types but with different naming.
> +static inline bool is_device_entry(swp_entry_t entry)
> +{
> + int type = swp_type(entry);
> + return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
> +}
> +
> +static inline void make_device_entry_read(swp_entry_t *entry)
> +{
> + *entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
> +}
> +
> +static inline bool is_write_device_entry(swp_entry_t entry)
> +{
> + return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
> +}
> +
> +static inline struct page *device_entry_to_page(swp_entry_t entry)
> +{
> + return pfn_to_page(swp_offset(entry));
> +}
> +
Otherwise, looks ok and fairly standard.
> {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 19df1f5..d42f039f 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -18,6 +18,8 @@
> #include <linux/io.h>
> #include <linux/mm.h>
> #include <linux/memory_hotplug.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>
> #ifndef ioremap_cache
> /* temporary while we convert existing ioremap_cache users to memremap */
> @@ -203,6 +205,21 @@ void put_zone_device_page(struct page *page)
> }
> EXPORT_SYMBOL(put_zone_device_page);
>
> +#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
> +int device_entry_fault(struct vm_area_struct *vma,
> + unsigned long addr,
> + swp_entry_t entry,
> + unsigned flags,
> + pmd_t *pmdp)
> +{
> + struct page *page = device_entry_to_page(entry);
> +
> + BUG_ON(!page->pgmap->page_fault);
> + return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
> +}
The BUG_ON is overkill. A NULL callback would trigger a NULL pointer
exception immediately anyway and would likely be a fairly obvious driver bug.
More importantly, there should be a description of what the
responsibilities of page_fault are. Saying it's a fault is hardly helpful
when it could say here (or in the struct description) that the handler is
responsible for migrating the data from device memory to CPU-accessible
memory.
What is expected to happen if migration fails? Think of something crazy like
a process that is memcg-limited, is mostly anonymous memory, and there is
no swap. Is it acceptable for the application to be killed?
> +EXPORT_SYMBOL(device_entry_fault);
> +#endif /* CONFIG_DEVICE_UNADDRESSABLE */
> +
> static void pgmap_radix_release(struct resource *res)
> {
> resource_size_t key, align_start, align_size, align_end;
> @@ -258,7 +275,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>
> lock_device_hotplug();
> mem_hotplug_begin();
> - arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
> + arch_remove_memory(align_start, align_size, pgmap->flags);
> mem_hotplug_done();
> unlock_device_hotplug();
>
> @@ -338,6 +355,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> pgmap->ref = ref;
> pgmap->res = &page_map->res;
> pgmap->flags = MEMORY_DEVICE;
> + pgmap->page_fault = NULL;
> pgmap->page_free = NULL;
> pgmap->data = NULL;
>
> @@ -378,7 +396,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>
> lock_device_hotplug();
> mem_hotplug_begin();
> - error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
> + error = arch_add_memory(nid, align_start, align_size, pgmap->flags);
> mem_hotplug_done();
> unlock_device_hotplug();
> if (error)
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 9b8fccb..9502315 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -700,6 +700,18 @@ config ZONE_DEVICE
>
> If FS_DAX is enabled, then say Y.
>
> +config DEVICE_UNADDRESSABLE
> + bool "Un-addressable device memory (GPU memory, ...)"
> + depends on ZONE_DEVICE
> +
> + help
> + Allow to create struct page for un-addressable device memory
> + ie memory that is only accessible by the device (or group of
> + devices).
> +
> + Having struct page is necessary for process memory migration
> + to device memory.
> +
> config FRAME_VECTOR
> bool
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 235ba51..33aff303 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -49,6 +49,7 @@
> #include <linux/swap.h>
> #include <linux/highmem.h>
> #include <linux/pagemap.h>
> +#include <linux/memremap.h>
> #include <linux/ksm.h>
> #include <linux/rmap.h>
> #include <linux/export.h>
> @@ -927,6 +928,25 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte = pte_swp_mksoft_dirty(pte);
> set_pte_at(src_mm, addr, src_pte, pte);
> }
> + } else if (is_device_entry(entry)) {
> + page = device_entry_to_page(entry);
> +
> + /*
> + * Update rss count even for un-addressable page as
> + * they should be consider just like any other page.
> + */
> + get_page(page);
> + rss[mm_counter(page)]++;
> + page_dup_rmap(page, false);
> +
> + if (is_write_device_entry(entry) &&
> + is_cow_mapping(vm_flags)) {
> + make_device_entry_read(&entry);
> + pte = swp_entry_to_pte(entry);
> + if (pte_swp_soft_dirty(*src_pte))
> + pte = pte_swp_mksoft_dirty(pte);
> + set_pte_at(src_mm, addr, src_pte, pte);
> + }
> }
> goto out_set_pte;
> }
I'm curious about the soft dirty page handling part. One of the main reasons
for soft dirty tracking is features like checkpoint/restore, but the memory
in question is inaccessible from the CPU, so how should that be handled?
Superficially, it looks like soft dirty handling of inaccessible memory is a
no-go. I would think that is even a reasonable restriction, but it should be
stated clearly.
Presumably if an entry needs COW at some point in the future, it's the
device callback that handles it.
> @@ -3679,6 +3735,7 @@ static int wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> static int handle_pte_fault(struct vm_fault *vmf)
> {
> pte_t entry;
> + struct page *page;
>
> if (unlikely(pmd_none(*vmf->pmd))) {
> /*
> @@ -3729,9 +3786,16 @@ static int handle_pte_fault(struct vm_fault *vmf)
> if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> return do_numa_page(vmf);
>
> + /* Catch mapping of un-addressable memory this should never happen */
> + entry = vmf->orig_pte;
> + page = pfn_to_page(pte_pfn(entry));
> + if (!is_addressable_page(page)) {
> + print_bad_pte(vmf->vma, vmf->address, entry, page);
> + return VM_FAULT_SIGBUS;
> + }
> +
You're adding a new pfn_to_page on every PTE fault, it happens
unconditionally whether DEVICE_UNADDRESSABLE is configured or not, and it
exists only for a debugging check. It's also a seriously paranoid check
because at this point the PTE is present, so something seriously bad must
already have happened in a fault handler in the past.
Consider removing this entirely or making it a debug-only check. What
happens if such a PTE is left in place and the CPU accesses it anyway?
SIGBUS? If it's harmless (other than an application crash), remove the
check. If it's actively dangerous then move it behind a static branch, in a
separate patch, that is activated only when there is CPU-inaccessible memory
in the system.
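To illustrate the static branch idea, it would look roughly like this
(sketch only; the key name is invented and it would be enabled from
devm_memremap_pages() when un-addressable memory is hotplugged):

DEFINE_STATIC_KEY_FALSE(device_unaddressable_key);

        /* in handle_pte_fault(), replacing the unconditional check */
        if (static_branch_unlikely(&device_unaddressable_key)) {
                struct page *page = pfn_to_page(pte_pfn(vmf->orig_pte));

                if (!is_addressable_page(page)) {
                        print_bad_pte(vmf->vma, vmf->address,
                                      vmf->orig_pte, page);
                        return VM_FAULT_SIGBUS;
                }
        }

That way the pfn_to_page is never executed on systems without such memory.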
> vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> spin_lock(vmf->ptl);
> - entry = vmf->orig_pte;
> if (unlikely(!pte_same(*vmf->pte, entry)))
> goto unlock;
> if (vmf->flags & FAULT_FLAG_WRITE) {
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 46960b3..4dcc003 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -152,7 +152,7 @@ void mem_hotplug_done(void)
> /* add this memory to iomem resource */
> static struct resource *register_memory_resource(u64 start, u64 size)
> {
> - struct resource *res;
> + struct resource *res, *conflict;
> res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> if (!res)
> return ERR_PTR(-ENOMEM);
> @@ -161,7 +161,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
> res->start = start;
> res->end = start + size - 1;
> res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> - if (request_resource(&iomem_resource, res) < 0) {
> + conflict = request_resource_conflict(&iomem_resource, res);
> + if (conflict) {
> + if (conflict->desc == IORES_DESC_UNADDRESSABLE_MEMORY) {
> + pr_debug("Device un-addressable memory block "
> + "memory hotplug at %#010llx !\n",
> + (unsigned long long)start);
> + }
> pr_debug("System RAM resource %pR cannot be added\n", res);
> kfree(res);
> return ERR_PTR(-EEXIST);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d5..50ac297 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -126,6 +126,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>
> pages++;
> }
> +
> + if (is_write_device_entry(entry)) {
> + pte_t newpte;
> +
> + make_device_entry_read(&entry);
> + newpte = swp_entry_to_pte(entry);
> + if (pte_swp_soft_dirty(oldpte))
> + newpte = pte_swp_mksoft_dirty(newpte);
> + set_pte_at(mm, addr, pte, newpte);
> +
> + pages++;
> + }
> }
> } while (pte++, addr += PAGE_SIZE, addr != end);
> arch_leave_lazy_mmu_mode();
Again, the soft dirty handling puzzles me.
Overall though, other than the pfn_to_page stuck into the page fault
path for a debugging check I don't have major objections.
On Thu, Mar 16, 2017 at 12:05:25PM -0400, Jérôme Glisse wrote:
> Allow migration without copy in case destination page already have
> source page content. This is usefull for new dma capable migration
> where use device dma engine to copy pages.
>
> This feature need carefull audit of filesystem code to make sure
> that no one can write to the source page while it is unmapped and
> locked. It should be safe for most filesystem but as precaution
> return error until support for device migration is added to them.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
I really dislike the amount of boilerplate code this creates and the fact
that additional headers are needed for that boilerplate. As it's only of
relevance to DMA-capable migration, why not simply infer from that whether a
copy is needed instead of updating every supporter of migration?
If that is unsuitable, create a new migrate_mode for a no-copy migration.
You'll need to alter some sites that check the migrate_mode, and it *may* be
easier to convert migrate_mode to a bitmask, but overall it would be less
boilerplate and confined to just the migration code.
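In other words, something like this (a sketch; MIGRATE_SYNC_NO_COPY is a
name I'm making up for illustration):

/* include/linux/migrate_mode.h */
enum migrate_mode {
        MIGRATE_ASYNC,
        MIGRATE_SYNC_LIGHT,
        MIGRATE_SYNC,
        MIGRATE_SYNC_NO_COPY,   /* data already copied by a DMA engine */
};

/* mm/migrate.c: migrate_page() keeps its current signature */
        if (mode == MIGRATE_SYNC_NO_COPY)
                migrate_page_states(newpage, page);
        else
                migrate_page_copy(newpage, page);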
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9a0897a..cb911ce 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -596,18 +596,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
> }
> }
>
> -/*
> - * Copy the page to its new location
> - */
> -void migrate_page_copy(struct page *newpage, struct page *page)
> +static void migrate_page_states(struct page *newpage, struct page *page)
> {
> int cpupid;
>
> - if (PageHuge(page) || PageTransHuge(page))
> - copy_huge_page(newpage, page);
> - else
> - copy_highpage(newpage, page);
> -
> if (PageError(page))
> SetPageError(newpage);
> if (PageReferenced(page))
> @@ -661,6 +653,19 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>
> mem_cgroup_migrate(page, newpage);
> }
> +
> +/*
> + * Copy the page to its new location
> + */
> +void migrate_page_copy(struct page *newpage, struct page *page)
> +{
> + if (PageHuge(page) || PageTransHuge(page))
> + copy_huge_page(newpage, page);
> + else
> + copy_highpage(newpage, page);
> +
> + migrate_page_states(newpage, page);
> +}
> EXPORT_SYMBOL(migrate_page_copy);
>
> /************************************************************
> @@ -674,8 +679,8 @@ EXPORT_SYMBOL(migrate_page_copy);
> * Pages are locked upon entry and exit.
> */
> int migrate_page(struct address_space *mapping,
> - struct page *newpage, struct page *page,
> - enum migrate_mode mode)
> + struct page *newpage, struct page *page,
> + enum migrate_mode mode, bool copy)
> {
> int rc;
>
> @@ -686,7 +691,11 @@ int migrate_page(struct address_space *mapping,
> if (rc != MIGRATEPAGE_SUCCESS)
> return rc;
>
> - migrate_page_copy(newpage, page);
> + if (copy)
> + migrate_page_copy(newpage, page);
> + else
> + migrate_page_states(newpage, page);
> +
> return MIGRATEPAGE_SUCCESS;
> }
> EXPORT_SYMBOL(migrate_page);
Other than some reshuffling, this is the place where the new copy parameter
is used, and it already has the mode parameter available. At worst you end
up creating a helper that checks two potential migrate modes to provide
either ASYNC, SYNC or SYNC_LIGHT semantics; I expect you want SYNC semantics.
This patch is huge relative to the small thing it actually requires.
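i.e. at worst a helper like this (sketch, assuming the MIGRATE_SYNC_NO_COPY
mode suggested above):

static inline bool migrate_mode_is_sync(enum migrate_mode mode)
{
        return mode == MIGRATE_SYNC || mode == MIGRATE_SYNC_NO_COPY;
}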
On Thu, Mar 16, 2017 at 12:05:29PM -0400, Jérôme Glisse wrote:
> This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
> hardware in the future.
>
Insert the usual boilerplate comment about an in-kernel user being ready here.
> +#if IS_ENABLED(CONFIG_HMM_MIRROR)
> <SNIP>
> + */
> +struct hmm_mirror_ops {
> + /* update() - update virtual address range of memory
> + *
> + * @mirror: pointer to struct hmm_mirror
> + * @update: update's type (turn read only, unmap, ...)
> + * @start: virtual start address of the range to update
> + * @end: virtual end address of the range to update
> + *
> + * This callback is call when the CPU page table is updated, the device
> + * driver must update device page table accordingly to update's action.
> + *
> + * Device driver callback must wait until the device has fully updated
> + * its view for the range. Note we plan to make this asynchronous in
> + * later patches, so that multiple devices can schedule update to their
> + * page tables, and once all device have schedule the update then we
> + * wait for them to propagate.
That sort of future TODO comment may be more appropriate in the changelog;
it doesn't help understand what update is for.
"update" is also unnecessarily vague. sync_cpu_device_pagetables may be
unnecessarily long, but at least it hints at what's going on. It's also not
entirely clear from the comment, but I assume this is called via an mmu
notifier; if so, that should be mentioned here.
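For example (names are only suggestions):

struct hmm_mirror_ops {
        /*
         * sync_cpu_device_pagetables() - propagate a CPU page table update
         * to the device page table
         *
         * Called from the mmu_notifier callbacks. Must not return until
         * the device view of the [start, end) range matches the CPU page
         * tables.
         */
        void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
                                           enum hmm_update update,
                                           unsigned long start,
                                           unsigned long end);
};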
> +int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
> +int hmm_mirror_register_locked(struct hmm_mirror *mirror,
> + struct mm_struct *mm);
Just a note to say this looks very dangerous and it's not clear why an
unlocked version should ever be used. It's the HMM mirror list that is being
protected, but to external callers that type is opaque, so how can they ever
safely use the unlocked version?
Recommend getting rid of it unless there are really compelling reasons why
an external user of this API should be able to poke at HMM internals.
> diff --git a/mm/Kconfig b/mm/Kconfig
> index fe8ad24..8ae7600 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -293,6 +293,21 @@ config HMM
> bool
> depends on MMU
>
> +config HMM_MIRROR
> + bool "HMM mirror CPU page table into a device page table"
> + select HMM
> + select MMU_NOTIFIER
> + help
> + HMM mirror is a set of helpers to mirror CPU page table into a device
> + page table. There is two side, first keep both page table synchronize
Wordsmithing: "there are two sides". There are a number of places in earlier
patches with minor typos that are worth searching for; this one really stuck
out though.
> /*
> * struct hmm - HMM per mm struct
> *
> * @mm: mm struct this HMM struct is bound to
> + * @lock: lock protecting mirrors list
> + * @mirrors: list of mirrors for this mm
> + * @wait_queue: wait queue
> + * @sequence: we track updates to the CPU page table with a sequence number
> + * @mmu_notifier: mmu notifier to track updates to CPU page table
> + * @notifier_count: number of currently active notifiers
> */
> struct hmm {
> struct mm_struct *mm;
> + spinlock_t lock;
> + struct list_head mirrors;
> + atomic_t sequence;
> + wait_queue_head_t wait_queue;
> + struct mmu_notifier mmu_notifier;
> + atomic_t notifier_count;
> };
Minor nit, but notifier_count might be better named nr_notifiers because it
was not clear at all what the difference between the sequence count and the
notifier count is.
Also run pahole on this struct. There is potential for holes around the
embedded structs, and having notifier_count and sequence on separate cache
lines just unnecessarily increases the cache footprint. They are generally
updated together, so you might as well take one cache miss to cover them
both.
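i.e. keep the two counters adjacent, roughly like this (layout sketch only,
using the nr_notifiers rename suggested above):

struct hmm {
        struct mm_struct        *mm;
        spinlock_t              lock;
        struct list_head        mirrors;
        /* both updated on every notifier, keep them on one cache line */
        atomic_t                nr_notifiers;
        atomic_t                sequence;
        wait_queue_head_t       wait_queue;
        struct mmu_notifier     mmu_notifier;
};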
>
> /*
> @@ -47,6 +60,12 @@ static struct hmm *hmm_register(struct mm_struct *mm)
> hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
> if (!hmm)
> return NULL;
> + init_waitqueue_head(&hmm->wait_queue);
> + atomic_set(&hmm->notifier_count, 0);
> + INIT_LIST_HEAD(&hmm->mirrors);
> + atomic_set(&hmm->sequence, 0);
> + hmm->mmu_notifier.ops = NULL;
> + spin_lock_init(&hmm->lock);
> hmm->mm = mm;
>
> spin_lock(&mm->page_table_lock);
> @@ -79,3 +98,170 @@ void hmm_mm_destroy(struct mm_struct *mm)
> spin_unlock(&mm->page_table_lock);
> kfree(hmm);
> }
> +
> +
> +#if IS_ENABLED(CONFIG_HMM_MIRROR)
> +static void hmm_invalidate_range(struct hmm *hmm,
> + enum hmm_update action,
> + unsigned long start,
> + unsigned long end)
> +{
> + struct hmm_mirror *mirror;
> +
> + /*
> + * Mirror being added or removed is a rare event so list traversal isn't
> + * protected by a lock, we rely on simple rules. All list modification
> + * are done using list_add_rcu() and list_del_rcu() under a spinlock to
> + * protect from concurrent addition or removal but not traversal.
> + *
> + * Because hmm_mirror_unregister() waits for all running invalidation to
> + * complete (and thus all list traversals to finish), none of the mirror
> + * structs can be freed from under us while traversing the list and thus
> + * it is safe to dereference their list pointer even if they were just
> + * removed.
> + */
> + list_for_each_entry (mirror, &hmm->mirrors, list)
> + mirror->ops->update(mirror, action, start, end);
> +}
Double check this very carefully because I believe it's wrong. If the update
side is protected by a spinlock then the traversal must be done using
list_for_each_entry_rcu with the rcu read lock held. I see no indication
from this context that the necessary protection is in place.
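i.e. if RCU semantics are intended, the traversal needs to be something like
(sketch; this assumes ->update() is safe to call under rcu_read_lock(),
otherwise SRCU or plain locking is needed):

        rcu_read_lock();
        list_for_each_entry_rcu(mirror, &hmm->mirrors, list)
                mirror->ops->update(mirror, action, start, end);
        rcu_read_unlock();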
> +
> +static void hmm_invalidate_page(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long addr)
> +{
> + unsigned long start = addr & PAGE_MASK;
> + unsigned long end = start + PAGE_SIZE;
> + struct hmm *hmm = mm->hmm;
> +
> + VM_BUG_ON(!hmm);
> +
> + atomic_inc(&hmm->notifier_count);
> + atomic_inc(&hmm->sequence);
> + hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
> + atomic_dec(&hmm->notifier_count);
> + wake_up(&hmm->wait_queue);
> +}
> +
The wait queue made me search further and I see it's only necessary for
unregister, but you end up with a bunch of atomic operations as a result of
it and a lot of wakeup checks that are almost never useful. That is a real
shame in itself, but I'm not convinced it's correct either.
It's not clear why both the notifier count and the sequence are needed,
considering that nothing special is done with the sequence value. The
comment says updates are tracked with it, but not why or what happens when
that sequence counter wraps. I strongly suspect it can be removed.
Most importantly, with the lack of RCU locking in the invalidate_range
function, it's not clear at all how this is safe. The unregister function
says it's there to keep the traversal safe, but it does the list
modification before the wait so nothing is really protected.
You either need to keep the locking simple here or get the RCU details
right. If it's important that unregister returns with no invalidation still
running against the mirror being unregistered, then that can be achieved
with a synchronize_rcu() after the list update, assuming the invalidations
take the rcu read lock properly.
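Roughly (sketch only; I'm assuming the mirror keeps a pointer back to its
hmm):

void hmm_mirror_unregister(struct hmm_mirror *mirror)
{
        struct hmm *hmm = mirror->hmm;

        spin_lock(&hmm->lock);
        list_del_rcu(&mirror->list);
        spin_unlock(&hmm->lock);

        /* wait for any invalidation still walking the mirror list */
        synchronize_rcu();
}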
On Thu, Mar 16, 2017 at 12:05:28PM -0400, Jérôme Glisse wrote:
> HMM provides 3 separate types of functionality:
> - Mirroring: synchronize CPU page table and device page table
> - Device memory: allocating struct page for device memory
> - Migration: migrating regular memory to device memory
>
> This patch introduces some common helpers and definitions to all of
> those 3 functionality.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Signed-off-by: Evgeny Baskakov <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> Signed-off-by: Mark Hairgrove <[email protected]>
> Signed-off-by: Sherry Cheung <[email protected]>
> Signed-off-by: Subhash Gutti <[email protected]>
> ---
> <SNIP>
> + * Mirroring:
> + *
> + * HMM provides helpers to mirror a process address space on a device. For this,
> + * it provides several helpers to order device page table updates with respect
> + * to CPU page table updates. The requirement is that for any given virtual
> + * address the CPU and device page table cannot point to different physical
> + * pages. It uses the mmu_notifier API behind the scenes.
> + *
> + * Device memory:
> + *
> + * HMM provides helpers to help leverage device memory. Device memory is, at any
> + * given time, either CPU-addressable like regular memory, or completely
> + * unaddressable. In both cases the device memory is associated with dedicated
> + * struct pages (which are allocated as if for hotplugged memory). Device memory
> + * management is under the responsibility of the device driver. HMM only
> + * allocates and initializes the struct pages associated with the device memory,
> + * by hotplugging a ZONE_DEVICE memory range.
> + *
> + * Allocating struct pages for device memory allows us to use device memory
> + * almost like regular CPU memory. Unlike regular memory, however, it cannot be
> + * added to the lru, nor can any memory allocation can use device memory
> + * directly. Device memory will only end up in use by a process if the device
> + * driver migrates some of the process memory from regular memory to device
> + * memory.
> + *
> + * Migration:
> + *
> + * The existing memory migration mechanism (mm/migrate.c) does not allow using
> + * anything other than the CPU to copy from source to destination memory.
> + * Moreover, existing code does not provide a way to migrate based on a virtual
> + * address range. Existing code only supports struct-page-based migration. Also,
> + * the migration flow does not allow for graceful failure at intermediate stages
> + * of the migration process.
> + *
> + * HMM solves all of the above, by providing a simple API:
> + *
> + * hmm_vma_migrate(ops, vma, src_pfns, dst_pfns, start, end, private);
> + *
> + * finalize_and_map(). The first, alloc_and_copy(), allocates the destination
Something is missing from that sentence; it doesn't parse as it is. The
previous helper was migrate_vma from two patches ago, so something has gone
sideways. I think you meant to explain the ops parameter here but it got
munged.
If so, it would be best to put the explanation of the API and ops with their
declarations and reference them from here. Someone looking to understand
migrate_vma() will not necessarily be inspired to check hmm.h for the
information.
> <SNIP>
Various helpers looked ok
> +/* Below are for HMM internal use only! Not to be used by device driver! */
> +void hmm_mm_destroy(struct mm_struct *mm);
> +
This will be ignored at least once :). Not that it matters much: assuming
the driver is a module, it won't resolve the symbol anyway.
> @@ -495,6 +496,10 @@ struct mm_struct {
> atomic_long_t hugetlb_usage;
> #endif
> struct work_struct async_put_work;
> +#if IS_ENABLED(CONFIG_HMM)
> + /* HMM need to track few things per mm */
> + struct hmm *hmm;
> +#endif
> };
>
Inevitable really but not too bad in comparison to putting pfn_to_page
unnecessarily in the fault path or updating every migration user.
> @@ -289,6 +289,10 @@ config MIGRATION
> config ARCH_ENABLE_HUGEPAGE_MIGRATION
> bool
>
> +config HMM
> + bool
> + depends on MMU
> +
> config PHYS_ADDR_T_64BIT
> def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
>
That's a bit sparse in terms of documentation and information for
distribution maintainers if they want to enable this.
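Even a short help text would do, e.g. (wording is only a suggestion):

config HMM
	bool
	depends on MMU
	help
	  Core Heterogeneous Memory Management infrastructure. This is not
	  user-selectable; it is selected by HMM_MIRROR or by device drivers
	  (GPUs, ...) that mirror a process address space or migrate process
	  memory to device memory.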
> <SNIP>
>
> +void hmm_mm_destroy(struct mm_struct *mm)
> +{
> + struct hmm *hmm;
> +
> + /*
> + * We should not need to lock here as no one should be able to register
> + * a new HMM while an mm is being destroy. But just to be safe ...
> + */
> + spin_lock(&mm->page_table_lock);
> + hmm = mm->hmm;
> + mm->hmm = NULL;
> + spin_unlock(&mm->page_table_lock);
> + kfree(hmm);
> +}
Eh?
I reacted very badly to the locking before I read the comment, to the extent
that I searched the patch again looking for the locking documentation that
said this was needed.
Ditch that lock. This is called from mmdrop context and the register handler
doesn't take the same lock anyway. It also expands the scope of what that
lock is for into an area where it does not belong.
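i.e. it can simply be (sketch):

void hmm_mm_destroy(struct mm_struct *mm)
{
        /* called from __mmdrop(), nothing else can reach this mm */
        kfree(mm->hmm);
        mm->hmm = NULL;
}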
Everything else looked fine.
On Thu, Mar 16, 2017 at 12:05:30PM -0400, Jérôme Glisse wrote:
> This does not use existing page table walker because we want to share
> same code for our page fault handler.
>
> Changes since v1:
> - Use spinlock instead of rcu synchronized list traversal
>
Didn't look too closely here because pagetable walking code tends to have
few surprises but presumably there will be some 5-level page table
updates.
On Thu, Mar 16, 2017 at 12:05:19PM -0400, Jérôme Glisse wrote:
> Cliff note: HMM offers 2 things (each standing on its own). First
> it allows to use device memory transparently inside any process
> without any modifications to process program code. Second it allows
> to mirror process address space on a device.
>
> Changes since v17:
> - typos
> - ZONE_DEVICE page refcount move put_zone_device_page()
>
> Work is still underway to use this feature inside the upstream
> nouveau driver. It has been tested with closed source driver
> and test are still underway on top of new kernel. So far we have
> found no issues. I expect to get a tested-by soon. Also this
> feature is not only useful for NVidia GPU, i expect AMD GPU will
> need it too if they want to support some of the new industry API.
> I also expect some FPGA company to use it and probably other
> hardware.
>
> That being said I don't expect i will ever get a review-by anyone
> for reasons beyond my control.
I spent the length of time a battery lasts reading the patches during my
flight to LSF/MM showing that you can get people to review anything if
you lock them in a metal box for a few hours.
I only got as far as patch 13 before running low on time, but decided to
send what I have anyway so you have the feedback before the LSF/MM topic.
The remaining patches are HMM specific and the intent was to review how much
the core mm is affected and how hard this would be to maintain. I was less
concerned with the HMM internals themselves, as I assume the authors writing
driver support can supply tested-by's.
Overall HMM is fairly well isolated. The drivers can cause new and
interesting damage through the MMU notifiers and fault handling, but that is
a driver, not a core, issue. There is new core code, but most of it is
active only if a driver is loaded, so most people won't notice. Fast paths
generally remain unaffected except for the one major case covered in the
review. I also didn't like the migrate_page API update and suggested an
alternative. Most of the other overhead is very minor. My expectation is
that most core code does not have to care about HMM, and while there is a
risk that a driver can cause damage through the notifiers, that is
completely the responsibility of the driver. Maybe some buglets exist in the
new core migration code but, again, most people won't notice unless a
suitable driver is loaded.
On that basis, if you address the major aspects of this review, I don't have
an objection at the moment to HMM being merged, unlike the objections I had
to the CDM preparation patches that modified zonelist handling, nodes and
the page allocator fast paths.
It still leaves the problem of no in-kernel user of the API. The catch-22
has now existed for years: driver support won't exist until it's merged, and
it won't get merged without drivers. I won't object strongly on that basis
any more, but others might. Maybe if this passes Andrew's review it could be
staged in mmotm until a driver or something like CDM is ready? That would at
least give a tree for driver authors to work against, with the reasonable
expectation that both HMM + driver would go in at the same time.