Sorry, I made a horrible mistake on names in v4; I completely
misunderstood the suggestion. So here I repost with proper naming.
This is the only change since v3. Again, sorry about the noise
with v4.
Changes since v4:
- s/DEVICE_HOST/DEVICE_PUBLIC
Git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
Cache coherent device memory applies to architectures with a system
bus like CAPI or CCIX. Devices connected to such a system bus can
expose their memory to the system and allow cache coherent access
to it from the CPU.
Even if for all intents and purposes device memory behaves like
regular memory, we still want to manage it in isolation from regular
memory, for several reasons. First and foremost, this memory is less
reliable than regular memory: if the device hangs because of invalid
commands, we can lose access to device memory. Second, CPU access to
this memory is expected to be slower than to regular memory. Third,
having random memory in the device means that some of the bus
bandwidth would not be available to the device but would be used by
CPU accesses. This is why we want to manage such memory in isolation
from regular memory. The kernel should not try to use this memory
even as a last resort when running out of memory, at least for now.
This patchset adds a new type of ZONE_DEVICE memory (DEVICE_PUBLIC)
that is used to represent CDM memory. This patchset builds on top of
the HMM patchset, which already introduces a new type of ZONE_DEVICE
memory for private device memory (see the HMM patchset).
The end result is that with this patchset, if a device is in use in
a process, you might have private anonymous memory or file backed
page memory using ZONE_DEVICE (DEVICE_PUBLIC). Thus care must be
taken to not overwrite the lru fields of such pages.
Hence all core mm changes are done to address the assumption that any
process memory is backed by a regular struct page that is part of
the lru. ZONE_DEVICE pages are not on the lru, and the lru pointers
of struct page are used to store device specific information.
Thus this patchset updates all code paths that would make assumptions
about the lruness of a process page.
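The conflict described above can be pictured with a small userspace sketch. None of these names are real kernel symbols (the kernel's struct page layout and helpers are far more involved); this only models the idea that ZONE_DEVICE pages reuse the storage normally holding lru list pointers for device data, so lru code must filter them out first:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: the lru links and device data
 * share the same storage, as with the real struct page. */
enum fake_mem_type { FAKE_NORMAL, FAKE_DEVICE_PUBLIC };

struct fake_page {
	enum fake_mem_type type;
	union {
		struct { struct fake_page *prev, *next; } lru; /* normal pages */
		void *device_data;  /* ZONE_DEVICE pages */
	};
};

/* A lru helper must refuse device pages, or it would corrupt the
 * driver's data stored in the same union. */
static bool fake_add_to_lru(struct fake_page *head, struct fake_page *page)
{
	if (page->type == FAKE_DEVICE_PUBLIC)
		return false; /* not lru managed */
	page->lru.next = head->lru.next;
	page->lru.prev = head;
	head->lru.next = page;
	return true;
}
```

This is the shape of the checks the series adds to the core mm paths: test the page type before any lru manipulation.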
patch 01 - rename DEVICE_PUBLIC to DEVICE_HOST to free the DEVICE_PUBLIC name
patch 02 - add DEVICE_PUBLIC type to ZONE_DEVICE (all core mm changes)
patch 03 - add a helper to HMM for hotplug of CDM memory
patch 04 - preparatory patch for memory controller changes (memcg)
patch 05 - update memory controller to properly handle
           ZONE_DEVICE pages when uncharging
patch 06 - documentation patch
Previous posting:
v1 https://lkml.org/lkml/2017/4/7/638
v2 https://lwn.net/Articles/725412/
v3 https://lwn.net/Articles/727114/
v4 https://lwn.net/Articles/727692/
Jérôme Glisse (6):
mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST
mm/device-public-memory: device memory cache coherent with CPU v4
mm/hmm: add new helper to hotplug CDM memory region v3
mm/memcontrol: allow to uncharge page without using page->lru field
mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC v3
mm/hmm: documents how device memory is accounted in rss and memcg
Documentation/vm/hmm.txt | 40 ++++++++
fs/proc/task_mmu.c | 2 +-
include/linux/hmm.h | 7 +-
include/linux/ioport.h | 1 +
include/linux/memremap.h | 25 ++++-
include/linux/mm.h | 20 ++--
kernel/memremap.c | 19 ++--
mm/Kconfig | 11 +++
mm/gup.c | 7 ++
mm/hmm.c | 89 ++++++++++++++++--
mm/madvise.c | 2 +-
mm/memcontrol.c | 231 ++++++++++++++++++++++++++++++-----------------
mm/memory.c | 46 +++++++++-
mm/migrate.c | 57 +++++++-----
mm/swap.c | 11 +++
15 files changed, 434 insertions(+), 134 deletions(-)
--
2.13.0
Platforms with an advanced system bus (like CAPI or CCIX) allow
device memory to be accessible from the CPU in a cache coherent
fashion. Add a new type of ZONE_DEVICE to represent such memory. The
use cases are the same as for un-addressable device memory, but
without all the corner cases.
Changed since v3:
- s/host/public (going back)
Changed since v2:
- s/public/host
- add proper include in migrate.c and drop useless #if/#endif
Changed since v1:
- Kconfig and #if/#else cleanup
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
---
fs/proc/task_mmu.c | 2 +-
include/linux/hmm.h | 4 ++--
include/linux/ioport.h | 1 +
include/linux/memremap.h | 21 ++++++++++++++++++
include/linux/mm.h | 20 ++++++++++-------
kernel/memremap.c | 15 ++++++++-----
mm/Kconfig | 11 ++++++++++
mm/gup.c | 7 ++++++
mm/hmm.c | 4 ++--
mm/madvise.c | 2 +-
mm/memory.c | 46 +++++++++++++++++++++++++++++++++-----
mm/migrate.c | 57 ++++++++++++++++++++++++++++++------------------
mm/swap.c | 11 ++++++++++
13 files changed, 156 insertions(+), 45 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 957b6ea80d5f..1f38f2c7cc34 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1182,7 +1182,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
if (pm->show_pfn)
frame = pte_pfn(pte);
flags |= PM_PRESENT;
- page = vm_normal_page(vma, addr, pte);
+ page = _vm_normal_page(vma, addr, pte, true);
if (pte_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
} else if (is_swap_pte(pte)) {
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 458d0d6d82f3..a40288309fd2 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
struct hmm_devmem;
struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
@@ -443,7 +443,7 @@ struct hmm_device {
*/
struct hmm_device *hmm_device_new(void *drvdata);
void hmm_device_put(struct hmm_device *hmm_device);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
/* Below are for HMM internal use only! Not to be used by device driver! */
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 3a4f69137bc2..f5cf32e80041 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -131,6 +131,7 @@ enum {
IORES_DESC_PERSISTENT_MEMORY = 4,
IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
IORES_DESC_DEVICE_PRIVATE_MEMORY = 6,
+ IORES_DESC_DEVICE_PUBLIC_MEMORY = 7,
};
/* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index ae5ff92f72b4..c7b4c75ae3f8 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -57,10 +57,18 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
*
* A more complete discussion of unaddressable memory may be found in
* include/linux/hmm.h and Documentation/vm/hmm.txt.
+ *
+ * MEMORY_DEVICE_PUBLIC:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CCIX). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
*/
enum memory_type {
MEMORY_DEVICE_HOST = 0,
MEMORY_DEVICE_PRIVATE,
+ MEMORY_DEVICE_PUBLIC,
};
/*
@@ -92,6 +100,8 @@ enum memory_type {
* The page_free() callback is called once the page refcount reaches 1
* (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
* This allows the device driver to implement its own memory management.)
+ *
+ * For MEMORY_DEVICE_PUBLIC only the page_free() callback matters.
*/
typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
unsigned long addr,
@@ -134,6 +144,12 @@ static inline bool is_device_private_page(const struct page *page)
return is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}
+
+static inline bool is_device_public_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
#else
static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
@@ -157,6 +173,11 @@ static inline bool is_device_private_page(const struct page *page)
{
return false;
}
+
+static inline bool is_device_public_page(const struct page *page)
+{
+ return false;
+}
#endif
/**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 330a216ac315..980354828177 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -796,15 +796,16 @@ static inline bool is_zone_device_page(const struct page *page)
}
#endif
-#ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page);
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+void put_zone_device_private_or_public_page(struct page *page);
#else
-static inline void put_zone_device_private_page(struct page *page)
+static inline void put_zone_device_private_or_public_page(struct page *page)
{
}
-#endif
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
static inline bool is_device_private_page(const struct page *page);
+static inline bool is_device_public_page(const struct page *page);
DECLARE_STATIC_KEY_FALSE(device_private_key);
@@ -830,8 +831,9 @@ static inline void put_page(struct page *page)
* include/linux/memremap.h and HMM for details.
*/
if (static_branch_unlikely(&device_private_key) &&
- unlikely(is_device_private_page(page))) {
- put_zone_device_private_page(page);
+ unlikely(is_device_private_page(page) ||
+ is_device_public_page(page))) {
+ put_zone_device_private_or_public_page(page);
return;
}
@@ -1220,8 +1222,10 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
};
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte);
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device);
+#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4e07525aa273..25c098151ed2 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -465,8 +465,8 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
#endif /* CONFIG_ZONE_DEVICE */
-#ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+void put_zone_device_private_or_public_page(struct page *page)
{
int count = page_ref_dec_return(page);
@@ -474,10 +474,15 @@ void put_zone_device_private_page(struct page *page)
* If refcount is 1 then page is freed and refcount is stable as nobody
* holds a reference on the page.
*/
- if (count == 1)
+ if (count == 1) {
+ /* Clear Active bit in case of parallel mark_page_accessed */
+ __ClearPageActive(page);
+ __ClearPageWaiters(page);
+
page->pgmap->page_free(page, page->pgmap->data);
+ }
else if (!count)
__put_page(page);
}
-EXPORT_SYMBOL(put_zone_device_private_page);
-#endif /* CONFIG_DEVICE_PRIVATE */
+EXPORT_SYMBOL(put_zone_device_private_or_public_page);
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
diff --git a/mm/Kconfig b/mm/Kconfig
index 5960617ef781..424ef60547f8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -716,12 +716,23 @@ config ZONE_DEVICE
config DEVICE_PRIVATE
bool "Unaddressable device memory (GPU memory, ...)"
depends on ARCH_HAS_HMM
+ select HMM
help
Allows creation of struct pages to represent unaddressable device
memory; i.e., memory that is only accessible from the device (or
group of devices).
+config DEVICE_PUBLIC
+ bool "Addressable device memory (like GPU memory)"
+ depends on ARCH_HAS_HMM
+ select HMM
+
+ help
+ Allows creation of struct pages to represent addressable device
+ memory; i.e., memory that is accessible from both the device and
+ the CPU
+
config FRAME_VECTOR
bool
diff --git a/mm/gup.c b/mm/gup.c
index 23f01c40c88f..2f8e8604ff80 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -438,6 +438,13 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
goto unmap;
*page = pte_page(*pte);
+
+ /*
+ * This should never happen (a device public page in the gate
+ * area).
+ */
+ if (is_device_public_page(*page))
+ goto unmap;
}
get_page(*page);
out:
diff --git a/mm/hmm.c b/mm/hmm.c
index 4e01c9ba9cc1..eadf70829c34 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -747,7 +747,7 @@ EXPORT_SYMBOL(hmm_vma_fault);
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
unsigned long addr)
{
@@ -1190,4 +1190,4 @@ static int __init hmm_init(void)
}
device_initcall(hmm_init);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..197277156ce3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -343,7 +343,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
continue;
}
- page = vm_normal_page(vma, addr, ptent);
+ page = _vm_normal_page(vma, addr, ptent, true);
if (!page)
continue;
diff --git a/mm/memory.c b/mm/memory.c
index 781935e83ff3..709d7d237234 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -789,8 +789,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
#else
# define HAVE_PTE_SPECIAL 0
#endif
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte)
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device)
{
unsigned long pfn = pte_pfn(pte);
@@ -801,8 +801,31 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
return vma->vm_ops->find_special_page(vma, addr);
if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
- if (!is_zero_pfn(pfn))
- print_bad_pte(vma, addr, pte, NULL);
+ if (is_zero_pfn(pfn))
+ return NULL;
+
+ /*
+ * Device public pages are special pages (they are ZONE_DEVICE
+ * pages but different from persistent memory). They behave
+ * almost like normal pages. The difference is that they are
+ * not on the lru and thus should never be involved with
+ * anything that involves lru manipulation (mlock, numa
+ * balancing, ...).
+ *
+ * This is why we still want to return NULL for such pages from
+ * vm_normal_page() so that we do not have to special case all
+ * call sites of vm_normal_page().
+ */
+ if (likely(pfn < highest_memmap_pfn)) {
+ struct page *page = pfn_to_page(pfn);
+
+ if (is_device_public_page(page)) {
+ if (with_public_device)
+ return page;
+ return NULL;
+ }
+ }
+ print_bad_pte(vma, addr, pte, NULL);
return NULL;
}
@@ -983,6 +1006,19 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
get_page(page);
page_dup_rmap(page, false);
rss[mm_counter(page)]++;
+ } else if (pte_devmap(pte)) {
+ page = pte_page(pte);
+
+ /*
+ * Cache coherent device memory behaves like a regular page and
+ * not like a persistent memory page. For more information see
+ * MEMORY_DEVICE_PUBLIC in include/linux/memremap.h
+ */
+ if (is_device_public_page(page)) {
+ get_page(page);
+ page_dup_rmap(page, false);
+ rss[mm_counter(page)]++;
+ }
}
out_set_pte:
@@ -1236,7 +1272,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_present(ptent)) {
struct page *page;
- page = vm_normal_page(vma, addr, ptent);
+ page = _vm_normal_page(vma, addr, ptent, true);
if (unlikely(details) && page) {
/*
* unmap_shared_mapping_pages() wants to
diff --git a/mm/migrate.c b/mm/migrate.c
index 643ea61ca9bb..fbf0b86deecd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
#include <linux/gfp.h>
+#include <linux/pfn_t.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
#include <linux/balloon_compaction.h>
@@ -229,12 +230,16 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
if (is_write_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
- if (unlikely(is_zone_device_page(new)) &&
- is_device_private_page(new)) {
- entry = make_device_private_entry(new, pte_write(pte));
- pte = swp_entry_to_pte(entry);
- if (pte_swp_soft_dirty(*pvmw.pte))
- pte = pte_mksoft_dirty(pte);
+ if (unlikely(is_zone_device_page(new))) {
+ if (is_device_private_page(new)) {
+ entry = make_device_private_entry(new, pte_write(pte));
+ pte = swp_entry_to_pte(entry);
+ if (pte_swp_soft_dirty(*pvmw.pte))
+ pte = pte_mksoft_dirty(pte);
+ } else if (is_device_public_page(new)) {
+ pte = pte_mkdevmap(pte);
+ flush_dcache_page(new);
+ }
} else
flush_dcache_page(new);
@@ -408,12 +413,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
void **pslot;
/*
- * ZONE_DEVICE pages have 1 refcount always held by their device
- *
- * Note that DAX memory will never reach that point as it does not have
- * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+ * Device public or private pages have an extra refcount as they are
+ * ZONE_DEVICE pages.
*/
- expected_count += is_zone_device_page(page);
+ expected_count += is_device_private_page(page);
+ expected_count += is_device_public_page(page);
if (!mapping) {
/* Anonymous page without mapping */
@@ -2087,7 +2091,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
#endif /* CONFIG_NUMA */
-
struct migrate_vma {
struct vm_area_struct *vma;
unsigned long *dst;
@@ -2186,7 +2189,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_write_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
- page = vm_normal_page(migrate->vma, addr, pte);
+ page = _vm_normal_page(migrate->vma, addr, pte, true);
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
@@ -2311,13 +2314,18 @@ static bool migrate_vma_check_page(struct page *page)
/* Page from ZONE_DEVICE have one extra reference */
if (is_zone_device_page(page)) {
- if (is_device_private_page(page)) {
+ if (is_device_private_page(page) ||
+ is_device_public_page(page))
extra++;
- } else
+ else
/* Other ZONE_DEVICE memory type are not supported */
return false;
}
+ /* For file backed pages */
+ if (page_mapping(page))
+ extra += 1 + page_has_private(page);
+
if ((page_count(page) - extra) > page_mapcount(page))
return false;
@@ -2541,11 +2549,18 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
*/
__SetPageUptodate(page);
- if (is_zone_device_page(page) && is_device_private_page(page)) {
- swp_entry_t swp_entry;
-
- swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
- entry = swp_entry_to_pte(swp_entry);
+ if (is_zone_device_page(page)) {
+ if (is_device_private_page(page)) {
+ swp_entry_t swp_entry;
+
+ swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+ entry = swp_entry_to_pte(swp_entry);
+ } else if (is_device_public_page(page)) {
+ entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = pte_mkdevmap(entry);
+ }
} else {
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
@@ -2631,7 +2646,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
continue;
}
- } else {
+ } else if (!is_device_public_page(newpage)) {
/*
* Other types of ZONE_DEVICE page are not
* supported.
diff --git a/mm/swap.c b/mm/swap.c
index 60b1d2a75852..eac0e35f854f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -765,6 +765,17 @@ void release_pages(struct page **pages, int nr, bool cold)
if (is_huge_zero_page(page))
continue;
+ /* A device public page cannot be a huge page */
+ if (is_device_public_page(page)) {
+ if (locked_pgdat) {
+ spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+ flags);
+ locked_pgdat = NULL;
+ }
+ put_zone_device_private_or_public_page(page);
+ continue;
+ }
+
page = compound_head(page);
if (!put_page_testzero(page))
continue;
--
2.13.0
HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus you cannot use the page->lru fields of those pages. This patch
re-arranges the uncharge code to allow a single page to be uncharged
without modifying the lru field of the struct page.
There is no change to the memcontrol logic; it is the same as it was
before this patch.
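The rework follows a gather-then-batch pattern: accumulate per-memcg counts while walking pages, flushing whenever the group changes, so a single page can reuse the same helpers without going through a list. The sketch below models only that control flow with invented names (an integer stands in for the memcg pointer); it is not the memcontrol.c code itself:

```c
#include <assert.h>
#include <string.h>

struct gather {
	int group;           /* stand-in for the memcg pointer   */
	unsigned long count; /* stand-in for nr_anon/nr_file/... */
};

static unsigned long flushed[8]; /* per-group totals after flushing */

static void flush_batch(const struct gather *g)
{
	flushed[g->group] += g->count;
}

static void gather_page(struct gather *g, int page_group)
{
	if (g->count && g->group != page_group) {
		flush_batch(g); /* group changed: flush what we have */
		memset(g, 0, sizeof(*g));
	}
	g->group = page_group;
	g->count++;
}

/* Walk a list of pages (here just their group ids) and uncharge
 * them in batches, one flush per run of same-group pages. */
static void uncharge_all(const int *groups, int n)
{
	struct gather g = { 0, 0 };
	for (int i = 0; i < n; i++)
		gather_page(&g, groups[i]);
	if (g.count)
		flush_batch(&g); /* final partial batch */
}
```

Uncharging a single page is then just gather + flush on a stack-local gather struct, with no page->lru involvement, which is what mem_cgroup_uncharge() does after this patch.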
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]
---
mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 92 insertions(+), 76 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3df3c04d73ab..c709fdceac13 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
cancel_charge(memcg, nr_pages);
}
-static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
- unsigned long nr_anon, unsigned long nr_file,
- unsigned long nr_kmem, unsigned long nr_huge,
- unsigned long nr_shmem, struct page *dummy_page)
+struct uncharge_gather {
+ struct mem_cgroup *memcg;
+ unsigned long pgpgout;
+ unsigned long nr_anon;
+ unsigned long nr_file;
+ unsigned long nr_kmem;
+ unsigned long nr_huge;
+ unsigned long nr_shmem;
+ struct page *dummy_page;
+};
+
+static inline void uncharge_gather_clear(struct uncharge_gather *ug)
{
- unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
+ memset(ug, 0, sizeof(*ug));
+}
+
+static void uncharge_batch(const struct uncharge_gather *ug)
+{
+ unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
unsigned long flags;
- if (!mem_cgroup_is_root(memcg)) {
- page_counter_uncharge(&memcg->memory, nr_pages);
+ if (!mem_cgroup_is_root(ug->memcg)) {
+ page_counter_uncharge(&ug->memcg->memory, nr_pages);
if (do_memsw_account())
- page_counter_uncharge(&memcg->memsw, nr_pages);
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
- page_counter_uncharge(&memcg->kmem, nr_kmem);
- memcg_oom_recover(memcg);
+ page_counter_uncharge(&ug->memcg->memsw, nr_pages);
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+ page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+ memcg_oom_recover(ug->memcg);
}
local_irq_save(flags);
- __this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
- __this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
- __this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
- __this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
- __this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
- __this_cpu_add(memcg->stat->nr_page_events, nr_pages);
- memcg_check_events(memcg, dummy_page);
+ __this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
+ __this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
+ __this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
+ __this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
+ __this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
+ __this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+ memcg_check_events(ug->memcg, ug->dummy_page);
local_irq_restore(flags);
- if (!mem_cgroup_is_root(memcg))
- css_put_many(&memcg->css, nr_pages);
+ if (!mem_cgroup_is_root(ug->memcg))
+ css_put_many(&ug->memcg->css, nr_pages);
+}
+
+static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+{
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+ VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
+
+ if (!page->mem_cgroup)
+ return;
+
+ /*
+ * Nobody should be changing or seriously looking at
+ * page->mem_cgroup at this point, we have fully
+ * exclusive access to the page.
+ */
+
+ if (ug->memcg != page->mem_cgroup) {
+ if (ug->memcg) {
+ uncharge_batch(ug);
+ uncharge_gather_clear(ug);
+ }
+ ug->memcg = page->mem_cgroup;
+ }
+
+ if (!PageKmemcg(page)) {
+ unsigned int nr_pages = 1;
+
+ if (PageTransHuge(page)) {
+ nr_pages <<= compound_order(page);
+ ug->nr_huge += nr_pages;
+ }
+ if (PageAnon(page))
+ ug->nr_anon += nr_pages;
+ else {
+ ug->nr_file += nr_pages;
+ if (PageSwapBacked(page))
+ ug->nr_shmem += nr_pages;
+ }
+ ug->pgpgout++;
+ } else {
+ ug->nr_kmem += 1 << compound_order(page);
+ __ClearPageKmemcg(page);
+ }
+
+ ug->dummy_page = page;
+ page->mem_cgroup = NULL;
}
static void uncharge_list(struct list_head *page_list)
{
- struct mem_cgroup *memcg = NULL;
- unsigned long nr_shmem = 0;
- unsigned long nr_anon = 0;
- unsigned long nr_file = 0;
- unsigned long nr_huge = 0;
- unsigned long nr_kmem = 0;
- unsigned long pgpgout = 0;
+ struct uncharge_gather ug;
struct list_head *next;
- struct page *page;
+
+ uncharge_gather_clear(&ug);
/*
* Note that the list can be a single page->lru; hence the
@@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
*/
next = page_list->next;
do {
+ struct page *page;
+
page = list_entry(next, struct page, lru);
next = page->lru.next;
- VM_BUG_ON_PAGE(PageLRU(page), page);
- VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
-
- if (!page->mem_cgroup)
- continue;
-
- /*
- * Nobody should be changing or seriously looking at
- * page->mem_cgroup at this point, we have fully
- * exclusive access to the page.
- */
-
- if (memcg != page->mem_cgroup) {
- if (memcg) {
- uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
- nr_kmem, nr_huge, nr_shmem, page);
- pgpgout = nr_anon = nr_file = nr_kmem = 0;
- nr_huge = nr_shmem = 0;
- }
- memcg = page->mem_cgroup;
- }
-
- if (!PageKmemcg(page)) {
- unsigned int nr_pages = 1;
-
- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- nr_huge += nr_pages;
- }
- if (PageAnon(page))
- nr_anon += nr_pages;
- else {
- nr_file += nr_pages;
- if (PageSwapBacked(page))
- nr_shmem += nr_pages;
- }
- pgpgout++;
- } else {
- nr_kmem += 1 << compound_order(page);
- __ClearPageKmemcg(page);
- }
-
- page->mem_cgroup = NULL;
+ uncharge_page(page, &ug);
} while (next != page_list);
- if (memcg)
- uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
- nr_kmem, nr_huge, nr_shmem, page);
+ if (ug.memcg)
+ uncharge_batch(&ug);
}
/**
@@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
*/
void mem_cgroup_uncharge(struct page *page)
{
+ struct uncharge_gather ug;
+
if (mem_cgroup_disabled())
return;
@@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
if (!page->mem_cgroup)
return;
- INIT_LIST_HEAD(&page->lru);
- uncharge_list(&page->lru);
+ uncharge_gather_clear(&ug);
+ uncharge_page(page, &ug);
+ uncharge_batch(&ug);
}
/**
--
2.13.0
For now we account device memory exactly like a regular page with
respect to rss counters and memory cgroups. We do this so that any
existing application that starts using device memory without knowing
about it will keep running unimpacted. This also simplifies the
migration code.
We will likely revisit this choice once we gain more experience with
how device memory is used and how it impacts overall memory resource
management. For now we believe this is a good enough choice.
Note that device memory cannot be pinned, neither by a device driver
nor by GUP, thus device memory can always be freed and unaccounted
when a process exits.
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Michal Hocko <[email protected]>
---
Documentation/vm/hmm.txt | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 192dcdb38bd1..4d3aac9f4a5d 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -15,6 +15,15 @@ section present the new migration helper that allow to leverage the device DMA
engine.
+1) Problems of using device specific memory allocator:
+2) System bus, device memory characteristics
+3) Share address space and migration
+4) Address space mirroring implementation and API
+5) Represent and manage device memory from core kernel point of view
+6) Migrate to and from device memory
+7) Memory cgroup (memcg) and rss accounting
+
+
-------------------------------------------------------------------------------
1) Problems of using device specific memory allocator:
@@ -342,3 +351,34 @@ that happens then the finalize_and_map() can catch any pages that was not
migrated. Note those page were still copied to new page and thus we wasted
bandwidth but this is considered as a rare event and a price that we are
willing to pay to keep all the code simpler.
+
+
+-------------------------------------------------------------------------------
+
+7) Memory cgroup (memcg) and rss accounting
+
+For now device memory is accounted as any regular page in rss counters (either
+anonymous if the device page is used for anonymous memory, file if the device
+page is used for a file backed page, or shmem if the device page is used for
+shared memory). This is a deliberate choice to keep existing applications that
+might start using device memory without knowing about it running unimpacted.
+
+The drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory, and thus not free much
+system memory. We want to gather more real world experience on how applications
+and the system react under memory pressure in the presence of device memory
+before deciding to account device memory differently.
+
+
+The same decision was made for memory cgroups. Device memory pages are
+accounted against the same memory cgroup a regular page would be accounted to.
+This does simplify migration to and from device memory. This also means that
+migration back from device memory to regular memory cannot fail because it
+would go above the memory cgroup limit. We might revisit this choice later
+once we get more experience in how device memory is used and its impact on
+memory resource control.
+
+
+Note that device memory can never be pinned, neither by a device driver nor
+through GUP, and thus such memory is always freed upon process exit, or when
+the last reference is dropped in the case of shared or file backed memory.
--
2.13.0
HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus need special handling when it comes to lru or refcount. This
patch makes sure that memcontrol properly handles those when it faces
them. Those pages are used like regular pages in a process address
space, either as anonymous pages or as file backed pages. So from the
memcg point of view we want to handle them like regular pages, for
now at least.
Changed since v2:
- s/host/public
Changed since v1:
- s/public/host
- add comments explaining how device memory behave and why
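The refcount handling this patch needs (page_ref_add_unless(page, 1, 1) instead of get_page_unless_zero()) comes from ZONE_DEVICE pages idling at refcount 1 when free: "take a reference only if the page is live" must increment unless the count is exactly 1. Below is a hedged userspace model of that primitive using C11 atomics; it is not the kernel's page_ref implementation, just the same compare-exchange shape:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy page with only a refcount; a free ZONE_DEVICE page sits at 1. */
struct fake_page { atomic_int refcount; };

/* Add nr to the refcount unless it currently equals 'unless';
 * analog of the kernel's page_ref_add_unless(page, nr, unless). */
static bool ref_add_unless(struct fake_page *p, int nr, int unless)
{
	int old = atomic_load(&p->refcount);
	do {
		if (old == unless)
			return false; /* page is free, do not resurrect it */
	} while (!atomic_compare_exchange_weak(&p->refcount, &old, old + nr));
	return true;
}
```

A normal page would use unless == 0 (get_page_unless_zero()); device private/public pages use unless == 1, which is exactly the check mc_handle_present_pte() and mc_handle_swap_pte() gain in this patch.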
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]
---
kernel/memremap.c | 2 ++
mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 25c098151ed2..4d74b4a4f8f5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
__ClearPageActive(page);
__ClearPageWaiters(page);
+ mem_cgroup_uncharge(page);
+
page->pgmap->page_free(page, page->pgmap->data);
}
else if (!count)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c709fdceac13..858842a741bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4391,12 +4391,13 @@ enum mc_target_type {
MC_TARGET_NONE = 0,
MC_TARGET_PAGE,
MC_TARGET_SWAP,
+ MC_TARGET_DEVICE,
};
static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t ptent)
{
- struct page *page = vm_normal_page(vma, addr, ptent);
+ struct page *page = _vm_normal_page(vma, addr, ptent, true);
if (!page || !page_mapped(page))
return NULL;
@@ -4407,13 +4408,20 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
if (!(mc.flags & MOVE_FILE))
return NULL;
}
- if (!get_page_unless_zero(page))
+ if (is_device_public_page(page)) {
+ /*
+ * MEMORY_DEVICE_PUBLIC means a ZONE_DEVICE page which has a
+ * refcount of 1 when free (unlike a normal page)
+ */
+ if (!page_ref_add_unless(page, 1, 1))
+ return NULL;
+ } else if (!get_page_unless_zero(page))
return NULL;
return page;
}
-#ifdef CONFIG_SWAP
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
pte_t ptent, swp_entry_t *entry)
{
@@ -4422,6 +4430,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
return NULL;
+
+ /*
+ * Handle MEMORY_DEVICE_PRIVATE pages, which are ZONE_DEVICE pages
+ * belonging to a device; because they are not accessible by the CPU
+ * they are stored as special swap entries in the CPU page table.
+ */
+ if (is_device_private_entry(ent)) {
+ page = device_private_entry_to_page(ent);
+ /*
+ * MEMORY_DEVICE_PRIVATE means a ZONE_DEVICE page which has
+ * a refcount of 1 when free (unlike a normal page)
+ */
+ if (!page_ref_add_unless(page, 1, 1))
+ return NULL;
+ return page;
+ }
+
/*
* Because lookup_swap_cache() updates some statistics counter,
* we call find_get_page() with swapper_space directly.
@@ -4582,6 +4607,13 @@ static int mem_cgroup_move_account(struct page *page,
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
* target for charge migration. if @target is not NULL, the entry is stored
* in target->ent.
+ * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PUBLIC
+ * or MEMORY_DEVICE_PRIVATE (so a ZONE_DEVICE page and thus not on the lru).
+ * For now such a page is charged like a regular page would be, as for all
+ * intents and purposes it is just special memory taking the place of a
+ * regular page. See Documentation/vm/hmm.txt and include/linux/hmm.h for
+ * more information on this type of memory, how it is used and why it is
+ * charged like this.
*
* Called with pte lock held.
*/
@@ -4610,6 +4642,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
*/
if (page->mem_cgroup == mc.from) {
ret = MC_TARGET_PAGE;
+ if (is_device_private_page(page) ||
+ is_device_public_page(page))
+ ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
}
@@ -4669,6 +4704,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
+ /*
+ * Note there can not be MC_TARGET_DEVICE for now, as we do not
+ * support transparent huge pages with MEMORY_DEVICE_PUBLIC or
+ * MEMORY_DEVICE_PRIVATE, but this might change.
+ */
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
spin_unlock(ptl);
@@ -4884,6 +4924,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
putback_lru_page(page);
}
put_page(page);
+ } else if (target_type == MC_TARGET_DEVICE) {
+ page = target.page;
+ if (!mem_cgroup_move_account(page, true,
+ mc.from, mc.to)) {
+ mc.precharge -= HPAGE_PMD_NR;
+ mc.moved_charge += HPAGE_PMD_NR;
+ }
+ put_page(page);
}
spin_unlock(ptl);
return 0;
@@ -4895,12 +4943,16 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
for (; addr != end; addr += PAGE_SIZE) {
pte_t ptent = *(pte++);
+ bool device = false;
swp_entry_t ent;
if (!mc.precharge)
break;
switch (get_mctgt_type(vma, addr, ptent, &target)) {
+ case MC_TARGET_DEVICE:
+ device = true;
+ /* fall through */
case MC_TARGET_PAGE:
page = target.page;
/*
@@ -4911,7 +4963,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
*/
if (PageTransCompound(page))
goto put;
- if (isolate_lru_page(page))
+ if (!device && isolate_lru_page(page))
goto put;
if (!mem_cgroup_move_account(page, false,
mc.from, mc.to)) {
@@ -4919,7 +4971,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
/* we uncharge from mc.from later. */
mc.moved_charge++;
}
- putback_lru_page(page);
+ if (!device)
+ putback_lru_page(page);
put: /* get_mctgt_type() gets the page */
put_page(page);
break;
--
2.13.0
Unlike unaddressable memory, coherent device memory has a real
resource associated with it on the system (as the CPU can address
it). Add a new helper to hotplug such memory within the HMM
framework.
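For driver authors, a hypothetical use of the new helper might look as follows. This is a non-compilable sketch for illustration only: the foo_* names and callbacks are assumptions, and only the hmm_devmem_add_resource() signature comes from this patch.

```c
/*
 * Sketch of a driver hotplugging cache coherent device memory (CDM)
 * through HMM. Assumes the driver already discovered a resource with
 * desc == IORES_DESC_DEVICE_PUBLIC_MEMORY and implements the
 * hmm_devmem_ops callbacks (foo_devmem_free/foo_devmem_fault) elsewhere.
 */
static const struct hmm_devmem_ops foo_devmem_ops = {
	.free	= foo_devmem_free,	/* hypothetical driver callback */
	.fault	= foo_devmem_fault,	/* hypothetical driver callback */
};

static int foo_probe_cdm(struct device *dev, struct resource *res)
{
	struct hmm_devmem *devmem;

	/* The helper rejects anything that is not DEVICE_PUBLIC memory. */
	devmem = hmm_devmem_add_resource(&foo_devmem_ops, dev, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/*
	 * struct pages now back devmem->pfn_first..devmem->pfn_last and
	 * their lifetime is tied to the device through devres.
	 */
	return 0;
}
```

Unlike hmm_devmem_add(), no new region is requested from the driver core here, which is why hmm_devmem_remove() skips devm_release_mem_region() for the CDM case.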
Changed since v2:
- s/host/public
Changed since v1:
- s/public/host
Signed-off-by: Jérôme Glisse <[email protected]>
Reviewed-by: Balbir Singh <[email protected]>
---
include/linux/hmm.h | 3 ++
mm/hmm.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 83 insertions(+), 5 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index a40288309fd2..e44cb8edb137 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -392,6 +392,9 @@ struct hmm_devmem {
struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
struct device *device,
unsigned long size);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+ struct device *device,
+ struct resource *res);
void hmm_devmem_remove(struct hmm_devmem *devmem);
/*
diff --git a/mm/hmm.c b/mm/hmm.c
index eadf70829c34..28e54e3b4e1d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -849,7 +849,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
zone = page_zone(page);
mem_hotplug_begin();
- __remove_pages(zone, start_pfn, npages);
+ if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
+ __remove_pages(zone, start_pfn, npages);
+ else
+ arch_remove_memory(start_pfn << PAGE_SHIFT,
+ npages << PAGE_SHIFT);
mem_hotplug_done();
hmm_devmem_radix_release(resource);
@@ -885,7 +889,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
if (is_ram == REGION_INTERSECTS)
return -ENXIO;
- devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+ if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
+ devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+ else
+ devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+
devmem->pagemap.res = devmem->resource;
devmem->pagemap.page_fault = hmm_devmem_fault;
devmem->pagemap.page_free = hmm_devmem_free;
@@ -924,8 +932,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
nid = numa_mem_id();
mem_hotplug_begin();
- ret = add_pages(nid, align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, false);
+ if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
+ ret = arch_add_memory(nid, align_start, align_size, false);
+ else
+ ret = add_pages(nid, align_start >> PAGE_SHIFT,
+ align_size >> PAGE_SHIFT, false);
if (ret) {
mem_hotplug_done();
goto error_add_memory;
@@ -1075,6 +1086,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
}
EXPORT_SYMBOL(hmm_devmem_add);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+ struct device *device,
+ struct resource *res)
+{
+ struct hmm_devmem *devmem;
+ int ret;
+
+ if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
+ return ERR_PTR(-EINVAL);
+
+ static_branch_enable(&device_private_key);
+
+ devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+ GFP_KERNEL, dev_to_node(device));
+ if (!devmem)
+ return ERR_PTR(-ENOMEM);
+
+ init_completion(&devmem->completion);
+ devmem->pfn_first = -1UL;
+ devmem->pfn_last = -1UL;
+ devmem->resource = res;
+ devmem->device = device;
+ devmem->ops = ops;
+
+ ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+ 0, GFP_KERNEL);
+ if (ret)
+ goto error_percpu_ref;
+
+ ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+ if (ret)
+ goto error_devm_add_action;
+
+
+ devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+ devmem->pfn_last = devmem->pfn_first +
+ (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+ ret = hmm_devmem_pages_create(devmem);
+ if (ret)
+ goto error_devm_add_action;
+
+ devres_add(device, devmem);
+
+ ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+ if (ret) {
+ hmm_devmem_remove(devmem);
+ return ERR_PTR(ret);
+ }
+
+ return devmem;
+
+error_devm_add_action:
+ hmm_devmem_ref_kill(&devmem->ref);
+ hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+ devres_free(devmem);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add_resource);
+
/*
* hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
*
@@ -1088,6 +1160,7 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
{
resource_size_t start, size;
struct device *device;
+ bool cdm = false;
if (!devmem)
return;
@@ -1096,11 +1169,13 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
start = devmem->resource->start;
size = resource_size(devmem->resource);
+ cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY;
hmm_devmem_ref_kill(&devmem->ref);
hmm_devmem_ref_exit(&devmem->ref);
hmm_devmem_pages_remove(devmem);
- devm_release_mem_region(device, start, size);
+ if (!cdm)
+ devm_release_mem_region(device, start, size);
}
EXPORT_SYMBOL(hmm_devmem_remove);
--
2.13.0
Existing users of ZONE_DEVICE in its DEVICE_PUBLIC variant are not tied
to a specific device and behave more like host memory. This patch renames
DEVICE_PUBLIC to DEVICE_HOST and frees the name DEVICE_PUBLIC to be used
for cache coherent device memory that has a strong tie with the device
on which the memory is (for instance on-board GPU memory).
There is no functional change here.
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
---
include/linux/memremap.h | 4 ++--
kernel/memremap.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 57546a07a558..ae5ff92f72b4 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
* Specialize ZONE_DEVICE memory into multiple types each having differents
* usage.
*
- * MEMORY_DEVICE_PUBLIC:
+ * MEMORY_DEVICE_HOST:
* Persistent device memory (pmem): struct page might be allocated in different
* memory and architecture might want to perform special actions. It is similar
* to regular memory, in that the CPU can access it transparently. However,
@@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
* include/linux/hmm.h and Documentation/vm/hmm.txt.
*/
enum memory_type {
- MEMORY_DEVICE_PUBLIC = 0,
+ MEMORY_DEVICE_HOST = 0,
MEMORY_DEVICE_PRIVATE,
};
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b9baa6c07918..4e07525aa273 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
}
pgmap->ref = ref;
pgmap->res = &page_map->res;
- pgmap->type = MEMORY_DEVICE_PUBLIC;
+ pgmap->type = MEMORY_DEVICE_HOST;
pgmap->page_fault = NULL;
pgmap->page_free = NULL;
pgmap->data = NULL;
--
2.13.0
On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> Platforms with an advanced system bus (like CAPI or CCIX) allow device
> memory to be accessible from the CPU in a cache coherent fashion. Add
> a new type of ZONE_DEVICE to represent such memory. The use cases
> are the same as for the un-addressable device memory but without
> all the corner cases.
>
> Changed since v3:
> - s/public/public (going back)
> Changed since v2:
> - s/public/public
> - add proper include in migrate.c and drop useless #if/#endif
> Changed since v1:
> - Kconfig and #if/#else cleanup
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Aneesh Kumar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> ---
Acked-by: Balbir Singh <[email protected]>
On Thu 13-07-17 17:15:32, Jérôme Glisse wrote:
> For now we account device memory exactly like a regular page with
> respect to rss counters and memory cgroups. We do this so that any
> existing application that starts using device memory without knowing
> about it will keep running unimpacted. This also simplifies the
> migration code.
>
> We will likely revisit this choice once we gain more experience with
> how device memory is used and how it impacts overall memory resource
> management. For now we believe this is a good enough choice.
>
> Note that device memory can not be pinned, neither by device driver
> nor by GUP, thus device memory can always be freed and unaccounted
> when a process exits.
I have to look at the implementation but this gives a good idea of what
is going on and why.
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> Documentation/vm/hmm.txt | 40 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
>
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> index 192dcdb38bd1..4d3aac9f4a5d 100644
> --- a/Documentation/vm/hmm.txt
> +++ b/Documentation/vm/hmm.txt
> @@ -15,6 +15,15 @@ section present the new migration helper that allow to leverage the device DMA
> engine.
>
>
> +1) Problems of using device specific memory allocator:
> +2) System bus, device memory characteristics
> +3) Share address space and migration
> +4) Address space mirroring implementation and API
> +5) Represent and manage device memory from core kernel point of view
> +6) Migrate to and from device memory
> +7) Memory cgroup (memcg) and rss accounting
> +
> +
> -------------------------------------------------------------------------------
>
> 1) Problems of using device specific memory allocator:
> @@ -342,3 +351,34 @@ that happens then the finalize_and_map() can catch any pages that was not
> migrated. Note those page were still copied to new page and thus we wasted
> bandwidth but this is considered as a rare event and a price that we are
> willing to pay to keep all the code simpler.
> +
> +
> +-------------------------------------------------------------------------------
> +
> +7) Memory cgroup (memcg) and rss accounting
> +
> +For now device memory is accounted as any regular page in rss counters (either
> +anonymous if the device page is used for anonymous memory, file if it is used
> +for a file-backed page, or shmem if it is used for shared memory). This is a
> +deliberate choice to keep existing applications, which might start using
> +device memory without knowing about it, running unimpacted.
> +
> +The drawback is that the OOM killer might kill an application using a lot of
> +device memory and not a lot of regular system memory, and thus not free much
> +system memory. We want to gather more real world experience on how applications
> +and the system react under memory pressure in the presence of device memory
> +before deciding to account device memory differently.
> +
> +
> +The same decision was made for memory cgroups. Device memory pages are
> +accounted against the same memory cgroup a regular page would be accounted to.
> +This does simplify migration to and from device memory. This also means that
> +migration back from device memory to regular memory can not fail because it
> +would go above the memory cgroup limit. We might revisit this choice later on
> +once we get more experience in how device memory is used and its impact on
> +memory resource control.
> +
> +
> +Note that device memory can never be pinned, neither by device driver nor
> +through GUP, and thus such memory is always freed upon process exit, or when
> +the last reference is dropped in the case of shared memory or file-backed memory.
> --
> 2.13.0
--
Michal Hocko
SUSE Labs
On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> Existing users of ZONE_DEVICE in its DEVICE_PUBLIC variant are not tied
> to a specific device and behave more like host memory. This patch renames
> DEVICE_PUBLIC to DEVICE_HOST and frees the name DEVICE_PUBLIC to be used
> for cache coherent device memory that has a strong tie with the device
> on which the memory is (for instance on-board GPU memory).
>
> There is no functional change here.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> ---
Acked-by: Balbir Singh <[email protected]>
On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE pages and
> thus you can not use the page->lru fields of those pages. This patch
> re-arranges the uncharge path to allow a single page to be uncharged
> without modifying the lru field of the struct page.
>
> There is no change to memcontrol logic, it is the same as it was
> before this patch.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: [email protected]
> ---
Acked-by: Balbir Singh <[email protected]>
On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE pages and
> thus need special handling when it comes to lru or refcount. This
> patch makes sure that memcontrol properly handles them when it faces
> them. Those pages are used like regular pages in a process address
> space, either as anonymous pages or as file-backed pages, so from the
> memcg point of view we want to handle them like regular pages, for
> now at least.
>
> Changed since v2:
> - s/host/public
> Changed since v1:
> - s/public/host
> - add comments explaining how device memory behave and why
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: [email protected]
> ---
Acked-by: Balbir Singh <[email protected]>
On 2017/7/14 5:15, Jérôme Glisse wrote:
> Sorry i made horrible mistake on names in v4, i completly miss-
> understood the suggestion. So here i repost with proper naming.
> This is the only change since v3. Again sorry about the noise
> with v4.
>
> Changes since v4:
> - s/DEVICE_HOST/DEVICE_PUBLIC
>
> Git tree:
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
>
>
> Cache coherent device memory apply to architecture with system bus
> like CAPI or CCIX. Device connected to such system bus can expose
> their memory to the system and allow cache coherent access to it
> from the CPU.
>
> Even if for all intent and purposes device memory behave like regular
> memory, we still want to manage it in isolation from regular memory.
> Several reasons for that, first and foremost this memory is less
> reliable than regular memory if the device hangs because of invalid
> commands we can loose access to device memory. Second CPU access to
> this memory is expected to be slower than to regular memory. Third
> having random memory into device means that some of the bus bandwith
> wouldn't be available to the device but would be use by CPU access.
>
> This is why we want to manage such memory in isolation from regular
> memory. Kernel should not try to use this memory even as last resort
> when running out of memory, at least for now.
>
I think setting a very large node distance for "Cache Coherent Device Memory" may be an easier way to address these concerns.
--
Regards,
Bob Liu
> This patchset add a new type of ZONE_DEVICE memory (DEVICE_HOST)
> that is use to represent CDM memory. This patchset build on top of
> the HMM patchset that already introduce a new type of ZONE_DEVICE
> memory for private device memory (see HMM patchset).
>
> The end result is that with this patchset if a device is in use in
> a process you might have private anonymous memory or file back
> page memory using ZONE_DEVICE (DEVICE_HOST). Thus care must be
> taken to not overwritte lru fields of such pages.
>
> Hence all core mm changes are done to address assumption that any
> process memory is back by a regular struct page that is part of
> the lru. ZONE_DEVICE page are not on the lru and the lru pointer
> of struct page are use to store device specific informations.
>
> Thus this patchset update all code path that would make assumptions
> about lruness of a process page.
>
> patch 01 - rename DEVICE_PUBLIC to DEVICE_HOST to free DEVICE_PUBLIC name
> patch 02 - add DEVICE_PUBLIC type to ZONE_DEVICE (all core mm changes)
> patch 03 - add an helper to HMM for hotplug of CDM memory
> patch 04 - preparatory patch for memory controller changes (memcg)
> patch 05 - update memory controller to properly handle
> ZONE_DEVICE pages when uncharging
> patch 06 - documentation patch
>
> Previous posting:
> v1 https://lkml.org/lkml/2017/4/7/638
> v2 https://lwn.net/Articles/725412/
> v3 https://lwn.net/Articles/727114/
> v4 https://lwn.net/Articles/727692/
>
> Jérôme Glisse (6):
> mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST
> mm/device-public-memory: device memory cache coherent with CPU v4
> mm/hmm: add new helper to hotplug CDM memory region v3
> mm/memcontrol: allow to uncharge page without using page->lru field
> mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
> v3
> mm/hmm: documents how device memory is accounted in rss and memcg
>
> Documentation/vm/hmm.txt | 40 ++++++++
> fs/proc/task_mmu.c | 2 +-
> include/linux/hmm.h | 7 +-
> include/linux/ioport.h | 1 +
> include/linux/memremap.h | 25 ++++-
> include/linux/mm.h | 20 ++--
> kernel/memremap.c | 19 ++--
> mm/Kconfig | 11 +++
> mm/gup.c | 7 ++
> mm/hmm.c | 89 ++++++++++++++++--
> mm/madvise.c | 2 +-
> mm/memcontrol.c | 231 ++++++++++++++++++++++++++++++-----------------
> mm/memory.c | 46 +++++++++-
> mm/migrate.c | 57 +++++++-----
> mm/swap.c | 11 +++
> 15 files changed, 434 insertions(+), 134 deletions(-)
>
On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> On 2017/7/14 5:15, Jérôme Glisse wrote:
> > [...]
>
> I think setting a very large node distance for "Cache Coherent Device Memory"
> may be an easier way to address these concerns.
Such an approach was discussed at length in the past, see links below. Outcome
of the discussion:
- CPU-less nodes are bad
- device memory can be unreliable (device hang) with no way for the
application to understand that
- application and driver NUMA madvise/mbind/mempolicy ... can conflict
with each other and there is no way the kernel can figure out which
should apply
- NUMA as it is now would not work, as we need further isolation beyond
what a large node distance would provide
Probably a few other arguments i forget.
https://lists.gt.net/linux/kernel/2551369
https://groups.google.com/forum/#!topic/linux.kernel/Za_e8C3XnRs%5B1-25%5D
https://lwn.net/Articles/720380/
Cheers,
Jérôme
On 2017/7/18 23:38, Jerome Glisse wrote:
> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>> [...]
>>
>> I think setting a very large node distance for "Cache Coherent Device Memory"
>> may be an easier way to address these concerns.
>
> Such an approach was discussed at length in the past, see links below. Outcome
> of the discussion:
> - CPU-less nodes are bad
> - device memory can be unreliable (device hang) with no way for the
> application to understand that
Device memory can also be more reliable if using high quality and expensive memory.
> - application and driver NUMA madvise/mbind/mempolicy ... can conflict
> with each other and there is no way the kernel can figure out which
> should apply
> - NUMA as it is now would not work, as we need further isolation beyond
> what a large node distance would provide
>
Agree, that's where we need to spend time.
One drawback of HMM-CDM I'm worried about is one more extra copy.
In the cache coherent case, the CPU can write data to device memory directly and then start FPGA/GPU/other accelerators.
Thanks,
Bob Liu
On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> On 2017/7/18 23:38, Jerome Glisse wrote:
> > On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> On 2017/7/14 5:15, Jérôme Glisse wrote:
> >>> [...]
> >>
> >> I think setting a very large node distance for "Cache Coherent Device Memory"
> >> may be an easier way to address these concerns.
> >
> > Such an approach was discussed at length in the past, see links below. Outcome
> > of the discussion:
> > - CPU-less nodes are bad
> > - device memory can be unreliable (device hang) with no way for the
> > application to understand that
>
> Device memory can also be more reliable if using high quality and expensive memory.
Even ECC memory does not compensate for device hang. When your GPU locks up
you might need to re-init the GPU from scratch, after which the content of the
device memory is unreliable. During init the device memory might not get a
proper clock or proper refresh cycles and thus is susceptible to corruption.
>
> > - application and driver NUMA madvise/mbind/mempolicy ... can conflict
> > with each other and there is no way the kernel can figure out which
> > should apply
> > - NUMA as it is now would not work, as we need further isolation beyond
> > what a large node distance would provide
> >
>
> Agree, that's where we need to spend time.
>
> One drawback of HMM-CDM I'm worried about is one more extra copy.
> In the cache coherent case, the CPU can write data to device memory
> directly and then start FPGA/GPU/other accelerators.
There is not necessarily an extra copy. The device driver can pre-allocate
a virtual address range of a process with device memory. A device page fault
can directly allocate device memory. Once allocated, CPU access will use
the device memory.
There is a plan to allow other allocations (CPU page fault, file cache, ...)
to also use device memory directly. We just don't know what kind of
userspace API will fit best for that, so at first it might be hidden behind
a device driver specific ioctl.
Jérôme
On 2017/7/19 10:25, Jerome Glisse wrote:
> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>>>> [...]
>>>>
>>>> I think setting a very large node distance for "Cache Coherent Device Memory"
>>>> may be an easier way to address these concerns.
>>>
>>> Such an approach was discussed at length in the past, see links below. Outcome
>>> of the discussion:
>>> - CPU-less nodes are bad
>>> - device memory can be unreliable (device hang) with no way for the
>>> application to understand that
>>
>> Device memory can also be more reliable if using high quality and expensive memory.
>
> Even ECC memory does not compensate for device hang. When your GPU locks up
> you might need to re-init the GPU from scratch, after which the content of the
> device memory is unreliable. During init the device memory might not get a
> proper clock or proper refresh cycles and thus is susceptible to corruption.
>
>>
>>> - application and driver NUMA madvise/mbind/mempolicy ... can conflict
>>> with each other, and there is no way the kernel can figure out which
>>> should apply
>>> - NUMA as it is now would not work, as we need further isolation than
>>> what a large node distance would provide
>>>
>>
>> Agree, that's where we need to spend time.
>>
>> One drawback of HMM-CDM I'm worried about is an extra copy.
>> In the cache coherent case, the CPU can write data to device memory
>> directly, then start FPGA/GPU/other accelerators.
>
> There is not necessarily an extra copy. The device driver can pre-allocate
> a virtual address range of a process with device memory. A device page fault
Okay, I get your point.
But the typical use case is: the CPU allocates memory and prepares/writes data, then launches a GPU CUDA kernel.
How to control whether the allocation goes to device memory, e.g. HBM, or system DDR at the beginning, without explicit user advice?
If it goes to DDR by default, there is an extra copy. If it goes to HBM by default, the HBM may be wasted.
> can directly allocate device memory. Once allocated, CPU access will use
> the device memory.
>
Then it's more like replacing the NUMA node solution (CDM) with ZONE_DEVICE (type MEMORY_DEVICE_PUBLIC).
But the problem is the same, e.g. how to make sure the device memory, say HBM, won't be occupied by normal CPU allocations.
Things get more complex if there are multiple GPUs connected by nvlink (also cache coherent) in a system, each GPU with its own HBM.
How to decide whether to allocate physical memory from local HBM/DDR or remote HBM/DDR?
If using the NUMA (CDM) approach, there are at least the NUMA mempolicy and autonuma mechanisms.
Thanks,
Bob
On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> On 2017/7/19 10:25, Jerome Glisse wrote:
> > On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
> [...]
>
> >> One drawback of HMM-CDM I'm worried about is an extra copy.
> >> In the cache coherent case, the CPU can write data to device memory
> >> directly, then start FPGA/GPU/other accelerators.
> >
> > There is not necessarily an extra copy. The device driver can pre-allocate
> > a virtual address range of a process with device memory. A device page fault
>
> Okay, I get your point. But the typical use case is: the CPU allocates
> memory and prepares/writes data, then launches a GPU CUDA kernel.
I don't think we should make too many assumptions about what the typical
case is. GPU compute is fast evolving and there are new domains where it is
applied. For instance, some folks use it to process network streams, and the
network adapter writes directly into GPU memory, so there is never a CPU copy
of the data. So I would rather not make any restrictive assumptions about how
it will be used.
> How to control whether the allocation goes to device memory, e.g. HBM, or
> system DDR at the beginning, without explicit user advice? If it goes to
> DDR by default, there is an extra copy. If it goes to HBM by default, the
> HBM may be wasted.
Yes, it is a hard problem to solve. We are working with NVidia and IBM
on this and there are several paths. But as a first solution we will rely
on hints/directives given by the userspace program through existing GPGPU
APIs like CUDA or OpenCL. There are also plans to have hardware monitor bus
traffic to gather statistics and do automatic memory placement from those.
> > can directly allocate device memory. Once allocated CPU access will use
> > the device memory.
> >
>
> Then it's more like replacing the NUMA node solution (CDM) with ZONE_DEVICE
> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g. how to make
> sure the device memory, say HBM, won't be occupied by normal CPU
> allocations. Things get more complex if there are multiple GPUs connected
> by nvlink (also cache coherent) in a system, each GPU with its own HBM.
>
> How to decide whether to allocate physical memory from local HBM/DDR or
> remote HBM/DDR?
>
> If using the NUMA (CDM) approach, there are at least the NUMA mempolicy
> and autonuma mechanisms.
NUMA is not as easy as you think. First, like I said, we want the device
memory to be isolated from most existing mm mechanisms, because the memory
is unreliable and also because the device might need to be able to evict
memory to make contiguous physical memory allocations for graphics.
Second, device drivers are not integrated closely enough with the mm and
scheduler kernel code to efficiently plug device access notifications into
struct page (i.e. to update struct page so that the NUMA worker thread can
migrate memory based on accurate information).
Third, it can be hard to decide who wins between CPU and device access
when it comes to updating things like the last CPU id.
Fourth, there is no such thing as a device id, i.e. an equivalent of the
CPU id. If we were to add one, the CPU id field in the flags of struct page
would not be big enough, so this could have repercussions on the size of
struct page. This is not an easy sell.
There are other issues I can't think of right now. I think for now it
is easier and better to take the HMM-CDM approach, and later down the
road, once we have more existing users, to start thinking about NUMA or
a NUMA-like solution.
Bottom line is we spent time thinking about this, and yes, NUMA makes
sense from a conceptual point of view, but there are too many things we
do not know to feel confident that we can make something good with NUMA
as it is.
Cheers,
Jérôme
On 2017/7/20 23:03, Jerome Glisse wrote:
> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
[...]
>
> NUMA is not as easy as you think. First, like I said, we want the device
> memory to be isolated from most existing mm mechanisms, because the memory
> is unreliable and also because the device might need to be able to evict
> memory to make contiguous physical memory allocations for graphics.
>
Right, but we need isolation anyway.
For HMM-CDM, the isolation is not adding device memory to the lru list, plus many checks like
if (is_device_public_page(page)) ...
But how to evict device memory?
> [...]
>
> There are other issues I can't think of right now. I think for now it
My opinion is that most of the issues are the same no matter whether we use CDM or HMM-CDM.
I just care about a more complete solution, whether it's CDM, HMM-CDM, or some other way.
HMM and HMM-CDM depend on device drivers, but we haven't seen a public/full driver that
demonstrates the whole solution working fine.
Cheers,
Bob
> is easier and better to take the HMM-CDM approach, and later down the
> road, once we have more existing users, to start thinking about NUMA or
> a NUMA-like solution.
>
> Bottom line is we spent time thinking about this, and yes, NUMA makes
> sense from a conceptual point of view, but there are too many things we
> do not know to feel confident that we can make something good with NUMA
> as it is.
On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> On 2017/7/20 23:03, Jerome Glisse wrote:
> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
[...]
> >
> > NUMA is not as easy as you think. First, like I said, we want the device
> > memory to be isolated from most existing mm mechanisms, because the memory
> > is unreliable and also because the device might need to be able to evict
> > memory to make contiguous physical memory allocations for graphics.
> >
>
> Right, but we need isolation anyway.
> For HMM-CDM, the isolation is not adding device memory to the lru list,
> plus many checks like
> if (is_device_public_page(page)) ...
>
> But how to evict device memory?
What do you mean by evict? The device driver can evict whenever it sees the
need to do so. A CPU page fault will evict too. Process exit or munmap() will
free the device memory.
Are you referring to evicting in the sense of memory reclaim under pressure?
The way it flows under memory pressure is that if the device driver wants to
make room, it can evict stuff to system memory, and if there is not enough
system memory then things get reclaimed as usual before the device driver can
make progress on device memory reclaim.
> > [...]
> >
> > There are other issues I can't think of right now. I think for now it
>
> My opinion is that most of the issues are the same no matter whether we use
> CDM or HMM-CDM. I just care about a more complete solution, whether it's
> CDM, HMM-CDM, or some other way. HMM and HMM-CDM depend on device drivers,
> but we haven't seen a public/full driver that demonstrates the whole
> solution working fine.
I am working with NVidia's closed source driver team to make sure that it
works well for them. I am also working on the nouveau open source driver for
the same NVidia hardware, though it will be of less use, as what is missing
there is a solid open source userspace to leverage this. Nonetheless, open
source drivers are in the works.
The way I see it is to start with HMM-CDM, which isolates most of the changes
in the hmm code. Once we get more experience with real workloads, and not
just device driver test suites, then we can start revisiting NUMA and deeper
integration with the linux kernel. I would rather grow organically toward
that than try to design something that would make major changes all over the
kernel without knowing for sure that we are going in the right direction. I
hope that this makes sense to others too.
Cheers,
Jérôme
On 2017/7/21 9:41, Jerome Glisse wrote:
> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>
> [...]
>
> What do you mean by evict? The device driver can evict whenever it sees the
> need to do so. A CPU page fault will evict too. Process exit or munmap()
> will free the device memory.
>
> Are you referring to evicting in the sense of memory reclaim under pressure?
>
> The way it flows under memory pressure is that if the device driver wants to
> make room, it can evict stuff to system memory, and if there is not enough
Yes, I mean this.
So every driver has to maintain its own LRU-like list instead of reusing what is already in the linux kernel.
> system memory then things get reclaimed as usual before the device driver
> can make progress on device memory reclaim.
>
>
>>> [...]
>>
>> My opinion is that most of the issues are the same no matter whether we use
>> CDM or HMM-CDM. I just care about a more complete solution, whether it's
>> CDM, HMM-CDM, or some other way. HMM and HMM-CDM depend on device drivers,
>> but we haven't seen a public/full driver that demonstrates the whole
>> solution working fine.
>
> I am working with NVidia's closed source driver team to make sure that it
> works well for them. I am also working on the nouveau open source driver
> for the same NVidia hardware, though it will be of less use, as what is
> missing there is a solid open source userspace to leverage this.
> Nonetheless, open source drivers are in the works.
>
Looking forward to seeing these drivers made public.
> The way I see it is to start with HMM-CDM, which isolates most of the
> changes in the hmm code. Once we get more experience with real workloads,
> and not just device driver test suites, then we can start revisiting NUMA
> and deeper integration with the linux kernel. I would rather grow
> organically toward that than try to design something that would make major
> changes all over the kernel without knowing for sure that we are going in
> the right direction. I hope that this makes sense to others too.
>
Makes sense.
Thanks,
Bob Liu
On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> On 2017/7/20 23:03, Jerome Glisse wrote:
>> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>
> [...]
>
>>
>> My opinion is that most of the issues are the same no matter whether we use
>> CDM or HMM-CDM. I just care about a more complete solution, whether it's
>> CDM, HMM-CDM, or some other way. HMM and HMM-CDM depend on device drivers,
>> but we haven't seen a public/full driver that demonstrates the whole
>> solution working fine.
>
> I am working with NVidia's closed source driver team to make sure that it
> works well for them. I am also working on the nouveau open source driver
> for the same NVidia hardware, though it will be of less use, as what is
> missing there is a solid open source userspace to leverage this.
> Nonetheless, open source drivers are in the works.
Can you point to the nouveau patches? I still find these HMM patches
unreviewable without an upstream consumer.
On Fri, Jul 21, 2017 at 10:10 AM, Bob Liu <[email protected]> wrote:
> On 2017/7/21 9:41, Jerome Glisse wrote:
>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>
>> [...]
>>
>>> Right, but we need isolation anyway.
>>> For HMM-CDM, the isolation is not adding device memory to the lru list,
>>> plus many checks like
>>> if (is_device_public_page(page)) ...
>>>
>>> But how to evict device memory?
>>
>> What do you mean by evict? The device driver can evict whenever it sees
>> the need to do so. A CPU page fault will evict too. Process exit or
>> munmap() will free the device memory.
>>
>> Are you referring to evicting in the sense of memory reclaim under
>> pressure?
>>
>> The way it flows under memory pressure is that if the device driver wants
>> to make room, it can evict stuff to system memory, and if there is not enough
>
> Yes, I mean this.
> So every driver has to maintain its own LRU-like list instead of reusing
> what is already in the linux kernel.
>
And how can HMM-CDM handle multiple devices, or a device with multiple
device memories (maybe with different properties as well)?
This kind of hardware platform will be very common once CCIX is out.
Thanks,
Bob Liu
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Regards,
--Bob
On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
> >
> > [...]
> >
> >> >> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
> >> >> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
> >> >> sure the device memory say HBM won't be occupied by normal CPU allocation.
> >> >> Things will be more complex if there are multi GPU connected by nvlink
> >> >> (also cache coherent) in a system, each GPU has their own HBM.
> >> >>
> >> >> How to decide allocate physical memory from local HBM/DDR or remote HBM/
> >> >> DDR?
> >> >>
> >> >> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
> >> >> at least.
> >> >
> >> > NUMA is not as easy as you think. First like i said we want the device
> >> > memory to be isolated from most existing mm mechanism. Because memory
> >> > is unreliable and also because device might need to be able to evict
> >> > memory to make contiguous physical memory allocation for graphics.
> >> >
> >>
> >> Right, but we need isolation any way.
> >> For hmm-cdm, the isolation is not adding device memory to lru list, and many
> >> if (is_device_public_page(page)) ...
> >>
> >> But how to evict device memory?
> >
> > What you mean by evict ? Device driver can evict whenever they see the need
> > to do so. CPU page fault will evict too. Process exit or munmap() will free
> > the device memory.
> >
> > Are you refering to evict in the sense of memory reclaim under pressure ?
> >
> > So the way it flows for memory pressure is that if device driver want to
> > make room it can evict stuff to system memory and if there is not enough
> > system memory than thing get reclaim as usual before device driver can
> > make progress on device memory reclaim.
> >
> >
> >> > Second device driver are not integrated that closely within mm and the
> >> > scheduler kernel code to allow to efficiently plug in device access
> >> > notification to page (ie to update struct page so that numa worker
> >> > thread can migrate memory base on accurate informations).
> >> >
> >> > Third it can be hard to decide who win between CPU and device access
> >> > when it comes to updating thing like last CPU id.
> >> >
> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> > If we were to add something the CPU id field in flags of struct page
> >> > would not be big enough so this can have repercusion on struct page
> >> > size. This is not an easy sell.
> >> >
> >> > They are other issues i can't think of right now. I think for now it
> >>
> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> demonstrate the whole solution works fine.
> >
> > I am working with NVidia close source driver team to make sure that it works
> > well for them. I am also working on nouveau open source driver for same NVidia
> > hardware thought it will be of less use as what is missing there is a solid
> > open source userspace to leverage this. Nonetheless open source driver are in
> > the work.
>
> Can you point to the nouveau patches? I still find these HMM patches
> un-reviewable without an upstream consumer.
I am still working on those; I hope I will be able to post them in 3 weeks or so.
Cheers,
Jérôme
On Fri, Jul 21, 2017 at 08:01:07PM +0800, Bob Liu wrote:
> On Fri, Jul 21, 2017 at 10:10 AM, Bob Liu <[email protected]> wrote:
> > On 2017/7/21 9:41, Jerome Glisse wrote:
> >> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >>> On 2017/7/20 23:03, Jerome Glisse wrote:
> >>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
> >>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>>>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
> >>
> >> [...]
> >>
> >>>>> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
> >>>>> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
> >>>>> sure the device memory say HBM won't be occupied by normal CPU allocation.
> >>>>> Things will be more complex if there are multi GPU connected by nvlink
> >>>>> (also cache coherent) in a system, each GPU has their own HBM.
> >>>>>
> >>>>> How to decide allocate physical memory from local HBM/DDR or remote HBM/
> >>>>> DDR?
> >>>>>
> >>>>> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
> >>>>> at least.
> >>>>
> >>>> NUMA is not as easy as you think. First like i said we want the device
> >>>> memory to be isolated from most existing mm mechanism. Because memory
> >>>> is unreliable and also because device might need to be able to evict
> >>>> memory to make contiguous physical memory allocation for graphics.
> >>>>
> >>>
> >>> Right, but we need isolation any way.
> >>> For hmm-cdm, the isolation is not adding device memory to lru list, and many
> >>> if (is_device_public_page(page)) ...
> >>>
> >>> But how to evict device memory?
> >>
> >> What you mean by evict ? Device driver can evict whenever they see the need
> >> to do so. CPU page fault will evict too. Process exit or munmap() will free
> >> the device memory.
> >>
> >> Are you refering to evict in the sense of memory reclaim under pressure ?
> >>
> >> So the way it flows for memory pressure is that if device driver want to
> >> make room it can evict stuff to system memory and if there is not enough
> >
> > Yes, I mean this.
> > So every driver have to maintain their own LRU-similar list instead of
> > reuse what already in linux kernel.
Regarding LRU, it is again not that easy. First, we do not necessarily have access
information for the device page table the way we do with the CPU page table. Second,
the mmu_notifier callback on a per-page basis is costly. Finally, devices are used
differently than CPUs: usually you schedule a job, and once that job is done you can
safely evict the memory it was using. Existing device drivers already have quite
large memory management code of their own because of that different usage model.
LRU might make sense at some point, but so far I doubt it is the right solution
for device memory.
>
> And how HMM-CDM can handle multiple devices or device with multiple
> device memories(may with different properties also)?
> This kind of hardware platform would be very common when CCIX is out soon.
A) Multiple devices are under the control of the device driver. Multiple devices
linked to each other through a dedicated link can themselves have a complex topology,
and remote access between devices is highly tied to the device (how to program the
device mmu and device registers) and thus to the device driver.
If we identify common design patterns between different hardware then we might
start thinking about factoring out some common code to help those cases.
B) Multiple different devices is a harder problem. Each device provides its own
userspace API, and it is through that API that you will get memory placement
advice. If several devices fight for placement of the same chunk of memory, one
can argue that the application is broken or the device is broken. But for now we
assume that devices and applications will behave.
Rate limiting migration is hard: you need to keep migration statistics, and that
needs memory. So unless we really need to do that I would rather avoid it. Again
this is a thing for which we will have to wait and see how things pan out.
Maybe I should stress that HMM is a set of helpers for device memory; it is not
intended to be a policy maker or to manage device memory. The intention is that
device drivers will keep managing device memory as they already do today.
A deeper integration with process memory management is probably bound to happen,
but for now it is just about having a toolbox for device drivers.
Jérôme
On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
[...]
> >> > Second device driver are not integrated that closely within mm and the
> >> > scheduler kernel code to allow to efficiently plug in device access
> >> > notification to page (ie to update struct page so that numa worker
> >> > thread can migrate memory base on accurate informations).
> >> >
> >> > Third it can be hard to decide who win between CPU and device access
> >> > when it comes to updating thing like last CPU id.
> >> >
> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> > If we were to add something the CPU id field in flags of struct page
> >> > would not be big enough so this can have repercusion on struct page
> >> > size. This is not an easy sell.
> >> >
> >> > They are other issues i can't think of right now. I think for now it
> >>
> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> demonstrate the whole solution works fine.
> >
> > I am working with NVidia close source driver team to make sure that it works
> > well for them. I am also working on nouveau open source driver for same NVidia
> > hardware thought it will be of less use as what is missing there is a solid
> > open source userspace to leverage this. Nonetheless open source driver are in
> > the work.
>
> Can you point to the nouveau patches? I still find these HMM patches
> un-reviewable without an upstream consumer.
So I pushed a branch with WIP for nouveau to use HMM:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
The top 16 patches are HMM related (implementing the logic inside the driver to
use HMM). The next 16 patches are hardware specific patches and some nouveau
changes needed to allow page faults.
It is enough to have a simple malloc test case working:
https://cgit.freedesktop.org/~glisse/compote
There are two programs here: the old one is the existing way you use a GPU for
compute tasks, while the new one is what HMM allows you to achieve, ie using
malloc memory directly.
I haven't added the device memory support yet; it is in the works and I will push
updates to this branch and repo for that. Probably next week if no pressing bug
preempts my time.
There is a lot of ugliness in all this and I don't expect this to be what ends up
upstream. Right now there is a large rework of the nouveau vm (virtual memory)
code happening, to completely redo how we do address space management within
nouveau. This work is a prerequisite for a clean implementation of HMM inside
nouveau (it will also lift the 40-bit address space limitation that exists today
inside the nouveau driver). Once that work lands I will work on a clean
upstreamable implementation for nouveau to use HMM, as well as userspace to
leverage it (it is a requirement for an upstream GPU driver to have open source
userspace that makes use of its features). All this is a lot of work and there
are not many people working on this.
There are other initiatives under way related to this that I cannot talk about
publicly, but if they bear fruit they might help speed all this up.
Jérôme
On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
>> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> >> On 2017/7/20 23:03, Jerome Glisse wrote:
>> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>
> [...]
>
>> >> > Second device driver are not integrated that closely within mm and the
>> >> > scheduler kernel code to allow to efficiently plug in device access
>> >> > notification to page (ie to update struct page so that numa worker
>> >> > thread can migrate memory base on accurate informations).
>> >> >
>> >> > Third it can be hard to decide who win between CPU and device access
>> >> > when it comes to updating thing like last CPU id.
>> >> >
>> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
>> >> > If we were to add something the CPU id field in flags of struct page
>> >> > would not be big enough so this can have repercusion on struct page
>> >> > size. This is not an easy sell.
>> >> >
>> >> > They are other issues i can't think of right now. I think for now it
>> >>
>> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>> >> demonstrate the whole solution works fine.
>> >
>> > I am working with NVidia close source driver team to make sure that it works
>> > well for them. I am also working on nouveau open source driver for same NVidia
>> > hardware thought it will be of less use as what is missing there is a solid
>> > open source userspace to leverage this. Nonetheless open source driver are in
>> > the work.
>>
>> Can you point to the nouveau patches? I still find these HMM patches
>> un-reviewable without an upstream consumer.
>
> So i pushed a branch with WIP for nouveau to use HMM:
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>
Nice to see that.
Btw, do you have any plan for a CDM-HMM driver? CPU can write to
Device memory directly without extra copy.
--
Thanks,
Bob Liu
> Top 16 patches are HMM related (implementic logic inside the driver to use
> HMM). The next 16 patches are hardware specific patches and some nouveau
> changes needed to allow page fault.
>
> It is enough to have simple malloc test case working:
>
> https://cgit.freedesktop.org/~glisse/compote
>
> There is 2 program here the old one is existing way you use GPU for compute
> task while the new one is what HMM allow to achieve ie use malloc memory
> directly.
>
>
> I haven't added yet the device memory support it is in work and i will push
> update to this branch and repo for that. Probably next week if no pressing
> bug preempt my time.
>
>
> So there is a lot of ugliness in all this and i don't expect this to be what
> end up upstream. Right now there is a large rework of nouveau vm (virtual
> memory) code happening to rework completely how we do address space management
> within nouveau. This work is prerequisite for a clean implementation for HMM
> inside nouveau (it will also lift the 40bits address space limitation that
> exist today inside nouveau driver). Once that work land i will work on clean
> upstreamable implementation for nouveau to use HMM as well as userspace to
> leverage it (this is requirement for upstream GPU driver to have open source
> userspace that make use of features). All this is a lot of work and there is
> not many people working on this.
>
>
> They are other initiatives under way related to this that i can not talk about
> publicly but if they bare fruit they might help to speedup all this.
>
> Jérôme
>
On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
> > On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> >> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
> >
> > [...]
> >
> >> >> > Second device driver are not integrated that closely within mm and the
> >> >> > scheduler kernel code to allow to efficiently plug in device access
> >> >> > notification to page (ie to update struct page so that numa worker
> >> >> > thread can migrate memory base on accurate informations).
> >> >> >
> >> >> > Third it can be hard to decide who win between CPU and device access
> >> >> > when it comes to updating thing like last CPU id.
> >> >> >
> >> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> >> > If we were to add something the CPU id field in flags of struct page
> >> >> > would not be big enough so this can have repercusion on struct page
> >> >> > size. This is not an easy sell.
> >> >> >
> >> >> > They are other issues i can't think of right now. I think for now it
> >> >>
> >> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> >> demonstrate the whole solution works fine.
> >> >
> >> > I am working with NVidia close source driver team to make sure that it works
> >> > well for them. I am also working on nouveau open source driver for same NVidia
> >> > hardware thought it will be of less use as what is missing there is a solid
> >> > open source userspace to leverage this. Nonetheless open source driver are in
> >> > the work.
> >>
> >> Can you point to the nouveau patches? I still find these HMM patches
> >> un-reviewable without an upstream consumer.
> >
> > So i pushed a branch with WIP for nouveau to use HMM:
> >
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >
>
> Nice to see that.
> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> Device memory directly without extra copy.
Yes, nouveau CDM support on PPC (which is the only CDM platform commercially
available today) is on the TODO list. Note that the driver changes for CDM are
minimal (probably less than 100 lines of code). From the driver point of view
this is memory, and it doesn't matter whether it is CDM or not.
The real burden is on the application developers, who need to update their code
to leverage this.
Also, as a data point, you want to avoid CPU access to CDM device memory as much
as possible. The overhead for single cache line accesses is high (this is PCIE
or a derivative protocol, and it is a packet protocol).
Cheers,
Jérôme
On 2017/9/12 7:36, Jerome Glisse wrote:
> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
>>>>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>>>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>>
>>> [...]
>>>
>>>>>>> Second device driver are not integrated that closely within mm and the
>>>>>>> scheduler kernel code to allow to efficiently plug in device access
>>>>>>> notification to page (ie to update struct page so that numa worker
>>>>>>> thread can migrate memory base on accurate informations).
>>>>>>>
>>>>>>> Third it can be hard to decide who win between CPU and device access
>>>>>>> when it comes to updating thing like last CPU id.
>>>>>>>
>>>>>>> Fourth there is no such thing like device id ie equivalent of CPU id.
>>>>>>> If we were to add something the CPU id field in flags of struct page
>>>>>>> would not be big enough so this can have repercusion on struct page
>>>>>>> size. This is not an easy sell.
>>>>>>>
>>>>>>> They are other issues i can't think of right now. I think for now it
>>>>>>
>>>>>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>>>>>> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>>>>>> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>>>>>> demonstrate the whole solution works fine.
>>>>>
>>>>> I am working with NVidia close source driver team to make sure that it works
>>>>> well for them. I am also working on nouveau open source driver for same NVidia
>>>>> hardware thought it will be of less use as what is missing there is a solid
>>>>> open source userspace to leverage this. Nonetheless open source driver are in
>>>>> the work.
>>>>
>>>> Can you point to the nouveau patches? I still find these HMM patches
>>>> un-reviewable without an upstream consumer.
>>>
>>> So i pushed a branch with WIP for nouveau to use HMM:
>>>
>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>>>
>>
>> Nice to see that.
>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> Device memory directly without extra copy.
>
> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> available today) is on the TODO list. Note that the driver changes for CDM
> are minimal (probably less than 100 lines of code). From the driver point
> of view this is memory and it doesn't matter if it is CDM or not.
>
> The real burden is on the application developpers who need to update their
> code to leverage this.
>
Why is it not transparent to the application?
The application just uses system malloc() and doesn't care whether the data is copied or not.
>
> Also as a data point you want to avoid CPU access to CDM device memory as
> much as possible. The overhead for single cache line access are high (this
> is PCIE or derivative protocol and it is a packet protocol).
>
Thank you for the hint; we are going to follow cdm-hmm since HMM has already been merged upstream.
--
Thanks,
Bob
On Tue, Sep 12, 2017 at 09:02:19AM +0800, Bob Liu wrote:
> On 2017/9/12 7:36, Jerome Glisse wrote:
> > On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
> >>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> >>>>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >>>>>> On 2017/7/20 23:03, Jerome Glisse wrote:
> >>>>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >>>>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
> >>>>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >>>>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>>>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>>>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
> >>>
> >>> [...]
> >>>
> >>>>>>> Second device driver are not integrated that closely within mm and the
> >>>>>>> scheduler kernel code to allow to efficiently plug in device access
> >>>>>>> notification to page (ie to update struct page so that numa worker
> >>>>>>> thread can migrate memory base on accurate informations).
> >>>>>>>
> >>>>>>> Third it can be hard to decide who win between CPU and device access
> >>>>>>> when it comes to updating thing like last CPU id.
> >>>>>>>
> >>>>>>> Fourth there is no such thing like device id ie equivalent of CPU id.
> >>>>>>> If we were to add something the CPU id field in flags of struct page
> >>>>>>> would not be big enough so this can have repercusion on struct page
> >>>>>>> size. This is not an easy sell.
> >>>>>>>
> >>>>>>> They are other issues i can't think of right now. I think for now it
> >>>>>>
> >>>>>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >>>>>> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >>>>>> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >>>>>> demonstrate the whole solution works fine.
> >>>>>
> >>>>> I am working with NVidia close source driver team to make sure that it works
> >>>>> well for them. I am also working on nouveau open source driver for same NVidia
> >>>>> hardware thought it will be of less use as what is missing there is a solid
> >>>>> open source userspace to leverage this. Nonetheless open source driver are in
> >>>>> the work.
> >>>>
> >>>> Can you point to the nouveau patches? I still find these HMM patches
> >>>> un-reviewable without an upstream consumer.
> >>>
> >>> So i pushed a branch with WIP for nouveau to use HMM:
> >>>
> >>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >>>
> >>
> >> Nice to see that.
> >> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >> Device memory directly without extra copy.
> >
> > Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> > available today) is on the TODO list. Note that the driver changes for CDM
> > are minimal (probably less than 100 lines of code). From the driver point
> > of view this is memory and it doesn't matter if it is CDM or not.
> >
> > The real burden is on the application developpers who need to update their
> > code to leverage this.
> >
>
> Why it's not transparent to application?
> Application just use system malloc() and don't care whether the data is copied or not.
Porting today's software to malloc/mmap is easy, and it applies to both non-CDM
and CDM hardware.
So malloc/mmap is a given. What I mean is that having a CPU capable of doing
cache coherent access to device memory is a new thing. It never existed before,
and thus no one has ever thought about how to take advantage of it, ie there is
no existing program designed with that in mind.
Cheers,
Jérôme
On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <[email protected]> wrote:
> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
>> > On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> >> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
>> >> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> >> >> On 2017/7/20 23:03, Jerome Glisse wrote:
>> >> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>> >
>> > [...]
>> >
>> >> >> > Second device driver are not integrated that closely within mm and the
>> >> >> > scheduler kernel code to allow to efficiently plug in device access
>> >> >> > notification to page (ie to update struct page so that numa worker
>> >> >> > thread can migrate memory base on accurate informations).
>> >> >> >
>> >> >> > Third it can be hard to decide who win between CPU and device access
>> >> >> > when it comes to updating thing like last CPU id.
>> >> >> >
>> >> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
>> >> >> > If we were to add something the CPU id field in flags of struct page
>> >> >> > would not be big enough so this can have repercusion on struct page
>> >> >> > size. This is not an easy sell.
>> >> >> >
>> >> >> > They are other issues i can't think of right now. I think for now it
>> >> >>
>> >> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>> >> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>> >> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>> >> >> demonstrate the whole solution works fine.
>> >> >
>> >> > I am working with NVidia close source driver team to make sure that it works
>> >> > well for them. I am also working on nouveau open source driver for same NVidia
>> >> > hardware thought it will be of less use as what is missing there is a solid
>> >> > open source userspace to leverage this. Nonetheless open source driver are in
>> >> > the work.
>> >>
>> >> Can you point to the nouveau patches? I still find these HMM patches
>> >> un-reviewable without an upstream consumer.
>> >
>> > So i pushed a branch with WIP for nouveau to use HMM:
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>> >
>>
>> Nice to see that.
>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> Device memory directly without extra copy.
>
> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> available today) is on the TODO list. Note that the driver changes for CDM
> are minimal (probably less than 100 lines of code). From the driver point
> of view this is memory and it doesn't matter if it is CDM or not.
>
It seems we have to migrate/copy memory between system memory and
device memory even in the HMM-CDM solution.
Because device memory is not added into the buddy system, the page fault
for a normal malloc() always allocates memory from system memory!
If the device then accesses the same virtual address, the data is copied
to device memory.
Correct me if I misunderstand something.
@Balbir, how do you plan to make zero-copy work if using HMM-CDM?
--
Thanks,
Bob
On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <[email protected]> wrote:
> > On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
> >> > On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >> >> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> >> >> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> >> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> >> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
> >> >
> >> > [...]
> >> >
> >> >> >> > Second device driver are not integrated that closely within mm and the
> >> >> >> > scheduler kernel code to allow to efficiently plug in device access
> >> >> >> > notification to page (ie to update struct page so that numa worker
> >> >> >> > thread can migrate memory base on accurate informations).
> >> >> >> >
> >> >> >> > Third it can be hard to decide who win between CPU and device access
> >> >> >> > when it comes to updating thing like last CPU id.
> >> >> >> >
> >> >> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> >> >> > If we were to add something the CPU id field in flags of struct page
> >> >> >> > would not be big enough so this can have repercusion on struct page
> >> >> >> > size. This is not an easy sell.
> >> >> >> >
> >> >> >> > They are other issues i can't think of right now. I think for now it
> >> >> >>
> >> >> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> >> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> >> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> >> >> demonstrate the whole solution works fine.
> >> >> >
> >> >> > I am working with NVidia close source driver team to make sure that it works
> >> >> > well for them. I am also working on nouveau open source driver for same NVidia
> >> >> > hardware thought it will be of less use as what is missing there is a solid
> >> >> > open source userspace to leverage this. Nonetheless open source driver are in
> >> >> > the work.
> >> >>
> >> >> Can you point to the nouveau patches? I still find these HMM patches
> >> >> un-reviewable without an upstream consumer.
> >> >
> >> > So i pushed a branch with WIP for nouveau to use HMM:
> >> >
> >> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >> >
> >>
> >> Nice to see that.
> >> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >> Device memory directly without extra copy.
> >
> > Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> > available today) is on the TODO list. Note that the driver changes for CDM
> > are minimal (probably less than 100 lines of code). From the driver point
> > of view this is memory and it doesn't matter if it is CDM or not.
> >
>
> It seems have to migrate/copy memory between system-memory and
> device-memory even in HMM-CDM solution.
> Because device-memory is not added into buddy system, the page fault
> for normal malloc() always allocate memory from system-memory!!
> If the device then access the same virtual address, the data is copied
> to device-memory.
>
> Correct me if I misunderstand something.
> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
Device can access system memory so copy to device is _not_ mandatory. Copying
data to device is for performance only, ie the device driver takes hints from
userspace and monitors device activity to decide which memory should be
migrated to device memory to maximize performance.
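As a toy illustration of such a policy (all names are hypothetical, this is
not code from the patchset), the driver-side decision might look like:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sketch only: a driver-side heuristic deciding whether a page is
 * worth migrating to device memory, based on a userspace placement hint
 * and access counts the driver gathered on its own. */
struct toy_page_stats {
	unsigned int device_accesses;	/* accesses seen from the device */
	unsigned int cpu_accesses;	/* accesses seen from the CPU */
	bool user_hint_device;		/* userspace asked for device placement */
};

/* Migrate when userspace hinted device placement, or when the device
 * clearly dominates access to the page. */
static bool toy_should_migrate_to_device(const struct toy_page_stats *s)
{
	if (s->user_hint_device)
		return true;
	return s->device_accesses > 4 * s->cpu_accesses;
}
```

The real decision lives entirely in the device driver; the point is only
that the inputs (hints, device access counts) are driver-private, which is
why the core mm cannot make this call by itself.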
Moreover, in some previous versions of the HMM patchset we had a helper that
allowed device memory to be allocated directly on device page fault. I intend
to post this helper again. With that helper you can have zero copy when the
device is the first to access the memory.
The plan is to get what we have today working properly with the open source
driver and make it perform well. Once we get some experience with real
workloads we might look into allowing CPU page faults to be directed to
device memory, but at this time I don't think we need this.
Cheers,
Jérôme
On 2017/9/27 0:16, Jerome Glisse wrote:
> On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
>> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <[email protected]> wrote:
>>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
>>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
[...]
>>>>> So i pushed a branch with WIP for nouveau to use HMM:
>>>>>
>>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>>>>>
>>>>
>>>> Nice to see that.
>>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>>>> Device memory directly without extra copy.
>>>
>>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
>>> available today) is on the TODO list. Note that the driver changes for CDM
>>> are minimal (probably less than 100 lines of code). From the driver point
>>> of view this is memory and it doesn't matter if it is CDM or not.
>>>
>>
>> It seems have to migrate/copy memory between system-memory and
>> device-memory even in HMM-CDM solution.
>> Because device-memory is not added into buddy system, the page fault
>> for normal malloc() always allocate memory from system-memory!!
>> If the device then access the same virtual address, the data is copied
>> to device-memory.
>>
>> Correct me if I misunderstand something.
>> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
>
> Device can access system memory so copy to device is _not_ mandatory. Copying
> data to device is for performance only ie the device driver take hint from
> userspace and monitor device activity to decide which memory should be migrated
> to device memory to maximize performance.
>
> Moreover in some previous version of the HMM patchset we had an helper that
Could you point out which version? I'd like to have a look.
> allowed to directly allocate device memory on device page fault. I intend to
> post this helper again. With that helper you can have zero copy when device
> is the first to access the memory.
>
> Plan is to get what we have today work properly with the open source driver
> and make it perform well. Once we get some experience with real workload we
> might look into allowing CPU page fault to be directed to device memory but
> at this time i don't think we need this.
>
For us, we need this feature where a CPU page fault can be directed to device
memory, so that we don't need to copy data from system memory to device memory.
Do you have any suggestions on the implementation? I'll try to make a prototype
patch.
--
Thanks,
Bob
On Sat, Sep 30, 2017 at 10:57:38AM +0800, Bob Liu wrote:
> On 2017/9/27 0:16, Jerome Glisse wrote:
> > On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
> >> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <[email protected]> wrote:
> >>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
> >>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> [...]
> >>>>> So i pushed a branch with WIP for nouveau to use HMM:
> >>>>>
> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >>>>>
> >>>>
> >>>> Nice to see that.
> >>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >>>> Device memory directly without extra copy.
> >>>
> >>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> >>> available today) is on the TODO list. Note that the driver changes for CDM
> >>> are minimal (probably less than 100 lines of code). From the driver point
> >>> of view this is memory and it doesn't matter if it is CDM or not.
> >>>
> >>
> >> It seems have to migrate/copy memory between system-memory and
> >> device-memory even in HMM-CDM solution.
> >> Because device-memory is not added into buddy system, the page fault
> >> for normal malloc() always allocate memory from system-memory!!
> >> If the device then access the same virtual address, the data is copied
> >> to device-memory.
> >>
> >> Correct me if I misunderstand something.
> >> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
> >
> > Device can access system memory so copy to device is _not_ mandatory. Copying
> > data to device is for performance only ie the device driver take hint from
> > userspace and monitor device activity to decide which memory should be migrated
> > to device memory to maximize performance.
> >
> > Moreover in some previous version of the HMM patchset we had an helper that
>
> Could you point in which version? I'd like to have a look.
I will need to dig in.
>
> > allowed to directly allocate device memory on device page fault. I intend to
> > post this helper again. With that helper you can have zero copy when device
> > is the first to access the memory.
> >
> > Plan is to get what we have today work properly with the open source driver
> > and make it perform well. Once we get some experience with real workload we
> > might look into allowing CPU page fault to be directed to device memory but
> > at this time i don't think we need this.
> >
>
> For us, we need this feature that CPU page fault can be direct to device memory.
> So that don't need to copy data from system memory to device memory.
> Do you have any suggestion on the implementation? I'll try to make a prototype patch.
Why do you need that? What is the device and what are the requirements?
Jérôme
On Thu, Nov 16, 2017 at 02:41:39PM -0800, chetan L wrote:
> On Thu, Nov 16, 2017 at 1:29 PM, Jerome Glisse <[email protected]> wrote:
>
> >
> > For the NUMA discussion this is related to CPU less node ie not wanting
> > to add any more CPU less node (node with only memory) and they are other
> > aspect too. For instance you do not necessarily have good informations
> > from the device to know if a page is access a lot by the device (this
> > kind of information is often only accessible by the device driver). Thus
>
> @Jerome - one comment w.r.t 'do not necessarily have good info on
> device access'.
>
> So you could be assuming a few things here :). CCIX extends the CPU
> complex's coherency domain(it is now a single/unified coherency
> domain). The CCIX-EP (lets say an accelerator/XPU or a NIC or a combo)
> is now a true peer w.r.t the host-numa-node(s) (aka 1st class
> citizen). I don't know how much info was revealed at the latest ARM
> techcon where CCIX was presented. So I cannot divulge any further
> details until I see that slide deck. However, you can safely assume
> that the host will have *all* the info w.r.t the device-access and
> vice-versa.
I do have access to CCIX; last time I read the draft, a few months ago,
my understanding was that there is no mechanism to differentiate between
devices behind the root complex. So when you do autonuma you don't know
which of your CCIX devices is the one faulting, hence you can not keep
track of that inside struct page for autonuma (ignoring the issue with
the lack of a CPU id for each device).
This is what I mean by NUMA not being a good fit as it is. Yes, everything
is cache coherent and all, but that is just a small part of what is
needed to make autonuma as it is today work.
Jérôme
On Thu, Nov 16, 2017 at 1:29 PM, Jerome Glisse <[email protected]> wrote:
>
> For the NUMA discussion this is related to CPU less node ie not wanting
> to add any more CPU less node (node with only memory) and they are other
> aspect too. For instance you do not necessarily have good informations
> from the device to know if a page is access a lot by the device (this
> kind of information is often only accessible by the device driver). Thus
@Jerome - one comment w.r.t. 'do not necessarily have good info on
device access'.
So you could be assuming a few things here :). CCIX extends the CPU
complex's coherency domain (it is now a single/unified coherency
domain). The CCIX-EP (let's say an accelerator/XPU or a NIC or a combo)
is now a true peer w.r.t. the host NUMA node(s) (aka a 1st class
citizen). I don't know how much info was revealed at the latest ARM
techcon where CCIX was presented, so I cannot divulge any further
details until I see that slide deck. However, you can safely assume
that the host will have *all* the info w.r.t. the device access and
vice-versa.
Chetan
On Wed, Nov 15, 2017 at 07:29:10PM -0800, chetan L wrote:
> On Wed, Nov 15, 2017 at 7:23 PM, chetan L <[email protected]> wrote:
> > On Wed, Nov 15, 2017 at 6:44 PM, Jerome Glisse <[email protected]> wrote:
> >> On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
> >>> >> You may think it as a CCIX device or CAPI device.
> >>> >> The requirement is eliminate any extra copy.
> >>> >> A typical usecase/requirement is malloc() and madvise() allocate from
> >>> >> device memory, then CPU write data to device memory directly and
> >>> >> trigger device to read the data/do calculation.
> >>> >
> >>> > I suggest you rely on the device driver userspace API to do a migration after malloc
> >>> > then. Something like:
> >>> > ptr = malloc(size);
> >>> > my_device_migrate(ptr, size);
> >>> >
> >>> > Which would call an ioctl of the device driver which itself would migrate memory or
> >>> > allocate device memory for the range if pointer return by malloc is not yet back by
> >>> > any pages.
> >>> >
> >>>
> >>> So for CCIX, I don't think there is going to be an inline device
> >>> driver that would allocate any memory for you. The expansion memory
> >>> will become part of the system memory as part of the boot process. So,
> >>> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
> >>> total system mem will be 260GB.
> >>>
> >>> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
> >>> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
> >>> the ZONE_DEV range. But for a malloc, it will/can use that range.
> >>
> >> HMM zone device memory would work with that, you just need to teach the
> >> platform to identify this memory zone and not hotplug it. Again you
> >> should rely on specific device driver API to allocate this memory.
> >>
> >
> > @Jerome - a new linux-accelerator's list has just been created. I have
> > CC'd that list since we have overlapping interests w.r.t CCIX.
> >
> > I cannot comment on surprise add/remove as of now ... will cross the
> > bridge later.
Note that this is not hotplug strictly speaking. The design today is that it
is the device driver that registers the memory. From the kernel point of view
this is a hotplug, but for many of the target architectures there is no
real hotplug, ie the device and its memory were present at boot time.
Like I said, I think for now we are better off having each device manage and
register its memory. HMM provides a toolbox for that. If we see a common trend
across multiple devices then we can think about making something more
generic.
For the NUMA discussion, this is related to CPU-less nodes, ie not wanting
to add any more CPU-less nodes (nodes with only memory), and there are other
aspects too. For instance, you do not necessarily have good information
from the device to know if a page is accessed a lot by the device (this
kind of information is often only accessible by the device driver). Thus
the automatic NUMA placement is useless here. Not to mention that for it
to work we would need to change how it currently works (iirc there is an
issue when you do not have a CPU id you can use).
Cheers,
Jérôme
CC'ing : [email protected]
On Wed, Nov 15, 2017 at 6:44 PM, Jerome Glisse <[email protected]> wrote:
> On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
>> >> You may think it as a CCIX device or CAPI device.
>> >> The requirement is eliminate any extra copy.
>> >> A typical usecase/requirement is malloc() and madvise() allocate from
>> >> device memory, then CPU write data to device memory directly and
>> >> trigger device to read the data/do calculation.
>> >
>> > I suggest you rely on the device driver userspace API to do a migration after malloc
>> > then. Something like:
>> > ptr = malloc(size);
>> > my_device_migrate(ptr, size);
>> >
>> > Which would call an ioctl of the device driver which itself would migrate memory or
>> > allocate device memory for the range if pointer return by malloc is not yet back by
>> > any pages.
>> >
>>
>> So for CCIX, I don't think there is going to be an inline device
>> driver that would allocate any memory for you. The expansion memory
>> will become part of the system memory as part of the boot process. So,
>> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
>> total system mem will be 260GB.
>>
>> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
>> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
>> the ZONE_DEV range. But for a malloc, it will/can use that range.
>
> HMM zone device memory would work with that, you just need to teach the
> platform to identify this memory zone and not hotplug it. Again you
> should rely on specific device driver API to allocate this memory.
>
@Jerome - a new linux-accelerator's list has just been created. I have
CC'd that list since we have overlapping interests w.r.t CCIX.
I cannot comment on surprise add/remove as of now ... will cross the
bridge later.
>> > There has been several discussions already about madvise/mbind/set_mempolicy/
>> > move_pages and at this time i don't think we want to add or change any of them to
>> > understand device memory. My personal opinion is that we first need to have enough
>>
>> We will visit these APIs when we are more closer to building exotic
>> CCIX devices. And the plan is to present/express the CCIX proximity
>> attributes just like a NUMA node-proximity attribute today. That way
>> there would be minimal disruptions to the existing OS ecosystem.
>
> NUMA have been rejected previously see CDM/CAPI threads. So i don't see
> it being accepted for CCIX either. My belief is that we want to hide this
> inside device driver and only once we see multiple devices all doing the
> same kind of thing we should move toward building something generic that
> catter to CCIX devices.
Thanks for pointing out the NUMA thingy. I will visit the CDM/CAPI
threads to understand what was discussed before commenting further.
Chetan
On Wed, Nov 15, 2017 at 7:23 PM, chetan L <[email protected]> wrote:
> CC'ing : [email protected]
>
Sorry, CC'ing the correct list this time: [email protected]
> On Wed, Nov 15, 2017 at 6:44 PM, Jerome Glisse <[email protected]> wrote:
>> On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
>>> >> You may think it as a CCIX device or CAPI device.
>>> >> The requirement is eliminate any extra copy.
>>> >> A typical usecase/requirement is malloc() and madvise() allocate from
>>> >> device memory, then CPU write data to device memory directly and
>>> >> trigger device to read the data/do calculation.
>>> >
>>> > I suggest you rely on the device driver userspace API to do a migration after malloc
>>> > then. Something like:
>>> > ptr = malloc(size);
>>> > my_device_migrate(ptr, size);
>>> >
>>> > Which would call an ioctl of the device driver which itself would migrate memory or
>>> > allocate device memory for the range if pointer return by malloc is not yet back by
>>> > any pages.
>>> >
>>>
>>> So for CCIX, I don't think there is going to be an inline device
>>> driver that would allocate any memory for you. The expansion memory
>>> will become part of the system memory as part of the boot process. So,
>>> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
>>> total system mem will be 260GB.
>>>
>>> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
>>> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
>>> the ZONE_DEV range. But for a malloc, it will/can use that range.
>>
>> HMM zone device memory would work with that, you just need to teach the
>> platform to identify this memory zone and not hotplug it. Again you
>> should rely on specific device driver API to allocate this memory.
>>
>
> @Jerome - a new linux-accelerator's list has just been created. I have
> CC'd that list since we have overlapping interests w.r.t CCIX.
>
> I cannot comment on surprise add/remove as of now ... will cross the
> bridge later.
>
>
>>> > There has been several discussions already about madvise/mbind/set_mempolicy/
>>> > move_pages and at this time i don't think we want to add or change any of them to
>>> > understand device memory. My personal opinion is that we first need to have enough
>>>
>>> We will visit these APIs when we are more closer to building exotic
>>> CCIX devices. And the plan is to present/express the CCIX proximity
>>> attributes just like a NUMA node-proximity attribute today. That way
>>> there would be minimal disruptions to the existing OS ecosystem.
>>
>> NUMA have been rejected previously see CDM/CAPI threads. So i don't see
>> it being accepted for CCIX either. My belief is that we want to hide this
>> inside device driver and only once we see multiple devices all doing the
>> same kind of thing we should move toward building something generic that
>> catter to CCIX devices.
>
>
> Thanks for pointing out the NUMA thingy. I will visit the CDM/CAPI
> threads to understand what was discussed before commenting further.
>
On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
> >> You may think it as a CCIX device or CAPI device.
> >> The requirement is eliminate any extra copy.
> >> A typical usecase/requirement is malloc() and madvise() allocate from
> >> device memory, then CPU write data to device memory directly and
> >> trigger device to read the data/do calculation.
> >
> > I suggest you rely on the device driver userspace API to do a migration after malloc
> > then. Something like:
> > ptr = malloc(size);
> > my_device_migrate(ptr, size);
> >
> > Which would call an ioctl of the device driver which itself would migrate memory or
> > allocate device memory for the range if pointer return by malloc is not yet back by
> > any pages.
> >
>
> So for CCIX, I don't think there is going to be an inline device
> driver that would allocate any memory for you. The expansion memory
> will become part of the system memory as part of the boot process. So,
> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
> total system mem will be 260GB.
>
> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
> the ZONE_DEV range. But for a malloc, it will/can use that range.
HMM zone device memory would work with that; you just need to teach the
platform to identify this memory zone and not hotplug it. Again, you
should rely on a specific device driver API to allocate this memory.
> > There has been several discussions already about madvise/mbind/set_mempolicy/
> > move_pages and at this time i don't think we want to add or change any of them to
> > understand device memory. My personal opinion is that we first need to have enough
>
> We will visit these APIs when we are more closer to building exotic
> CCIX devices. And the plan is to present/express the CCIX proximity
> attributes just like a NUMA node-proximity attribute today. That way
> there would be minimal disruptions to the existing OS ecosystem.
NUMA has been rejected previously, see the CDM/CAPI threads, so I don't see
it being accepted for CCIX either. My belief is that we want to hide this
inside the device driver, and only once we see multiple devices all doing the
same kind of thing should we move toward building something generic that
caters to CCIX devices.
Jérôme
>> You may think it as a CCIX device or CAPI device.
>> The requirement is eliminate any extra copy.
>> A typical usecase/requirement is malloc() and madvise() allocate from
>> device memory, then CPU write data to device memory directly and
>> trigger device to read the data/do calculation.
>
> I suggest you rely on the device driver userspace API to do a migration after malloc
> then. Something like:
> ptr = malloc(size);
> my_device_migrate(ptr, size);
>
> Which would call an ioctl of the device driver which itself would migrate memory or
> allocate device memory for the range if pointer return by malloc is not yet back by
> any pages.
>
So for CCIX, I don't think there is going to be an inline device
driver that would allocate any memory for you. The expansion memory
will become part of the system memory as part of the boot process. So,
if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
total system mem will be 260GB.
Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
the ZONE_DEV range. But for a malloc, it will/can use that range.
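To make the layout concrete (a toy sketch using the numbers above; the
constants and the function name are made up for illustration), the check an
allocator would need to keep kmalloc out of the expansion range is just a
range test:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy sketch of the example layout: host DDR occupies [0, 256GB) and
 * the CCIX expansion memory occupies [256GB, 260GB). kmalloc would be
 * steered away from the expansion range, while malloc-backed pages
 * may be allowed to use it. */
#define GB (1ULL << 30)

static const uint64_t ccix_exp_start = 256 * GB;
static const uint64_t ccix_exp_end   = 260 * GB;	/* exclusive */

static bool paddr_in_ccix_expansion(uint64_t paddr)
{
	return paddr >= ccix_exp_start && paddr < ccix_exp_end;
}
```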
> There has been several discussions already about madvise/mbind/set_mempolicy/
> move_pages and at this time i don't think we want to add or change any of them to
> understand device memory. My personal opinion is that we first need to have enough
We will visit these APIs when we are closer to building exotic
CCIX devices. And the plan is to present/express the CCIX proximity
attributes just like a NUMA node-proximity attribute today. That way
there would be minimal disruption to the existing OS ecosystem.
Chetan
On Wed, Oct 11, 2017 at 09:15:57PM +0800, Bob Liu wrote:
> On Sun, Oct 1, 2017 at 6:49 AM, Jerome Glisse <[email protected]> wrote:
> > On Sat, Sep 30, 2017 at 10:57:38AM +0800, Bob Liu wrote:
> >> On 2017/9/27 0:16, Jerome Glisse wrote:
> >> > On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
> >> >> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <[email protected]> wrote:
> >> >>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >> >>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
> >> >>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >> >>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
> >> [...]
> >> >>>>> So i pushed a branch with WIP for nouveau to use HMM:
> >> >>>>>
> >> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >> >>>>>
> >> >>>>
> >> >>>> Nice to see that.
> >> >>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >> >>>> Device memory directly without extra copy.
> >> >>>
> >> >>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> >> >>> available today) is on the TODO list. Note that the driver changes for CDM
> >> >>> are minimal (probably less than 100 lines of code). From the driver point
> >> >>> of view this is memory and it doesn't matter if it is CDM or not.
> >> >>>
> >> >>
> >> >> It seems have to migrate/copy memory between system-memory and
> >> >> device-memory even in HMM-CDM solution.
> >> >> Because device-memory is not added into buddy system, the page fault
> >> >> for normal malloc() always allocate memory from system-memory!!
> >> >> If the device then access the same virtual address, the data is copied
> >> >> to device-memory.
> >> >>
> >> >> Correct me if I misunderstand something.
> >> >> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
> >> >
> >> > Device can access system memory so copy to device is _not_ mandatory. Copying
> >> > data to device is for performance only ie the device driver take hint from
> >> > userspace and monitor device activity to decide which memory should be migrated
> >> > to device memory to maximize performance.
> >> >
> >> > Moreover in some previous version of the HMM patchset we had an helper that
> >>
> >> Could you point in which version? I'd like to have a look.
> >
> > I will need to dig in.
> >
>
> Thank you.
I forgot about this, sorry, I was traveling and I am still catching up. I will
send you those patches once I unearth where I ended up backing them up.
>
> >>
> >> > allowed to directly allocate device memory on device page fault. I intend to
> >> > post this helper again. With that helper you can have zero copy when device
> >> > is the first to access the memory.
> >> >
> >> > Plan is to get what we have today work properly with the open source driver
> >> > and make it perform well. Once we get some experience with real workload we
> >> > might look into allowing CPU page fault to be directed to device memory but
> >> > at this time i don't think we need this.
> >> >
> >>
> >> For us, we need this feature that CPU page fault can be direct to device memory.
> >> So that don't need to copy data from system memory to device memory.
> >> Do you have any suggestion on the implementation? I'll try to make a prototype patch.
> >
> > Why do you need that ? What is the device and what are the requirement ?
> >
>
> You may think it as a CCIX device or CAPI device.
> The requirement is eliminate any extra copy.
> A typical usecase/requirement is malloc() and madvise() allocate from
> device memory, then CPU write data to device memory directly and
> trigger device to read the data/do calculation.
I suggest you rely on the device driver userspace API to do a migration after malloc
then. Something like:
ptr = malloc(size);
my_device_migrate(ptr, size);
This would call an ioctl of the device driver, which itself would migrate
memory or allocate device memory for the range if the pointer returned by
malloc is not yet backed by any pages.
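Spelled out as a self-contained toy (my_device_migrate() and the ioctl it
would wrap are hypothetical, so the driver call is stubbed out here):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sketch of the malloc-then-migrate pattern above. A real driver
 * would expose its own ioctl taking an (addr, size) range; the call is
 * stubbed so the example stands alone. */
static int my_device_migrate(void *addr, size_t size)
{
	/* real code would be something like:
	 * return ioctl(dev_fd, MY_DEV_IOC_MIGRATE,
	 *		&(struct my_dev_range){ addr, size });
	 */
	(void)addr;
	(void)size;
	return 0;	/* pretend the migration succeeded */
}

static void *alloc_on_device(size_t size)
{
	void *ptr = malloc(size);

	if (ptr && my_device_migrate(ptr, size) != 0) {
		free(ptr);
		return NULL;
	}
	return ptr;	/* same pointer: migration is transparent to the CPU */
}
```

Note the CPU keeps using the same virtual address before and after the
migration; only the backing pages move.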
There have been several discussions already about madvise/mbind/set_mempolicy/
move_pages, and at this time I don't think we want to add or change any of them
to understand device memory. My personal opinion is that we first need to have
enough upstream users and understanding of how it is actually used before it
makes sense to try to formalize and define a syscall or change an existing one.
User-facing APIs are set in stone and I don't want to design them by making
broad assumptions on how I think device memory will be used.
So for the time being I think it is better to use the existing device API to
manage and give hints to the kernel on where memory should be (ie should device
memory be used for some range). The first users of this are GPUs and they
already have a lot of ioctls to manage and propagate hints from user space. So
at this time I suggest that you piggyback on an existing ioctl of your device
or add a new ioctl.
Hope this helps.
Jérôme
On Sun, Oct 1, 2017 at 6:49 AM, Jerome Glisse <[email protected]> wrote:
> On Sat, Sep 30, 2017 at 10:57:38AM +0800, Bob Liu wrote:
>> On 2017/9/27 0:16, Jerome Glisse wrote:
>> > On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
>> >> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <[email protected]> wrote:
>> >>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> >>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <[email protected]> wrote:
>> >>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> >>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <[email protected]> wrote:
>> [...]
>> >>>>> So i pushed a branch with WIP for nouveau to use HMM:
>> >>>>>
>> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>> >>>>>
>> >>>>
>> >>>> Nice to see that.
>> >>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> >>>> Device memory directly without extra copy.
>> >>>
>> >>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
>> >>> available today) is on the TODO list. Note that the driver changes for CDM
>> >>> are minimal (probably less than 100 lines of code). From the driver point
>> >>> of view this is memory and it doesn't matter if it is CDM or not.
>> >>>
>> >>
>> >> It seems have to migrate/copy memory between system-memory and
>> >> device-memory even in HMM-CDM solution.
>> >> Because device-memory is not added into buddy system, the page fault
>> >> for normal malloc() always allocate memory from system-memory!!
>> >> If the device then access the same virtual address, the data is copied
>> >> to device-memory.
>> >>
>> >> Correct me if I misunderstand something.
>> >> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
>> >
>> > Device can access system memory so copy to device is _not_ mandatory. Copying
>> > data to device is for performance only ie the device driver take hint from
>> > userspace and monitor device activity to decide which memory should be migrated
>> > to device memory to maximize performance.
>> >
>> > Moreover in some previous version of the HMM patchset we had an helper that
>>
>> Could you point in which version? I'd like to have a look.
>
> I will need to dig in.
>
Thank you.
>>
>> > allowed to directly allocate device memory on device page fault. I intend to
>> > post this helper again. With that helper you can have zero copy when device
>> > is the first to access the memory.
>> >
>> > Plan is to get what we have today work properly with the open source driver
>> > and make it perform well. Once we get some experience with real workload we
>> > might look into allowing CPU page fault to be directed to device memory but
>> > at this time i don't think we need this.
>> >
>>
>> For us, we need this feature that CPU page fault can be direct to device memory.
>> So that don't need to copy data from system memory to device memory.
>> Do you have any suggestion on the implementation? I'll try to make a prototype patch.
>
> Why do you need that ? What is the device and what are the requirement ?
>
You may think of it as a CCIX device or a CAPI device.
The requirement is to eliminate any extra copy.
A typical usecase/requirement is that malloc() and madvise() allocate from
device memory, then the CPU writes data to device memory directly and
triggers the device to read the data/do the calculation.
--
Regards,
--Bob