2020-06-20 04:50:38

by Ralph Campbell

Subject: [PATCH 00/16] mm/hmm/nouveau: THP mapping and migration

These patches apply to linux-5.8.0-rc1. Patches 1-3 should probably go
into 5.8, the others can be queued for 5.9. Patches 4-6 improve the HMM
self tests. Patches 7-8 prepare nouveau for the meat of this series, which
adds support and testing for compound page mapping of system memory
(patches 9-11) and compound page migration to device private memory
(patches 12-16). Since these changes are split across mm core, nouveau,
and testing, I'm guessing Jason Gunthorpe's HMM tree would be appropriate.

Ralph Campbell (16):
mm: fix migrate_vma_setup() src_owner and normal pages
nouveau: fix migrate page regression
nouveau: fix mixed normal and device private page migration
mm/hmm: fix test timeout on slower machines
mm/hmm/test: remove redundant page table invalidate
mm/hmm: test mixed normal and device private migrations
nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static
nouveau/hmm: fault one page at a time
mm/hmm: add output flag for compound page mapping
nouveau/hmm: support mapping large sysmem pages
hmm: add tests for HMM_PFN_COMPOUND flag
mm/hmm: optimize migrate_vma_setup() for holes
mm: support THP migration to device private memory
mm/thp: add THP allocation helper
mm/hmm/test: add self tests for THP migration
nouveau: support THP migration to private memory

drivers/gpu/drm/nouveau/nouveau_dmem.c | 177 +++++---
drivers/gpu/drm/nouveau/nouveau_svm.c | 241 +++++------
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
.../gpu/drm/nouveau/nvkm/subdev/mmu/base.c | 6 +-
.../gpu/drm/nouveau/nvkm/subdev/mmu/priv.h | 2 +
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c | 10 +-
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 3 -
.../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 29 +-
include/linux/gfp.h | 10 +
include/linux/hmm.h | 4 +-
include/linux/migrate.h | 1 +
include/linux/mm.h | 1 +
lib/test_hmm.c | 359 ++++++++++++----
lib/test_hmm_uapi.h | 2 +
mm/hmm.c | 10 +-
mm/huge_memory.c | 46 ++-
mm/internal.h | 1 -
mm/memory.c | 10 +-
mm/memremap.c | 9 +-
mm/migrate.c | 236 +++++++++--
mm/page_alloc.c | 1 +
tools/testing/selftests/vm/hmm-tests.c | 388 +++++++++++++++++-
22 files changed, 1203 insertions(+), 346 deletions(-)

--
2.20.1


2020-06-20 04:50:41

by Ralph Campbell

Subject: [PATCH 02/16] nouveau: fix migrate page regression

The patch to add zero page migration to GPU memory inadvertently included
part of a future change which broke normal page migration to GPU memory
by copying too much data and corrupting GPU memory.
Fix this by passing the number of pages (one) instead of a byte count.

Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
Signed-off-by: Ralph Campbell <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e5c230d9ae24..cc9993837508 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
goto out_free_page;
- if (drm->dmem->migrate.copy_func(drm, page_size(spage),
+ if (drm->dmem->migrate.copy_func(drm, 1,
NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
goto out_dma_unmap;
} else {
--
2.20.1

2020-06-20 04:50:53

by Ralph Campbell

Subject: [PATCH 12/16] mm/hmm: optimize migrate_vma_setup() for holes

When migrating system memory to device private memory, if the source
address range is a valid VMA but has no backing memory, or is backed by
the zero page, the source PFN array entry is marked valid but with no PFN.
This lets the device driver allocate private memory and clear it, then
insert the new device private struct page into the CPU's page tables when
migrate_vma_pages() is called. However, migrate_vma_pages() only inserts
the new page if the VMA is an anonymous range, so there is no point in
telling the device driver to allocate device private memory that will just
be thrown away. Instead, mark the source PFN array entries as not
migrating to avoid this overhead.

Signed-off-by: Ralph Campbell <[email protected]>
---
mm/migrate.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 24535281cea3..87c52e0ee580 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2178,9 +2178,13 @@ static int migrate_vma_collect_hole(unsigned long start,
{
struct migrate_vma *migrate = walk->private;
unsigned long addr;
+ unsigned long flags;
+
+ /* Only allow populating anonymous memory */
+ flags = vma_is_anonymous(walk->vma) ? MIGRATE_PFN_MIGRATE : 0;

for (addr = start; addr < end; addr += PAGE_SIZE) {
- migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
+ migrate->src[migrate->npages] = flags;
migrate->dst[migrate->npages] = 0;
migrate->npages++;
migrate->cpages++;
@@ -2748,7 +2752,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pte_t *ptep;

/* Only allow populating anonymous memory */
- if (!vma_is_anonymous(vma))
+ if (WARN_ON_ONCE(!vma_is_anonymous(vma)))
goto abort;

pgdp = pgd_offset(mm, addr);
--
2.20.1

2020-06-20 04:50:55

by Ralph Campbell

Subject: [PATCH 16/16] nouveau: support THP migration to private memory

Add support for migrating transparent huge pages to and from device
private memory.

Signed-off-by: Ralph Campbell <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 171 +++++++++++++++++--------
drivers/gpu/drm/nouveau/nouveau_svm.c | 11 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
3 files changed, 127 insertions(+), 58 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index f6a806ba3caa..e8c4c0bc78ae 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -46,6 +46,7 @@
*/
#define DMEM_CHUNK_SIZE (2UL << 20)
#define DMEM_CHUNK_NPAGES (DMEM_CHUNK_SIZE >> PAGE_SHIFT)
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)

enum nouveau_aper {
NOUVEAU_APER_VIRT,
@@ -53,7 +54,7 @@ enum nouveau_aper {
NOUVEAU_APER_HOST,
};

-typedef int (*nouveau_migrate_copy_t)(struct nouveau_drm *drm, u64 npages,
+typedef int (*nouveau_migrate_copy_t)(struct nouveau_drm *drm, u32 length,
enum nouveau_aper, u64 dst_addr,
enum nouveau_aper, u64 src_addr);
typedef int (*nouveau_clear_page_t)(struct nouveau_drm *drm, u32 length,
@@ -79,6 +80,7 @@ struct nouveau_dmem {
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
+ struct page *free_huge_pages;
spinlock_t lock;
};

@@ -109,8 +111,13 @@ static void nouveau_dmem_page_free(struct page *page)
struct nouveau_dmem *dmem = chunk->drm->dmem;

spin_lock(&dmem->lock);
- page->zone_device_data = dmem->free_pages;
- dmem->free_pages = page;
+ if (PageHuge(page)) {
+ page->zone_device_data = dmem->free_huge_pages;
+ dmem->free_huge_pages = page;
+ } else {
+ page->zone_device_data = dmem->free_pages;
+ dmem->free_pages = page;
+ }

WARN_ON(!chunk->callocated);
chunk->callocated--;
@@ -136,33 +143,41 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)

static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
struct vm_fault *vmf, struct migrate_vma *args,
- dma_addr_t *dma_addr)
+ dma_addr_t *dma_addr, size_t *sizep)
{
struct device *dev = drm->dev->dev;
struct page *dpage, *spage;
+ unsigned int order;

spage = migrate_pfn_to_page(args->src[0]);
if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
return 0;

- dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+ order = compound_order(spage);
+ if (order)
+ dpage = alloc_transhugepage(vmf->vma, vmf->address);
+ else
+ dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
if (!dpage)
return VM_FAULT_SIGBUS;
+ WARN_ON_ONCE(order != compound_order(dpage));
lock_page(dpage);

- *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ *sizep = page_size(dpage);
+ *dma_addr = dma_map_page(dev, dpage, 0, *sizep, DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
goto error_free_page;

- if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
- NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
+ if (drm->dmem->migrate.copy_func(drm, page_size(spage),
+ NOUVEAU_APER_HOST, *dma_addr, NOUVEAU_APER_VRAM,
+ nouveau_dmem_page_addr(spage)))
goto error_dma_unmap;

args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
return 0;

error_dma_unmap:
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(dev, *dma_addr, page_size(dpage), DMA_BIDIRECTIONAL);
error_free_page:
__free_page(dpage);
return VM_FAULT_SIGBUS;
@@ -173,8 +188,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
struct nouveau_drm *drm = page_to_drm(vmf->page);
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
+ struct page *page;
+ unsigned int order;
unsigned long src = 0, dst = 0;
dma_addr_t dma_addr = 0;
+ size_t size = 0;
vm_fault_t ret;
struct migrate_vma args = {
.vma = vmf->vma,
@@ -185,26 +203,52 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
.src_owner = drm->dev,
};

+ /*
+ * If the page was migrated to the GPU as a huge page, migrate it
+ * back the same way.
+ * FIXME If there is thrashing, maybe we should migrate one page.
+ */
+ page = compound_head(vmf->page);
+ order = compound_order(page);
+ if (order) {
+ args.start &= PAGE_MASK << order;
+ args.end = args.start + (PAGE_SIZE << order);
+ args.src = kcalloc(1U << order, sizeof(*args.src), GFP_KERNEL);
+ if (!args.src)
+ return VM_FAULT_OOM;
+ args.dst = kcalloc(1U << order, sizeof(*args.dst), GFP_KERNEL);
+ if (!args.dst) {
+ ret = VM_FAULT_OOM;
+ goto error_src;
+ }
+ }
+
/*
* FIXME what we really want is to find some heuristic to migrate more
* than just one page on CPU fault. When such fault happens it is very
* likely that more surrounding page will CPU fault too.
*/
- if (migrate_vma_setup(&args) < 0)
- return VM_FAULT_SIGBUS;
- if (!args.cpages)
- return 0;
+ if (migrate_vma_setup(&args) < 0) {
+ ret = VM_FAULT_SIGBUS;
+ goto error_dst;
+ }

- ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
- if (ret || dst == 0)
+ ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr, &size);
+ if (ret)
goto done;

nouveau_fence_new(dmem->migrate.chan, false, &fence);
migrate_vma_pages(&args);
nouveau_dmem_fence_done(&fence);
- dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(drm->dev->dev, dma_addr, size, DMA_BIDIRECTIONAL);
done:
migrate_vma_finalize(&args);
+error_dst:
+ if (args.dst != &dst)
+ kfree(args.dst);
+error_src:
+ if (args.src != &src)
+ kfree(args.src);
return ret;
}

@@ -213,8 +257,8 @@ static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
.migrate_to_ram = nouveau_dmem_migrate_to_ram,
};

-static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+static int nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, bool is_huge,
+ struct page **ppage)
{
struct nouveau_dmem_chunk *chunk;
struct resource *res;
@@ -266,16 +310,20 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
pfn_first = chunk->pagemap.res.start >> PAGE_SHIFT;
page = pfn_to_page(pfn_first);
spin_lock(&drm->dmem->lock);
- for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
- page->zone_device_data = drm->dmem->free_pages;
- drm->dmem->free_pages = page;
- }
+ if (is_huge)
+ prep_compound_page(page, PMD_ORDER);
+ else
+ for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
+ page->zone_device_data = drm->dmem->free_pages;
+ drm->dmem->free_pages = page;
+ }
*ppage = page;
chunk->callocated++;
spin_unlock(&drm->dmem->lock);

- NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
- DMEM_CHUNK_SIZE >> 20);
+ NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+ DMEM_CHUNK_SIZE >> 20, is_huge ? "huge " : "", pfn_first,
+ nouveau_dmem_page_addr(page));

return 0;

@@ -293,14 +341,20 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
}

static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_huge)
{
struct nouveau_dmem_chunk *chunk;
struct page *page = NULL;
int ret;

spin_lock(&drm->dmem->lock);
- if (drm->dmem->free_pages) {
+ if (is_huge && drm->dmem->free_huge_pages) {
+ page = drm->dmem->free_huge_pages;
+ drm->dmem->free_huge_pages = page->zone_device_data;
+ chunk = nouveau_page_to_chunk(page);
+ chunk->callocated++;
+ spin_unlock(&drm->dmem->lock);
+ } else if (!is_huge && drm->dmem->free_pages) {
page = drm->dmem->free_pages;
drm->dmem->free_pages = page->zone_device_data;
chunk = nouveau_page_to_chunk(page);
@@ -308,7 +362,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
spin_unlock(&drm->dmem->lock);
} else {
spin_unlock(&drm->dmem->lock);
- ret = nouveau_dmem_chunk_alloc(drm, &page);
+ ret = nouveau_dmem_chunk_alloc(drm, is_huge, &page);
if (ret)
return NULL;
}
@@ -381,19 +435,18 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
}

static int
-nvc0b5_migrate_copy(struct nouveau_drm *drm, u64 npages,
+nvc0b5_migrate_copy(struct nouveau_drm *drm, u32 length,
enum nouveau_aper dst_aper, u64 dst_addr,
enum nouveau_aper src_aper, u64 src_addr)
{
struct nouveau_channel *chan = drm->dmem->migrate.chan;
- u32 launch_dma = (1 << 9) /* MULTI_LINE_ENABLE. */ |
- (1 << 8) /* DST_MEMORY_LAYOUT_PITCH. */ |
+ u32 launch_dma = (1 << 8) /* DST_MEMORY_LAYOUT_PITCH. */ |
(1 << 7) /* SRC_MEMORY_LAYOUT_PITCH. */ |
(1 << 2) /* FLUSH_ENABLE_TRUE. */ |
(2 << 0) /* DATA_TRANSFER_TYPE_NON_PIPELINED. */;
int ret;

- ret = RING_SPACE(chan, 13);
+ ret = RING_SPACE(chan, 11);
if (ret)
return ret;

@@ -425,17 +478,15 @@ nvc0b5_migrate_copy(struct nouveau_drm *drm, u64 npages,
launch_dma |= 0x00002000; /* DST_TYPE_PHYSICAL. */
}

- BEGIN_NVC0(chan, NvSubCopy, 0x0400, 8);
- OUT_RING (chan, upper_32_bits(src_addr));
- OUT_RING (chan, lower_32_bits(src_addr));
- OUT_RING (chan, upper_32_bits(dst_addr));
- OUT_RING (chan, lower_32_bits(dst_addr));
- OUT_RING (chan, PAGE_SIZE);
- OUT_RING (chan, PAGE_SIZE);
- OUT_RING (chan, PAGE_SIZE);
- OUT_RING (chan, npages);
+ BEGIN_NVC0(chan, NvSubCopy, 0x0400, 4);
+ OUT_RING(chan, upper_32_bits(src_addr));
+ OUT_RING(chan, lower_32_bits(src_addr));
+ OUT_RING(chan, upper_32_bits(dst_addr));
+ OUT_RING(chan, lower_32_bits(dst_addr));
+ BEGIN_NVC0(chan, NvSubCopy, 0x0418, 1);
+ OUT_RING(chan, length);
BEGIN_NVC0(chan, NvSubCopy, 0x0300, 1);
- OUT_RING (chan, launch_dma);
+ OUT_RING(chan, launch_dma);
return 0;
}

@@ -535,6 +586,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
struct device *dev = drm->dev->dev;
struct page *dpage, *spage;
unsigned long paddr;
+ unsigned long dst;

spage = migrate_pfn_to_page(src);
if (!(src & MIGRATE_PFN_MIGRATE))
@@ -546,7 +598,8 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
goto done;
}

- dpage = nouveau_dmem_page_alloc_locked(drm);
+ dpage = nouveau_dmem_page_alloc_locked(drm,
+ src & MIGRATE_PFN_COMPOUND);
if (!dpage)
goto out;

@@ -556,7 +609,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
goto out_free_page;
- if (drm->dmem->migrate.copy_func(drm, 1,
+ if (drm->dmem->migrate.copy_func(drm, page_size(spage),
NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
goto out_dma_unmap;
} else {
@@ -571,10 +624,13 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
if (src & MIGRATE_PFN_WRITE)
*pfn |= NVIF_VMM_PFNMAP_V0_W;
- return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+ dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+ if (PageHead(dpage))
+ dst |= MIGRATE_PFN_COMPOUND;
+ return dst;

out_dma_unmap:
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(dev, *dma_addr, page_size(spage), DMA_BIDIRECTIONAL);
out_free_page:
nouveau_dmem_page_free_locked(drm, dpage);
out:
@@ -588,24 +644,30 @@ static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
{
struct nouveau_fence *fence;
unsigned long addr = args->start, nr_dma = 0, i;
+ unsigned int page_shift = PAGE_SHIFT;

for (i = 0; addr < args->end; i++) {
args->dst[i] = nouveau_dmem_migrate_copy_one(drm, args->src[i],
dma_addrs + nr_dma, pfns + i);
if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
nr_dma++;
+ if (args->dst[i] & MIGRATE_PFN_COMPOUND) {
+ page_shift = PMD_SHIFT;
+ i++;
+ break;
+ }
addr += PAGE_SIZE;
}

nouveau_fence_new(drm->dmem->migrate.chan, false, &fence);
migrate_vma_pages(args);
nouveau_dmem_fence_done(&fence);
- nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+ nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i,
+ page_shift);

- while (nr_dma--) {
- dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
- DMA_BIDIRECTIONAL);
- }
+ while (nr_dma)
+ dma_unmap_page(drm->dev->dev, dma_addrs[--nr_dma],
+ 1UL << page_shift, DMA_BIDIRECTIONAL);
migrate_vma_finalize(args);
}

@@ -617,7 +679,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
unsigned long end)
{
unsigned long npages = (end - start) >> PAGE_SHIFT;
- unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
+ unsigned long max = min(1UL << PMD_ORDER, npages);
dma_addr_t *dma_addrs;
struct migrate_vma args = {
.vma = vma,
@@ -646,8 +708,10 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
if (!pfns)
goto out_free_dma;

- for (i = 0; i < npages; i += max) {
- args.end = start + (max << PAGE_SHIFT);
+ for (; args.start < end; args.start = args.end) {
+ args.end = ALIGN(args.start, PMD_SIZE);
+ if (args.start == args.end)
+ args.end = min(end, args.start + PMD_SIZE);
ret = migrate_vma_setup(&args);
if (ret)
goto out_free_pfns;
@@ -655,7 +719,6 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
if (args.cpages)
nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
pfns);
- args.start = args.end;
}

ret = 0;
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a27625f3c5f9..f386a9318190 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -684,7 +684,6 @@ nouveau_svm_fault(struct nvif_notify *notify)
nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
continue;
}
- SVMM_DBG(svmm, "addr %016llx", buffer->fault[fi]->addr);

/* We try and group handling of faults within a small
* window into a single update.
@@ -736,6 +735,10 @@ nouveau_svm_fault(struct nvif_notify *notify)
}
mmput(mm);

+ SVMM_DBG(svmm, "addr %llx %s %c", buffer->fault[fi]->addr,
+ args.phys[0] & NVIF_VMM_PFNMAP_V0_VRAM ?
+ "vram" : "sysmem",
+ args.i.p.size > PAGE_SIZE ? 'H' : 'N');
limit = args.i.p.addr + args.i.p.size;
for (fn = fi; ++fn < buffer->fault_nr; ) {
/* It's okay to skip over duplicate addresses from the
@@ -807,13 +810,15 @@ nouveau_pfns_free(u64 *pfns)

void
nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages)
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift)
{
struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
int ret;

args->p.addr = addr;
- args->p.size = npages << PAGE_SHIFT;
+ args->p.page = page_shift;
+ args->p.size = npages << args->p.page;

mutex_lock(&svmm->mutex);

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index f0fcd1b72e8b..ba5927e445ad 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -22,7 +22,8 @@ int nouveau_svmm_bind(struct drm_device *, void *, struct drm_file *);
u64 *nouveau_pfns_alloc(unsigned long npages);
void nouveau_pfns_free(u64 *pfns);
void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages);
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift);
#else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
--
2.20.1

2020-06-20 04:50:56

by Ralph Campbell

Subject: [PATCH 13/16] mm: support THP migration to device private memory

Support transparent huge page migration to ZONE_DEVICE private memory.
A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
indicate that the huge page was fully mapped by the CPU.
Export prep_compound_page() so that device drivers can create huge
device private pages after calling memremap_pages().

Signed-off-by: Ralph Campbell <[email protected]>
---
include/linux/migrate.h | 1 +
include/linux/mm.h | 1 +
mm/huge_memory.c | 30 ++++--
mm/internal.h | 1 -
mm/memory.c | 10 +-
mm/memremap.c | 9 +-
mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
mm/page_alloc.c | 1 +
8 files changed, 226 insertions(+), 53 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3e546cbf03dd..f6a64965c8bd 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
#define MIGRATE_PFN_MIGRATE (1UL << 1)
#define MIGRATE_PFN_LOCKED (1UL << 2)
#define MIGRATE_PFN_WRITE (1UL << 3)
+#define MIGRATE_PFN_COMPOUND (1UL << 4)
#define MIGRATE_PFN_SHIFT 6

static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..020b9dd3cddb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
}

void free_compound_page(struct page *page);
+void prep_compound_page(struct page *page, unsigned int order);

#ifdef CONFIG_MMU
/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84bee7e29..25d95f7b1e98 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
} else {
struct page *page = NULL;
int flush_needed = 1;
+ bool is_anon = false;

if (pmd_present(orig_pmd)) {
page = pmd_page(orig_pmd);
+ is_anon = PageAnon(page);
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
swp_entry_t entry;

- VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
entry = pmd_to_swp_entry(orig_pmd);
- page = pfn_to_page(swp_offset(entry));
+ if (is_device_private_entry(entry)) {
+ page = device_private_entry_to_page(entry);
+ is_anon = PageAnon(page);
+ page_remove_rmap(page, true);
+ VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ put_page(page);
+ } else {
+ VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
+ page = pfn_to_page(swp_offset(entry));
+ is_anon = PageAnon(page);
+ }
flush_needed = 0;
} else
WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");

- if (PageAnon(page)) {
+ if (is_anon) {
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
} else {
@@ -1778,8 +1790,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
/*
* Returns
* - 0 if PMD could not be locked
- * - 1 if PMD was locked but protections unchange and TLB flush unnecessary
- * - HPAGE_PMD_NR is protections changed and TLB flush necessary
+ * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ * - HPAGE_PMD_NR if protections changed and TLB flush necessary
*/
int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
@@ -2689,7 +2701,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
if (!list_empty(page_deferred_list(head))) {
ds_queue->split_queue_len--;
- list_del(page_deferred_list(head));
+ list_del_init(page_deferred_list(head));
}
spin_unlock(&ds_queue->split_queue_lock);
if (mapping) {
@@ -2743,7 +2755,7 @@ void free_transhuge_page(struct page *page)
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (!list_empty(page_deferred_list(page))) {
ds_queue->split_queue_len--;
- list_del(page_deferred_list(page));
+ list_del_init(page_deferred_list(page));
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
free_compound_page(page);
@@ -2963,6 +2975,10 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mksoft_dirty(pmde);
if (is_write_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
+ if (unlikely(is_device_private_page(new))) {
+ entry = make_device_private_entry(new, pmd_write(pmde));
+ pmde = swp_entry_to_pmd(entry);
+ }

flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
if (PageAnon(new))
diff --git a/mm/internal.h b/mm/internal.h
index 9886db20d94f..58f051a14dae 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -196,7 +196,6 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
extern void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order);
extern void __free_pages_core(struct page *page, unsigned int order);
-extern void prep_compound_page(struct page *page, unsigned int order);
extern void post_alloc_hook(struct page *page, unsigned int order,
gfp_t gfp_flags);
extern int user_min_free_kbytes;
diff --git a/mm/memory.c b/mm/memory.c
index dc7f3543b1fd..4e03d1a2ead5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4322,9 +4322,15 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,

barrier();
if (unlikely(is_swap_pmd(orig_pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+ if (is_device_private_entry(entry)) {
+ vmf.page = device_private_entry_to_page(entry);
+ return vmf.page->pgmap->ops->migrate_to_ram(&vmf);
+ }
VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(orig_pmd));
- if (is_pmd_migration_entry(orig_pmd))
+ !is_migration_entry(entry));
+ if (is_migration_entry(entry))
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 03e38b7a38f1..4231054188b4 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -132,8 +132,13 @@ void memunmap_pages(struct dev_pagemap *pgmap)
int nid;

dev_pagemap_kill(pgmap);
- for_each_device_pfn(pfn, pgmap)
- put_page(pfn_to_page(pfn));
+ for_each_device_pfn(pfn, pgmap) {
+ struct page *page = pfn_to_page(pfn);
+ unsigned int order = compound_order(page);
+
+ put_page(page);
+ pfn += (1U << order) - 1;
+ }
dev_pagemap_cleanup(pgmap);

/* make sure to access a memmap that was actually initialized */
diff --git a/mm/migrate.c b/mm/migrate.c
index 87c52e0ee580..1536677cefc9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -51,6 +51,7 @@
#include <linux/oom.h>

#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>

#define CREATE_TRACE_POINTS
#include <trace/events/migrate.h>
@@ -2185,6 +2186,8 @@ static int migrate_vma_collect_hole(unsigned long start,

for (addr = start; addr < end; addr += PAGE_SIZE) {
migrate->src[migrate->npages] = flags;
+ if ((addr & ~PMD_MASK) == 0 && (end & ~PMD_MASK) == 0)
+ migrate->src[migrate->npages] |= MIGRATE_PFN_COMPOUND;
migrate->dst[migrate->npages] = 0;
migrate->npages++;
migrate->cpages++;
@@ -2219,48 +2222,87 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
unsigned long addr = start, unmapped = 0;
spinlock_t *ptl;
pte_t *ptep;
+ pmd_t pmd;

again:
- if (pmd_none(*pmdp))
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
return migrate_vma_collect_hole(start, end, -1, walk);

- if (pmd_trans_huge(*pmdp)) {
+ if (pmd_trans_huge(pmd) || !pmd_present(pmd)) {
struct page *page;
+ unsigned long write = 0;
+ int ret;

ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_trans_huge(*pmdp))) {
- spin_unlock(ptl);
- goto again;
- }
+ if (pmd_trans_huge(*pmdp)) {
+ page = pmd_page(*pmdp);
+ if (is_huge_zero_page(page)) {
+ spin_unlock(ptl);
+ return migrate_vma_collect_hole(start, end, -1,
+ walk);
+ }
+ if (pmd_write(*pmdp))
+ write = MIGRATE_PFN_WRITE;
+ } else if (!pmd_present(*pmdp)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pmdp);

- page = pmd_page(*pmdp);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- split_huge_pmd(vma, pmdp, addr);
- if (pmd_trans_unstable(pmdp))
+ if (!is_device_private_entry(entry)) {
+ spin_unlock(ptl);
return migrate_vma_collect_skip(start, end,
walk);
+ }
+ page = device_private_entry_to_page(entry);
+ if (is_write_device_private_entry(entry))
+ write = MIGRATE_PFN_WRITE;
} else {
- int ret;
+ spin_unlock(ptl);
+ goto again;
+ }

- get_page(page);
+ get_page(page);
+ if (unlikely(!trylock_page(page))) {
spin_unlock(ptl);
- if (unlikely(!trylock_page(page)))
- return migrate_vma_collect_skip(start, end,
- walk);
- ret = split_huge_page(page);
- unlock_page(page);
put_page(page);
- if (ret)
- return migrate_vma_collect_skip(start, end,
- walk);
- if (pmd_none(*pmdp))
- return migrate_vma_collect_hole(start, end, -1,
- walk);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if ((start & HPAGE_PMD_MASK) == start &&
+ start + HPAGE_PMD_SIZE == end) {
+ struct page_vma_mapped_walk vmw = {
+ .vma = vma,
+ .address = start,
+ .pmd = pmdp,
+ .ptl = ptl,
+ };
+
+ migrate->src[migrate->npages] = write |
+ migrate_pfn(page_to_pfn(page)) |
+ MIGRATE_PFN_MIGRATE | MIGRATE_PFN_LOCKED |
+ MIGRATE_PFN_COMPOUND;
+ migrate->dst[migrate->npages] = 0;
+ migrate->npages++;
+ migrate->cpages++;
+ migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+
+ /* Note that this also removes the page from the rmap. */
+ set_pmd_migration_entry(&vmw, page);
+ spin_unlock(ptl);
+
+ return 0;
}
+#endif
+ spin_unlock(ptl);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (ret)
+ return migrate_vma_collect_skip(start, end, walk);
+ if (pmd_none(*pmdp))
+ return migrate_vma_collect_hole(start, end, -1, walk);
}

- if (unlikely(pmd_bad(*pmdp)))
+ if (unlikely(pmd_bad(pmd) || pmd_devmap(pmd)))
return migrate_vma_collect_skip(start, end, walk);

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -2310,8 +2352,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}

- /* FIXME support THP */
- if (!page || !page->mapping || PageTransCompound(page)) {
+ if (!page || !page->mapping) {
mpfn = 0;
goto next;
}
@@ -2420,14 +2461,6 @@ static bool migrate_vma_check_page(struct page *page)
*/
int extra = 1;

- /*
- * FIXME support THP (transparent huge page), it is bit more complex to
- * check them than regular pages, because they can be mapped with a pmd
- * or with a pte (split pte mapping).
- */
- if (PageCompound(page))
- return false;
-
/* Page from ZONE_DEVICE have one extra reference */
if (is_zone_device_page(page)) {
/*
@@ -2726,13 +2759,115 @@ int migrate_vma_setup(struct migrate_vma *args)
}
EXPORT_SYMBOL(migrate_vma_setup);

+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/*
+ * This code closely follows:
+ * do_huge_pmd_anonymous_page()
+ * __do_huge_pmd_anonymous_page()
+ * except that the page being inserted is likely to be a device private page
+ * instead of an allocated or zero page.
+ */
+static int insert_huge_pmd_anonymous_page(struct vm_area_struct *vma,
+ unsigned long haddr,
+ struct page *page,
+ unsigned long *src,
+ unsigned long *dst,
+ pmd_t *pmdp)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned int i;
+ spinlock_t *ptl;
+ bool flush = false;
+ pgtable_t pgtable;
+ gfp_t gfp;
+ pmd_t entry;
+
+ if (WARN_ON_ONCE(compound_order(page) != HPAGE_PMD_ORDER))
+ goto abort;
+
+ if (unlikely(anon_vma_prepare(vma)))
+ goto abort;
+
+ prep_transhuge_page(page);
+
+ gfp = GFP_TRANSHUGE_LIGHT;
+ if (mem_cgroup_charge(page, mm, gfp))
+ goto abort;
+
+ pgtable = pte_alloc_one(mm);
+ if (unlikely(!pgtable))
+ goto abort;
+
+ __SetPageUptodate(page);
+
+ if (is_zone_device_page(page)) {
+ if (!is_device_private_page(page))
+ goto pgtable_abort;
+ entry = swp_entry_to_pmd(make_device_private_entry(page,
+ vma->vm_flags & VM_WRITE));
+ } else {
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ }
+
+ ptl = pmd_lock(mm, pmdp);
+
+ if (check_stable_address_space(mm))
+ goto unlock_abort;
+
+ /*
+ * Check for userfaultfd but do not deliver the fault. Instead,
+ * just back off.
+ */
+ if (userfaultfd_missing(vma))
+ goto unlock_abort;
+
+ if (pmd_present(*pmdp)) {
+ if (!is_huge_zero_pmd(*pmdp))
+ goto unlock_abort;
+ flush = true;
+ } else if (!pmd_none(*pmdp))
+ goto unlock_abort;
+
+ get_page(page);
+ page_add_new_anon_rmap(page, vma, haddr, true);
+ if (!is_device_private_page(page))
+ lru_cache_add_active_or_unevictable(page, vma);
+ if (flush) {
+ pte_free(mm, pgtable);
+ flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+ pmdp_invalidate(vma, haddr, pmdp);
+ } else {
+ pgtable_trans_huge_deposit(mm, pmdp, pgtable);
+ mm_inc_nr_ptes(mm);
+ }
+ set_pmd_at(mm, haddr, pmdp, entry);
+ update_mmu_cache_pmd(vma, haddr, pmdp);
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ spin_unlock(ptl);
+ count_vm_event(THP_FAULT_ALLOC);
+ count_memcg_event_mm(mm, THP_FAULT_ALLOC);
+
+ return 0;
+
+unlock_abort:
+ spin_unlock(ptl);
+pgtable_abort:
+ pte_free(mm, pgtable);
+abort:
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ src[i] &= ~MIGRATE_PFN_MIGRATE;
+ return -EINVAL;
+}
+#endif
+
/*
* This code closely matches the code in:
* __handle_mm_fault()
* handle_pte_fault()
* do_anonymous_page()
- * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * to map in an anonymous zero page except the struct page is already allocated
+ * and will likely be a ZONE_DEVICE private page.
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -2766,7 +2901,16 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
if (!pmdp)
goto abort;

- if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (*dst & MIGRATE_PFN_COMPOUND) {
+ int ret = insert_huge_pmd_anonymous_page(vma, addr, page, src,
+ dst, pmdp);
+ if (ret)
+ goto abort;
+ return;
+ }
+#endif
+ if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp) || pmd_bad(*pmdp))
goto abort;

/*
@@ -2804,7 +2948,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
entry = swp_entry_to_pte(swp_entry);
- }
+ } else
+ goto abort;
} else {
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
@@ -2833,10 +2978,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
goto unlock_abort;

inc_mm_counter(mm, MM_ANONPAGES);
+ get_page(page);
page_add_new_anon_rmap(page, vma, addr, false);
if (!is_zone_device_page(page))
lru_cache_add_active_or_unevictable(page, vma);
- get_page(page);

if (flush) {
flush_cache_page(vma, addr, pte_pfn(*ptep));
@@ -2850,7 +2995,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}

pte_unmap_unlock(ptep, ptl);
- *src = MIGRATE_PFN_MIGRATE;
return;

unlock_abort:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..a852ed2f204c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -686,6 +686,7 @@ void prep_compound_page(struct page *page, unsigned int order)
if (hpage_pincount_available(page))
atomic_set(compound_pincount_ptr(page), 0);
}
+EXPORT_SYMBOL_GPL(prep_compound_page);

#ifdef CONFIG_DEBUG_PAGEALLOC
unsigned int _debug_guardpage_minorder;
--
2.20.1

2020-06-20 04:50:59

by Ralph Campbell

Subject: [PATCH 14/16] mm/thp: add THP allocation helper

Transparent huge page allocation policy is controlled by several sysfs
variables. Rather than expose these to each device driver that needs to
allocate THPs, provide a helper function.

Signed-off-by: Ralph Campbell <[email protected]>
---
include/linux/gfp.h | 10 ++++++++++
mm/huge_memory.c | 16 ++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..1c7d968a27d3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -562,6 +562,16 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
+ unsigned long addr);
+#else
+static inline struct page *alloc_transhugepage(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ return NULL;
+}
+#endif

extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 25d95f7b1e98..f749633ed350 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -775,6 +775,22 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
return __do_huge_pmd_anonymous_page(vmf, page, gfp);
}

+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+struct page *alloc_transhugepage(struct vm_area_struct *vma,
+ unsigned long haddr)
+{
+ gfp_t gfp;
+ struct page *page;
+
+ gfp = alloc_hugepage_direct_gfpmask(vma);
+ page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+ if (page)
+ prep_transhuge_page(page);
+ return page;
+}
+EXPORT_SYMBOL_GPL(alloc_transhugepage);
+#endif
+
static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
pgtable_t pgtable)
--
2.20.1

2020-06-20 04:51:03

by Ralph Campbell

Subject: [PATCH 01/16] mm: fix migrate_vma_setup() src_owner and normal pages

The caller of migrate_vma_setup() does not know what type of page is
stored in the CPU's page tables. Pages within the specified range are
free to be swapped out, migrated, or freed until after migrate_vma_setup()
returns. The caller needs to set struct migrate_vma.src_owner in case a
page is a ZONE_DEVICE private page that the device owns and might want to
migrate. However, the current code skips normal anonymous pages if
src_owner is set, preventing those pages from being migrated.
Remove the src_owner check for normal pages since src_owner only applies
to device private pages, allowing a range of mixed normal and device
private pages to be migrated.

Fixes: 800bb1c8dc80 ("mm: handle multiple owners of device private pages in migrate_vma")
Signed-off-by: Ralph Campbell <[email protected]>
---
mm/migrate.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f37729673558..24535281cea3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2295,8 +2295,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_write_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
- if (migrate->src_owner)
- goto next;
pfn = pte_pfn(pte);
if (is_zero_pfn(pfn)) {
mpfn = MIGRATE_PFN_MIGRATE;
--
2.20.1

2020-06-20 04:51:44

by Ralph Campbell

Subject: [PATCH 04/16] mm/hmm: fix test timeout on slower machines

The HMM self test "migrate_multiple" can time out on slower machines.
Lower the number of loop iterations to fix this.

Signed-off-by: Ralph Campbell <[email protected]>
---
tools/testing/selftests/vm/hmm-tests.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 79db22604019..bdfa95ac9a7d 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -45,7 +45,7 @@ struct hmm_buffer {
#define TWOMEG (1 << 21)
#define HMM_BUFFER_SIZE (1024 << 12)
#define HMM_PATH_MAX 64
-#define NTIMES 256
+#define NTIMES 50

#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))

--
2.20.1

2020-06-20 04:52:01

by Ralph Campbell

Subject: [PATCH 03/16] nouveau: fix mixed normal and device private page migration

The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
migrate memory in the given address range to device private memory. The
source pages might already have been migrated to device private memory.
In that case, the source struct page is not checked to see whether it is
a device private page, so the GPU's physical address of local memory is
computed incorrectly, leading to data corruption.
Fix this by checking the source struct page and computing the correct
physical address.

Signed-off-by: Ralph Campbell <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index cc9993837508..f6a806ba3caa 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
if (!(src & MIGRATE_PFN_MIGRATE))
goto out;

+ if (spage && is_device_private_page(spage)) {
+ paddr = nouveau_dmem_page_addr(spage);
+ *dma_addr = DMA_MAPPING_ERROR;
+ goto done;
+ }
+
dpage = nouveau_dmem_page_alloc_locked(drm);
if (!dpage)
goto out;
@@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
goto out_free_page;
}

+done:
*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
if (src & MIGRATE_PFN_WRITE)
@@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
struct migrate_vma args = {
.vma = vma,
.start = start,
+ .src_owner = drm->dev,
};
unsigned long i;
u64 *pfns;
--
2.20.1

2020-06-20 04:52:23

by Ralph Campbell

Subject: [PATCH 11/16] hmm: add tests for HMM_PFN_COMPOUND flag

Add some sanity tests for hmm_range_fault() returning the HMM_PFN_COMPOUND
flag.

Signed-off-by: Ralph Campbell <[email protected]>
---
lib/test_hmm.c | 2 +
lib/test_hmm_uapi.h | 2 +
tools/testing/selftests/vm/hmm-tests.c | 76 ++++++++++++++++++++++++++
3 files changed, 80 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 50bdf041770a..db5d2e8d7420 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -779,6 +779,8 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
*perm |= HMM_DMIRROR_PROT_WRITE;
else
*perm |= HMM_DMIRROR_PROT_READ;
+ if (entry & HMM_PFN_COMPOUND)
+ *perm |= HMM_DMIRROR_PROT_COMPOUND;
}

static bool dmirror_snapshot_invalidate(struct mmu_interval_notifier *mni,
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 67b3b2e6ff5d..21cf4da6f020 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -40,6 +40,7 @@ struct hmm_dmirror_cmd {
* HMM_DMIRROR_PROT_NONE: unpopulated PTE or PTE with no access
* HMM_DMIRROR_PROT_READ: read-only PTE
* HMM_DMIRROR_PROT_WRITE: read/write PTE
+ * HMM_DMIRROR_PROT_COMPOUND: compound page is fully mapped by same permissions
* HMM_DMIRROR_PROT_ZERO: special read-only zero page
* HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL: Migrated device private page on the
* device the ioctl() is made
@@ -51,6 +52,7 @@ enum {
HMM_DMIRROR_PROT_NONE = 0x00,
HMM_DMIRROR_PROT_READ = 0x01,
HMM_DMIRROR_PROT_WRITE = 0x02,
+ HMM_DMIRROR_PROT_COMPOUND = 0x04,
HMM_DMIRROR_PROT_ZERO = 0x10,
HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index e2a36783e99d..e0fa864d03fa 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1301,6 +1301,82 @@ TEST_F(hmm2, snapshot)
hmm_buffer_free(buffer);
}

+/*
+ * Test the hmm_range_fault() HMM_PFN_COMPOUND flag for large pages that
+ * should be mapped by a large page table entry.
+ */
+TEST_F(hmm, compound)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ int *ptr;
+ unsigned char *m;
+ int ret;
+ long pagesizes[4];
+ int n, idx;
+ unsigned long i;
+
+ /* Skip test if we can't allocate a hugetlbfs page. */
+
+ n = gethugepagesizes(pagesizes, 4);
+ if (n <= 0)
+ return;
+ for (idx = 0; --n > 0; ) {
+ if (pagesizes[n] < pagesizes[idx])
+ idx = n;
+ }
+ size = ALIGN(TWOMEG, pagesizes[idx]);
+ npages = size >> self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->ptr = get_hugepage_region(size, GHR_STRICT);
+ if (buffer->ptr == NULL) {
+ free(buffer);
+ return;
+ }
+
+ buffer->size = size;
+ buffer->mirror = malloc(npages);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ /* Initialize the pages the device will snapshot in buffer->ptr. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Simulate a device snapshotting CPU pagetables. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device saw. */
+ m = buffer->mirror;
+ for (i = 0; i < npages; ++i)
+ ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE |
+ HMM_DMIRROR_PROT_COMPOUND);
+
+ /* Make the region read-only. */
+ ret = mprotect(buffer->ptr, size, PROT_READ);
+ ASSERT_EQ(ret, 0);
+
+ /* Simulate a device snapshotting CPU pagetables. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device saw. */
+ m = buffer->mirror;
+ for (i = 0; i < npages; ++i)
+ ASSERT_EQ(m[i], HMM_DMIRROR_PROT_READ |
+ HMM_DMIRROR_PROT_COMPOUND);
+
+ free_hugepage_region(buffer->ptr);
+ buffer->ptr = NULL;
+ hmm_buffer_free(buffer);
+}
+
/*
* Test two devices reading the same memory (double mapped).
*/
--
2.20.1

2020-06-20 04:52:52

by Ralph Campbell

Subject: [PATCH 10/16] nouveau/hmm: support mapping large sysmem pages

Nouveau currently only supports mapping PAGE_SIZE-sized pages of system
memory when shared virtual memory (SVM) is enabled. Use the new
HMM_PFN_COMPOUND flag that hmm_range_fault() returns to support mapping
system memory pages larger than PAGE_SIZE.

Signed-off-by: Ralph Campbell <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_svm.c | 47 ++++++++++++++-----
.../gpu/drm/nouveau/nvkm/subdev/mmu/base.c | 4 ++
.../gpu/drm/nouveau/nvkm/subdev/mmu/priv.h | 2 +
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c | 8 ++--
.../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 29 ++++++++----
5 files changed, 66 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 665dede69bd1..a27625f3c5f9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -514,38 +514,51 @@ static const struct mmu_interval_notifier_ops nouveau_svm_mni_ops = {
};

static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm,
- struct hmm_range *range, u64 *ioctl_addr)
+ struct hmm_range *range,
+ struct nouveau_pfnmap_args *args)
{
struct page *page;

/*
- * The ioctl_addr prepared here is passed through nvif_object_ioctl()
+ * The address prepared here is passed through nvif_object_ioctl()
* to an eventual DMA map in something like gp100_vmm_pgt_pfn()
*
* This is all just encoding the internal hmm representation into a
* different nouveau internal representation.
*/
if (!(range->hmm_pfns[0] & HMM_PFN_VALID)) {
- ioctl_addr[0] = 0;
+ args->p.phys[0] = 0;
return;
}

page = hmm_pfn_to_page(range->hmm_pfns[0]);
+ /*
+ * Only map compound pages to the GPU if the CPU is also mapping the
+ * page as a compound page. Otherwise, the PTE protections might not be
+ * consistent (e.g., CPU only maps part of a compound page).
+ */
+ if (range->hmm_pfns[0] & HMM_PFN_COMPOUND) {
+ page = compound_head(page);
+ args->p.page = page_shift(page);
+ args->p.size = 1UL << args->p.page;
+ args->p.addr &= ~(args->p.size - 1);
+ }
if (is_device_private_page(page))
- ioctl_addr[0] = nouveau_dmem_page_addr(page) |
+ args->p.phys[0] = nouveau_dmem_page_addr(page) |
NVIF_VMM_PFNMAP_V0_V |
NVIF_VMM_PFNMAP_V0_VRAM;
else
- ioctl_addr[0] = page_to_phys(page) |
+ args->p.phys[0] = page_to_phys(page) |
NVIF_VMM_PFNMAP_V0_V |
NVIF_VMM_PFNMAP_V0_HOST;
if (range->hmm_pfns[0] & HMM_PFN_WRITE)
- ioctl_addr[0] |= NVIF_VMM_PFNMAP_V0_W;
+ args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
}

static int nouveau_range_fault(struct nouveau_svmm *svmm,
- struct nouveau_drm *drm, void *data, u32 size,
- u64 *ioctl_addr, unsigned long hmm_flags,
+ struct nouveau_drm *drm,
+ struct nouveau_pfnmap_args *args, u32 size,
+ unsigned long hmm_flags,
struct svm_notifier *notifier)
{
unsigned long timeout =
@@ -585,10 +598,10 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
break;
}

- nouveau_hmm_convert_pfn(drm, &range, ioctl_addr);
+ nouveau_hmm_convert_pfn(drm, &range, args);

svmm->vmm->vmm.object.client->super = true;
- ret = nvif_object_ioctl(&svmm->vmm->vmm.object, data, size, NULL);
+ ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL);
svmm->vmm->vmm.object.client->super = false;
mutex_unlock(&svmm->mutex);

@@ -717,12 +730,13 @@ nouveau_svm_fault(struct nvif_notify *notify)
args.i.p.addr, args.i.p.size,
&nouveau_svm_mni_ops);
if (!ret) {
- ret = nouveau_range_fault(svmm, svm->drm, &args,
- sizeof(args), args.phys, hmm_flags, &notifier);
+ ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+ sizeof(args), hmm_flags, &notifier);
mmu_interval_notifier_remove(&notifier.notifier);
}
mmput(mm);

+ limit = args.i.p.addr + args.i.p.size;
for (fn = fi; ++fn < buffer->fault_nr; ) {
/* It's okay to skip over duplicate addresses from the
* same SVMM as faults are ordered by access type such
@@ -730,9 +744,16 @@ nouveau_svm_fault(struct nvif_notify *notify)
*
* ie. WRITE faults appear first, thus any handling of
* pending READ faults will already be satisfied.
+ * But if a large page is mapped, make sure subsequent
+ * fault addresses have sufficient access permission.
*/
if (buffer->fault[fn]->svmm != svmm ||
- buffer->fault[fn]->addr >= limit)
+ buffer->fault[fn]->addr >= limit ||
+ (buffer->fault[fi]->access == 0 /* READ. */ &&
+ !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) ||
+ (buffer->fault[fi]->access != 0 /* READ. */ &&
+ buffer->fault[fi]->access != 3 /* PREFETCH. */ &&
+ !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)))
break;
}

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
index de91e9a26172..ecea365d72ad 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
@@ -94,6 +94,8 @@ nvkm_mmu_ptp_get(struct nvkm_mmu *mmu, u32 size, bool zero)
}
pt->ptp = ptp;
pt->sub = true;
+ pt->ptei_shift = 3;
+ pt->page_shift = PAGE_SHIFT;

/* Sub-allocate from parent object, removing PTP from cache
* if there's no more free slots left.
@@ -203,6 +205,8 @@ nvkm_mmu_ptc_get(struct nvkm_mmu *mmu, u32 size, u32 align, bool zero)
return NULL;
pt->ptc = ptc;
pt->sub = false;
+ pt->ptei_shift = 3;
+ pt->page_shift = PAGE_SHIFT;

ret = nvkm_memory_new(mmu->subdev.device, NVKM_MEM_TARGET_INST,
size, align, zero, &pt->memory);
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/priv.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/priv.h
index 479b02344271..f2162bb35bea 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/priv.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/priv.h
@@ -55,6 +55,8 @@ struct nvkm_mmu_pt {
struct nvkm_memory *memory;
bool sub;
u16 base;
+ u8 ptei_shift;
+ u8 page_shift;
u64 addr;
struct list_head head;
};
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
index 67b00dcef4b8..c7581f4f313e 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
@@ -562,6 +562,9 @@ nvkm_vmm_iter(struct nvkm_vmm *vmm, const struct nvkm_vmm_page *page,
/* Handle PTE updates. */
if (!REF_PTES || REF_PTES(&it, pfn, ptei, ptes)) {
struct nvkm_mmu_pt *pt = pgt->pt[type];
+
+ pt->page_shift = page->shift;
+ pt->ptei_shift = ilog2(desc->size);
if (MAP_PTES || CLR_PTES) {
if (MAP_PTES)
MAP_PTES(vmm, pt, ptei, ptes, map);
@@ -1204,7 +1207,6 @@ nvkm_vmm_pfn_unmap(struct nvkm_vmm *vmm, u64 addr, u64 size)
/*TODO:
* - Avoid PT readback (for dma_unmap etc), this might end up being dealt
* with inside HMM, which would be a lot nicer for us to deal with.
- * - Multiple page sizes (particularly for huge page support).
* - Support for systems without a 4KiB page size.
*/
int
@@ -1220,8 +1222,8 @@ nvkm_vmm_pfn_map(struct nvkm_vmm *vmm, u8 shift, u64 addr, u64 size, u64 *pfn)
/* Only support mapping where the page size of the incoming page
* array matches a page size available for direct mapping.
*/
- while (page->shift && page->shift != shift &&
- page->desc->func->pfn == NULL)
+ while (page->shift && (page->shift != shift ||
+ page->desc->func->pfn == NULL))
page++;

if (!page->shift || !IS_ALIGNED(addr, 1ULL << shift) ||
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
index d86287565542..94507cb2cf75 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
@@ -39,12 +39,15 @@ gp100_vmm_pfn_unmap(struct nvkm_vmm *vmm,

nvkm_kmap(pt->memory);
while (ptes--) {
- u32 datalo = nvkm_ro32(pt->memory, pt->base + ptei * 8 + 0);
- u32 datahi = nvkm_ro32(pt->memory, pt->base + ptei * 8 + 4);
+ u32 datalo = nvkm_ro32(pt->memory,
+ pt->base + (ptei << pt->ptei_shift) + 0);
+ u32 datahi = nvkm_ro32(pt->memory,
+ pt->base + (ptei << pt->ptei_shift) + 4);
u64 data = (u64)datahi << 32 | datalo;
if ((data & (3ULL << 1)) != 0) {
addr = (data >> 8) << 12;
- dma_unmap_page(dev, addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(dev, addr, 1UL << pt->page_shift,
+ DMA_BIDIRECTIONAL);
}
ptei++;
}
@@ -58,11 +61,14 @@ gp100_vmm_pfn_clear(struct nvkm_vmm *vmm,
bool dma = false;
nvkm_kmap(pt->memory);
while (ptes--) {
- u32 datalo = nvkm_ro32(pt->memory, pt->base + ptei * 8 + 0);
- u32 datahi = nvkm_ro32(pt->memory, pt->base + ptei * 8 + 4);
+ u32 datalo = nvkm_ro32(pt->memory,
+ pt->base + (ptei << pt->ptei_shift) + 0);
+ u32 datahi = nvkm_ro32(pt->memory,
+ pt->base + (ptei << pt->ptei_shift) + 4);
u64 data = (u64)datahi << 32 | datalo;
if ((data & BIT_ULL(0)) && (data & (3ULL << 1)) != 0) {
- VMM_WO064(pt, vmm, ptei * 8, data & ~BIT_ULL(0));
+ VMM_WO064(pt, vmm, ptei << pt->ptei_shift,
+ data & ~BIT_ULL(0));
dma = true;
}
ptei++;
@@ -87,7 +93,8 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
addr = dma_map_page(dev, pfn_to_page(addr), 0,
- PAGE_SIZE, DMA_BIDIRECTIONAL);
+ 1UL << pt->page_shift,
+ DMA_BIDIRECTIONAL);
if (!WARN_ON(dma_mapping_error(dev, addr))) {
data |= addr >> 4;
data |= 2ULL << 1; /* SYSTEM_COHERENT_MEMORY. */
@@ -99,7 +106,7 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
data |= BIT_ULL(0); /* VALID. */
}

- VMM_WO064(pt, vmm, ptei++ * 8, data);
+ VMM_WO064(pt, vmm, ptei++ << pt->ptei_shift, data);
map->pfn++;
}
nvkm_done(pt->memory);
@@ -264,6 +271,9 @@ gp100_vmm_desc_pd0 = {
.sparse = gp100_vmm_pd0_sparse,
.pde = gp100_vmm_pd0_pde,
.mem = gp100_vmm_pd0_mem,
+ .pfn = gp100_vmm_pgt_pfn,
+ .pfn_clear = gp100_vmm_pfn_clear,
+ .pfn_unmap = gp100_vmm_pfn_unmap,
};

static void
@@ -286,6 +296,9 @@ gp100_vmm_desc_pd1 = {
.unmap = gf100_vmm_pgt_unmap,
.sparse = gp100_vmm_pgt_sparse,
.pde = gp100_vmm_pd1_pde,
+ .pfn = gp100_vmm_pgt_pfn,
+ .pfn_clear = gp100_vmm_pfn_clear,
+ .pfn_unmap = gp100_vmm_pfn_unmap,
};

const struct nvkm_vmm_desc
--
2.20.1

2020-06-20 04:53:04

by Ralph Campbell

Subject: [PATCH 05/16] mm/hmm/test: remove redundant page table invalidate

When migrating pages to or from device private memory, the device's
page tables will be invalidated as part of migrate_vma_setup() locking
and isolating the pages. The HMM self test driver doesn't need to
invalidate the page table a second time after migrating pages to system
memory, so remove that bit of extra code.

Signed-off-by: Ralph Campbell <[email protected]>
---
lib/test_hmm.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 28528285942c..f7c2b51a7a9d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1018,15 +1018,6 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
return 0;
}

-static void dmirror_devmem_fault_finalize_and_map(struct migrate_vma *args,
- struct dmirror *dmirror)
-{
- /* Invalidate the device's page table mapping. */
- mutex_lock(&dmirror->mutex);
- dmirror_do_update(dmirror, args->start, args->end);
- mutex_unlock(&dmirror->mutex);
-}
-
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args;
@@ -1059,7 +1050,10 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
if (ret)
return ret;
migrate_vma_pages(&args);
- dmirror_devmem_fault_finalize_and_map(&args, dmirror);
+ /*
+ * No device finalize step is needed since migrate_vma_setup() will
+ * have already invalidated the device page table.
+ */
migrate_vma_finalize(&args);
return 0;
}
--
2.20.1

2020-06-21 23:39:16

by Zi Yan

Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

On 19 Jun 2020, at 17:56, Ralph Campbell wrote:

> Support transparent huge page migration to ZONE_DEVICE private memory.
> A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
> indicate the huge page was fully mapped by the CPU.
> Export prep_compound_page() so that device drivers can create huge
> device private pages after calling memremap_pages().
>
> Signed-off-by: Ralph Campbell <[email protected]>
> ---
> include/linux/migrate.h | 1 +
> include/linux/mm.h | 1 +
> mm/huge_memory.c | 30 ++++--
> mm/internal.h | 1 -
> mm/memory.c | 10 +-
> mm/memremap.c | 9 +-
> mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
> mm/page_alloc.c | 1 +
> 8 files changed, 226 insertions(+), 53 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3e546cbf03dd..f6a64965c8bd 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> #define MIGRATE_PFN_MIGRATE (1UL << 1)
> #define MIGRATE_PFN_LOCKED (1UL << 2)
> #define MIGRATE_PFN_WRITE (1UL << 3)
> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
> #define MIGRATE_PFN_SHIFT 6
>
> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dc7b87310c10..020b9dd3cddb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
> }
>
> void free_compound_page(struct page *page);
> +void prep_compound_page(struct page *page, unsigned int order);
>
> #ifdef CONFIG_MMU
> /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 78c84bee7e29..25d95f7b1e98 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> } else {
> struct page *page = NULL;
> int flush_needed = 1;
> + bool is_anon = false;
>
> if (pmd_present(orig_pmd)) {
> page = pmd_page(orig_pmd);
> + is_anon = PageAnon(page);
> page_remove_rmap(page, true);
> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> VM_BUG_ON_PAGE(!PageHead(page), page);
> } else if (thp_migration_supported()) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> - page = pfn_to_page(swp_offset(entry));
> + if (is_device_private_entry(entry)) {
> + page = device_private_entry_to_page(entry);
> + is_anon = PageAnon(page);
> + page_remove_rmap(page, true);
> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> + VM_BUG_ON_PAGE(!PageHead(page), page);
> + put_page(page);

Why do you hide this code behind thp_migration_supported()? It seems that you just need
a pmd swap entry, not a pmd migration entry. Also, the condition is not consistent with the
code in __handle_mm_fault(), in which you handle is_device_private_entry() directly without
checking thp_migration_supported().

Do we need to support split_huge_pmd() if a page is migrated to device? Any new code
needed in split_huge_pmd()?



> + } else {
> + VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> + page = pfn_to_page(swp_offset(entry));
> + is_anon = PageAnon(page);
> + }
> flush_needed = 0;
> } else
> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>
> - if (PageAnon(page)) {
> + if (is_anon) {
> zap_deposited_table(tlb->mm, pmd);
> add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> } else {
> @@ -1778,8 +1790,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> /*
> * Returns
> * - 0 if PMD could not be locked
> - * - 1 if PMD was locked but protections unchange and TLB flush unnecessary
> - * - HPAGE_PMD_NR is protections changed and TLB flush necessary
> + * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
> + * - HPAGE_PMD_NR if protections changed and TLB flush necessary
> */
> int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
> @@ -2689,7 +2701,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
> if (!list_empty(page_deferred_list(head))) {
> ds_queue->split_queue_len--;
> - list_del(page_deferred_list(head));
> + list_del_init(page_deferred_list(head));
> }
> spin_unlock(&ds_queue->split_queue_lock);
> if (mapping) {
> @@ -2743,7 +2755,7 @@ void free_transhuge_page(struct page *page)
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> if (!list_empty(page_deferred_list(page))) {
> ds_queue->split_queue_len--;
> - list_del(page_deferred_list(page));
> + list_del_init(page_deferred_list(page));
> }
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> free_compound_page(page);
> @@ -2963,6 +2975,10 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> pmde = pmd_mksoft_dirty(pmde);
> if (is_write_migration_entry(entry))
> pmde = maybe_pmd_mkwrite(pmde, vma);
> + if (unlikely(is_device_private_page(new))) {
> + entry = make_device_private_entry(new, pmd_write(pmde));
> + pmde = swp_entry_to_pmd(entry);
> + }
>
> flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
> if (PageAnon(new))
> diff --git a/mm/internal.h b/mm/internal.h
> index 9886db20d94f..58f051a14dae 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -196,7 +196,6 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
> extern void memblock_free_pages(struct page *page, unsigned long pfn,
> unsigned int order);
> extern void __free_pages_core(struct page *page, unsigned int order);
> -extern void prep_compound_page(struct page *page, unsigned int order);
> extern void post_alloc_hook(struct page *page, unsigned int order,
> gfp_t gfp_flags);
> extern int user_min_free_kbytes;
> diff --git a/mm/memory.c b/mm/memory.c
> index dc7f3543b1fd..4e03d1a2ead5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4322,9 +4322,15 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>
> barrier();
> if (unlikely(is_swap_pmd(orig_pmd))) {
> + swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
> +
> + if (is_device_private_entry(entry)) {
> + vmf.page = device_private_entry_to_page(entry);
> + return vmf.page->pgmap->ops->migrate_to_ram(&vmf);
> + }
> VM_BUG_ON(thp_migration_supported() &&
> - !is_pmd_migration_entry(orig_pmd));
> - if (is_pmd_migration_entry(orig_pmd))
> + !is_migration_entry(entry));
> + if (is_migration_entry(entry))
> pmd_migration_entry_wait(mm, vmf.pmd);
> return 0;
> }
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03e38b7a38f1..4231054188b4 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -132,8 +132,13 @@ void memunmap_pages(struct dev_pagemap *pgmap)
> int nid;
>
> dev_pagemap_kill(pgmap);
> - for_each_device_pfn(pfn, pgmap)
> - put_page(pfn_to_page(pfn));
> + for_each_device_pfn(pfn, pgmap) {
> + struct page *page = pfn_to_page(pfn);
> + unsigned int order = compound_order(page);
> +
> + put_page(page);
> + pfn += (1U << order) - 1;
> + }
> dev_pagemap_cleanup(pgmap);
>
> /* make sure to access a memmap that was actually initialized */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 87c52e0ee580..1536677cefc9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -51,6 +51,7 @@
> #include <linux/oom.h>
>
> #include <asm/tlbflush.h>
> +#include <asm/pgalloc.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/migrate.h>
> @@ -2185,6 +2186,8 @@ static int migrate_vma_collect_hole(unsigned long start,
>
> for (addr = start; addr < end; addr += PAGE_SIZE) {
> migrate->src[migrate->npages] = flags;
> + if ((addr & ~PMD_MASK) == 0 && (end & ~PMD_MASK) == 0)
> + migrate->src[migrate->npages] |= MIGRATE_PFN_COMPOUND;
> migrate->dst[migrate->npages] = 0;
> migrate->npages++;
> migrate->cpages++;
> @@ -2219,48 +2222,87 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> unsigned long addr = start, unmapped = 0;
> spinlock_t *ptl;
> pte_t *ptep;
> + pmd_t pmd;
>
> again:
> - if (pmd_none(*pmdp))
> + pmd = READ_ONCE(*pmdp);
> + if (pmd_none(pmd))
> return migrate_vma_collect_hole(start, end, -1, walk);
>
> - if (pmd_trans_huge(*pmdp)) {
> + if (pmd_trans_huge(pmd) || !pmd_present(pmd)) {
> struct page *page;
> + unsigned long write = 0;
> + int ret;
>
> ptl = pmd_lock(mm, pmdp);
> - if (unlikely(!pmd_trans_huge(*pmdp))) {
> - spin_unlock(ptl);
> - goto again;
> - }
> + if (pmd_trans_huge(*pmdp)) {
> + page = pmd_page(*pmdp);
> + if (is_huge_zero_page(page)) {
> + spin_unlock(ptl);
> + return migrate_vma_collect_hole(start, end, -1,
> + walk);
> + }
> + if (pmd_write(*pmdp))
> + write = MIGRATE_PFN_WRITE;
> + } else if (!pmd_present(*pmdp)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pmdp);
>
> - page = pmd_page(*pmdp);
> - if (is_huge_zero_page(page)) {
> - spin_unlock(ptl);
> - split_huge_pmd(vma, pmdp, addr);
> - if (pmd_trans_unstable(pmdp))
> + if (!is_device_private_entry(entry)) {
> + spin_unlock(ptl);
> return migrate_vma_collect_skip(start, end,
> walk);
> + }
> + page = device_private_entry_to_page(entry);
> + if (is_write_device_private_entry(entry))
> + write = MIGRATE_PFN_WRITE;
> } else {
> - int ret;
> + spin_unlock(ptl);
> + goto again;
> + }
>
> - get_page(page);
> + get_page(page);
> + if (unlikely(!trylock_page(page))) {
> spin_unlock(ptl);
> - if (unlikely(!trylock_page(page)))
> - return migrate_vma_collect_skip(start, end,
> - walk);
> - ret = split_huge_page(page);
> - unlock_page(page);
> put_page(page);
> - if (ret)
> - return migrate_vma_collect_skip(start, end,
> - walk);
> - if (pmd_none(*pmdp))
> - return migrate_vma_collect_hole(start, end, -1,
> - walk);
> + return migrate_vma_collect_skip(start, end, walk);
> + }
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + if ((start & HPAGE_PMD_MASK) == start &&
> + start + HPAGE_PMD_SIZE == end) {
> + struct page_vma_mapped_walk vmw = {
> + .vma = vma,
> + .address = start,
> + .pmd = pmdp,
> + .ptl = ptl,
> + };
> +
> + migrate->src[migrate->npages] = write |
> + migrate_pfn(page_to_pfn(page)) |
> + MIGRATE_PFN_MIGRATE | MIGRATE_PFN_LOCKED |
> + MIGRATE_PFN_COMPOUND;
> + migrate->dst[migrate->npages] = 0;
> + migrate->npages++;
> + migrate->cpages++;
> + migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> +
> + /* Note that this also removes the page from the rmap. */
> + set_pmd_migration_entry(&vmw, page);
> + spin_unlock(ptl);
> +
> + return 0;
> }
> +#endif
> + spin_unlock(ptl);
> + ret = split_huge_page(page);
> + unlock_page(page);
> + put_page(page);
> + if (ret)
> + return migrate_vma_collect_skip(start, end, walk);
> + if (pmd_none(*pmdp))
> + return migrate_vma_collect_hole(start, end, -1, walk);
> }
>
> - if (unlikely(pmd_bad(*pmdp)))
> + if (unlikely(pmd_bad(pmd) || pmd_devmap(pmd)))
> return migrate_vma_collect_skip(start, end, walk);
>
> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> @@ -2310,8 +2352,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> }
>
> - /* FIXME support THP */
> - if (!page || !page->mapping || PageTransCompound(page)) {
> + if (!page || !page->mapping) {
> mpfn = 0;
> goto next;
> }
> @@ -2420,14 +2461,6 @@ static bool migrate_vma_check_page(struct page *page)
> */
> int extra = 1;
>
> - /*
> - * FIXME support THP (transparent huge page), it is bit more complex to
> - * check them than regular pages, because they can be mapped with a pmd
> - * or with a pte (split pte mapping).
> - */
> - if (PageCompound(page))
> - return false;
> -
> /* Page from ZONE_DEVICE have one extra reference */
> if (is_zone_device_page(page)) {
> /*
> @@ -2726,13 +2759,115 @@ int migrate_vma_setup(struct migrate_vma *args)
> }
> EXPORT_SYMBOL(migrate_vma_setup);
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +/*
> + * This code closely follows:
> + * do_huge_pmd_anonymous_page()
> + * __do_huge_pmd_anonymous_page()
> + * except that the page being inserted is likely to be a device private page
> + * instead of an allocated or zero page.
> + */
> +static int insert_huge_pmd_anonymous_page(struct vm_area_struct *vma,
> + unsigned long haddr,
> + struct page *page,
> + unsigned long *src,
> + unsigned long *dst,
> + pmd_t *pmdp)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned int i;
> + spinlock_t *ptl;
> + bool flush = false;
> + pgtable_t pgtable;
> + gfp_t gfp;
> + pmd_t entry;
> +
> + if (WARN_ON_ONCE(compound_order(page) != HPAGE_PMD_ORDER))
> + goto abort;
> +
> + if (unlikely(anon_vma_prepare(vma)))
> + goto abort;
> +
> + prep_transhuge_page(page);
> +
> + gfp = GFP_TRANSHUGE_LIGHT;
> + if (mem_cgroup_charge(page, mm, gfp))
> + goto abort;
> +
> + pgtable = pte_alloc_one(mm);
> + if (unlikely(!pgtable))
> + goto abort;
> +
> + __SetPageUptodate(page);
> +
> + if (is_zone_device_page(page)) {
> + if (!is_device_private_page(page))
> + goto pgtable_abort;
> + entry = swp_entry_to_pmd(make_device_private_entry(page,
> + vma->vm_flags & VM_WRITE));
> + } else {
> + entry = mk_huge_pmd(page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + }
> +
> + ptl = pmd_lock(mm, pmdp);
> +
> + if (check_stable_address_space(mm))
> + goto unlock_abort;
> +
> + /*
> + * Check for userfaultfd but do not deliver the fault. Instead,
> + * just back off.
> + */
> + if (userfaultfd_missing(vma))
> + goto unlock_abort;
> +
> + if (pmd_present(*pmdp)) {
> + if (!is_huge_zero_pmd(*pmdp))
> + goto unlock_abort;
> + flush = true;
> + } else if (!pmd_none(*pmdp))
> + goto unlock_abort;
> +
> + get_page(page);
> + page_add_new_anon_rmap(page, vma, haddr, true);
> + if (!is_device_private_page(page))
> + lru_cache_add_active_or_unevictable(page, vma);
> + if (flush) {
> + pte_free(mm, pgtable);
> + flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> + pmdp_invalidate(vma, haddr, pmdp);
> + } else {
> + pgtable_trans_huge_deposit(mm, pmdp, pgtable);
> + mm_inc_nr_ptes(mm);
> + }
> + set_pmd_at(mm, haddr, pmdp, entry);
> + update_mmu_cache_pmd(vma, haddr, pmdp);
> + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> + spin_unlock(ptl);
> + count_vm_event(THP_FAULT_ALLOC);
> + count_memcg_event_mm(mm, THP_FAULT_ALLOC);
> +
> + return 0;
> +
> +unlock_abort:
> + spin_unlock(ptl);
> +pgtable_abort:
> + pte_free(mm, pgtable);
> +abort:
> + for (i = 0; i < HPAGE_PMD_NR; i++)
> + src[i] &= ~MIGRATE_PFN_MIGRATE;
> + return -EINVAL;
> +}
> +#endif
> +
> /*
> * This code closely matches the code in:
> * __handle_mm_fault()
> * handle_pte_fault()
> * do_anonymous_page()
> - * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
> - * private page.
> + * to map in an anonymous zero page except the struct page is already allocated
> + * and will likely be a ZONE_DEVICE private page.
> */
> static void migrate_vma_insert_page(struct migrate_vma *migrate,
> unsigned long addr,
> @@ -2766,7 +2901,16 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> if (!pmdp)
> goto abort;
>
> - if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + if (*dst & MIGRATE_PFN_COMPOUND) {
> + int ret = insert_huge_pmd_anonymous_page(vma, addr, page, src,
> + dst, pmdp);
> + if (ret)
> + goto abort;
> + return;
> + }
> +#endif
> + if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp) || pmd_bad(*pmdp))
> goto abort;
>
> /*
> @@ -2804,7 +2948,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>
> swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
> entry = swp_entry_to_pte(swp_entry);
> - }
> + } else
> + goto abort;
> } else {
> entry = mk_pte(page, vma->vm_page_prot);
> if (vma->vm_flags & VM_WRITE)
> @@ -2833,10 +2978,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> goto unlock_abort;
>
> inc_mm_counter(mm, MM_ANONPAGES);
> + get_page(page);
> page_add_new_anon_rmap(page, vma, addr, false);
> if (!is_zone_device_page(page))
> lru_cache_add_active_or_unevictable(page, vma);
> - get_page(page);
>
> if (flush) {
> flush_cache_page(vma, addr, pte_pfn(*ptep));
> @@ -2850,7 +2995,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> }
>
> pte_unmap_unlock(ptep, ptl);
> - *src = MIGRATE_PFN_MIGRATE;
> return;
>
> unlock_abort:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..a852ed2f204c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -686,6 +686,7 @@ void prep_compound_page(struct page *page, unsigned int order)
> if (hpage_pincount_available(page))
> atomic_set(compound_pincount_ptr(page), 0);
> }
> +EXPORT_SYMBOL_GPL(prep_compound_page);
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> unsigned int _debug_guardpage_minorder;
> --
> 2.20.1


--
Best Regards,
Yan Zi



2020-06-22 00:18:42

by Zi Yan

Subject: Re: [PATCH 14/16] mm/thp: add THP allocation helper

On 19 Jun 2020, at 17:56, Ralph Campbell wrote:

> Transparent huge page allocation policy is controlled by several sysfs
> variables. Rather than expose these to each device driver that needs to
> allocate THPs, provide a helper function.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> ---
> include/linux/gfp.h | 10 ++++++++++
> mm/huge_memory.c | 16 ++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 67a0774e080b..1c7d968a27d3 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -562,6 +562,16 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
> alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
> #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
> alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
> + unsigned long addr);
> +#else
> +static inline struct page *alloc_transhugepage(struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + return NULL;
> +}
> +#endif
>
> extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
> extern unsigned long get_zeroed_page(gfp_t gfp_mask);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 25d95f7b1e98..f749633ed350 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -775,6 +775,22 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> return __do_huge_pmd_anonymous_page(vmf, page, gfp);
> }
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +struct page *alloc_transhugepage(struct vm_area_struct *vma,
> + unsigned long haddr)
> +{
> + gfp_t gfp;
> + struct page *page;
> +
> + gfp = alloc_hugepage_direct_gfpmask(vma);
> + page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
> + if (page)
> + prep_transhuge_page(page);
> + return page;
> +}
> +EXPORT_SYMBOL_GPL(alloc_transhugepage);
> +#endif
> +
> static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
> pgtable_t pgtable)
> --
> 2.20.1

Why use CONFIG_ARCH_ENABLE_THP_MIGRATION to guard THP allocator helper?
Shouldn’t CONFIG_TRANSPARENT_HUGEPAGE be used? Also the helper still allocates
a THP even if transparent_hugepage_enabled(vma) is false, which is wrong, right?


--
Best Regards,
Yan Zi



2020-06-22 12:42:07

by Jason Gunthorpe

Subject: Re: [PATCH 00/16] mm/hmm/nouveau: THP mapping and migration

On Fri, Jun 19, 2020 at 02:56:33PM -0700, Ralph Campbell wrote:
> These patches apply to linux-5.8.0-rc1. Patches 1-3 should probably go
> into 5.8, the others can be queued for 5.9. Patches 4-6 improve the HMM
> self tests. Patch 7-8 prepare nouveau for the meat of this series which
> adds support and testing for compound page mapping of system memory
> (patches 9-11) and compound page migration to device private memory
> (patches 12-16). Since these changes are split across mm core, nouveau,
> and testing, I'm guessing Jason Gunthorpe's HMM tree would be appropriate.

You need to break this up into parts that go where they need to
go. Nouveau rc changes should go to DRM, or some series needs to
explain the linkage.

> Ralph Campbell (16):
> mm: fix migrate_vma_setup() src_owner and normal pages
> nouveau: fix migrate page regression
> nouveau: fix mixed normal and device private page migration
> mm/hmm: fix test timeout on slower machines
> mm/hmm/test: remove redundant page table invalidate
> mm/hmm: test mixed normal and device private migrations
> nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static
> nouveau/hmm: fault one page at a time
> mm/hmm: add output flag for compound page mapping
> nouveau/hmm: support mapping large sysmem pages
> hmm: add tests for HMM_PFN_COMPOUND flag
> mm/hmm: optimize migrate_vma_setup() for holes

Order things so it is hmm, test, nouveau

> mm: support THP migration to device private memory
> mm/thp: add THP allocation helper
> mm/hmm/test: add self tests for THP migration
> nouveau: support THP migration to private memory

This is another series, you should split it even if it has to go
through the hmm tree

Jason

2020-06-22 17:03:08

by Ralph Campbell

Subject: Re: [PATCH 00/16] mm/hmm/nouveau: THP mapping and migration


On 6/22/20 5:39 AM, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 02:56:33PM -0700, Ralph Campbell wrote:
>> These patches apply to linux-5.8.0-rc1. Patches 1-3 should probably go
>> into 5.8, the others can be queued for 5.9. Patches 4-6 improve the HMM
>> self tests. Patch 7-8 prepare nouveau for the meat of this series which
>> adds support and testing for compound page mapping of system memory
>> (patches 9-11) and compound page migration to device private memory
>> (patches 12-16). Since these changes are split across mm core, nouveau,
>> and testing, I'm guessing Jason Gunthorpe's HMM tree would be appropriate.
>
> You need to break this up into parts that go where they need to
> go. Nouveau rc changes should go to DRM, or some series needs to
> explain the linkage.
>
>> Ralph Campbell (16):
>> mm: fix migrate_vma_setup() src_owner and normal pages
>> nouveau: fix migrate page regression
>> nouveau: fix mixed normal and device private page migration
>> mm/hmm: fix test timeout on slower machines
>> mm/hmm/test: remove redundant page table invalidate
>> mm/hmm: test mixed normal and device private migrations
>> nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static
>> nouveau/hmm: fault one page at a time
>> mm/hmm: add output flag for compound page mapping
>> nouveau/hmm: support mapping large sysmem pages
>> hmm: add tests for HMM_PFN_COMPOUND flag
>> mm/hmm: optimize migrate_vma_setup() for holes
>
> Order things so it is hmm, test, nouveau
>
>> mm: support THP migration to device private memory
>> mm/thp: add THP allocation helper
>> mm/hmm/test: add self tests for THP migration
>> nouveau: support THP migration to private memory
>
> This is another series, you should split it even if it has to go
> through the hmm tree
>
> Jason

Thanks. I thought there was probably a better way to submit this but
I posted everything so people could see how it all fit together.

2020-06-22 19:40:33

by Ralph Campbell

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory


On 6/21/20 4:20 PM, Zi Yan wrote:
> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>
>> Support transparent huge page migration to ZONE_DEVICE private memory.
>> A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
>> indicate the huge page was fully mapped by the CPU.
>> Export prep_compound_page() so that device drivers can create huge
>> device private pages after calling memremap_pages().
>>
>> Signed-off-by: Ralph Campbell <[email protected]>
>> ---
>> include/linux/migrate.h | 1 +
>> include/linux/mm.h | 1 +
>> mm/huge_memory.c | 30 ++++--
>> mm/internal.h | 1 -
>> mm/memory.c | 10 +-
>> mm/memremap.c | 9 +-
>> mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
>> mm/page_alloc.c | 1 +
>> 8 files changed, 226 insertions(+), 53 deletions(-)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 3e546cbf03dd..f6a64965c8bd 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
>> #define MIGRATE_PFN_LOCKED (1UL << 2)
>> #define MIGRATE_PFN_WRITE (1UL << 3)
>> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
>> #define MIGRATE_PFN_SHIFT 6
>>
>> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index dc7b87310c10..020b9dd3cddb 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
>> }
>>
>> void free_compound_page(struct page *page);
>> +void prep_compound_page(struct page *page, unsigned int order);
>>
>> #ifdef CONFIG_MMU
>> /*
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 78c84bee7e29..25d95f7b1e98 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> } else {
>> struct page *page = NULL;
>> int flush_needed = 1;
>> + bool is_anon = false;
>>
>> if (pmd_present(orig_pmd)) {
>> page = pmd_page(orig_pmd);
>> + is_anon = PageAnon(page);
>> page_remove_rmap(page, true);
>> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>> VM_BUG_ON_PAGE(!PageHead(page), page);
>> } else if (thp_migration_supported()) {
>> swp_entry_t entry;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>> entry = pmd_to_swp_entry(orig_pmd);
>> - page = pfn_to_page(swp_offset(entry));
>> + if (is_device_private_entry(entry)) {
>> + page = device_private_entry_to_page(entry);
>> + is_anon = PageAnon(page);
>> + page_remove_rmap(page, true);
>> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>> + VM_BUG_ON_PAGE(!PageHead(page), page);
>> + put_page(page);
>
> Why do you hide this code behind thp_migration_supported()? It seems that you just need
> pmd swap entry not pmd migration entry. Also the condition is not consistent with the code
> in __handle_mm_fault(), in which you handle is_device_private_entry() directly without
> checking thp_migration_support().

Good point, I think "else if (thp_migration_supported())" should be
"else if (is_pmd_migration_entry(orig_pmd))" since if the PMD *is*
a device private or migration entry, then it should be handled and the
VM_BUG_ON() should be that thp_migration_supported() is true
(or maybe remove the VM_BUG_ON?).

> Do we need to support split_huge_pmd() if a page is migrated to device? Any new code
> needed in split_huge_pmd()?

I was thinking that any CPU usage of the device private page would cause it to be
migrated back to system memory as a whole PMD/PUD page but I'll double check.
At least there should be a check that the page isn't a device private page.

2020-06-22 20:13:45

by Zi Yan

Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

On 22 Jun 2020, at 15:36, Ralph Campbell wrote:

> On 6/21/20 4:20 PM, Zi Yan wrote:
>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>>
>>> Support transparent huge page migration to ZONE_DEVICE private memory.
>>> A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
>>> indicate the huge page was fully mapped by the CPU.
>>> Export prep_compound_page() so that device drivers can create huge
>>> device private pages after calling memremap_pages().
>>>
>>> Signed-off-by: Ralph Campbell <[email protected]>
>>> ---
>>> include/linux/migrate.h | 1 +
>>> include/linux/mm.h | 1 +
>>> mm/huge_memory.c | 30 ++++--
>>> mm/internal.h | 1 -
>>> mm/memory.c | 10 +-
>>> mm/memremap.c | 9 +-
>>> mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
>>> mm/page_alloc.c | 1 +
>>> 8 files changed, 226 insertions(+), 53 deletions(-)
>>>
>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>> index 3e546cbf03dd..f6a64965c8bd 100644
>>> --- a/include/linux/migrate.h
>>> +++ b/include/linux/migrate.h
>>> @@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
>>> #define MIGRATE_PFN_LOCKED (1UL << 2)
>>> #define MIGRATE_PFN_WRITE (1UL << 3)
>>> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
>>> #define MIGRATE_PFN_SHIFT 6
>>>
>>> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index dc7b87310c10..020b9dd3cddb 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
>>> }
>>>
>>> void free_compound_page(struct page *page);
>>> +void prep_compound_page(struct page *page, unsigned int order);
>>>
>>> #ifdef CONFIG_MMU
>>> /*
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 78c84bee7e29..25d95f7b1e98 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> } else {
>>> struct page *page = NULL;
>>> int flush_needed = 1;
>>> + bool is_anon = false;
>>>
>>> if (pmd_present(orig_pmd)) {
>>> page = pmd_page(orig_pmd);
>>> + is_anon = PageAnon(page);
>>> page_remove_rmap(page, true);
>>> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>>> VM_BUG_ON_PAGE(!PageHead(page), page);
>>> } else if (thp_migration_supported()) {
>>> swp_entry_t entry;
>>>
>>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>>> entry = pmd_to_swp_entry(orig_pmd);
>>> - page = pfn_to_page(swp_offset(entry));
>>> + if (is_device_private_entry(entry)) {
>>> + page = device_private_entry_to_page(entry);
>>> + is_anon = PageAnon(page);
>>> + page_remove_rmap(page, true);
>>> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>>> + VM_BUG_ON_PAGE(!PageHead(page), page);
>>> + put_page(page);
>>
>> Why do you hide this code behind thp_migration_supported()? It seems that you just need
>> pmd swap entry not pmd migration entry. Also the condition is not consistent with the code
>> in __handle_mm_fault(), in which you handle is_device_private_entry() directly without
>> checking thp_migration_support().
>
> Good point, I think "else if (thp_migration_supported())" should be
> "else if (is_pmd_migration_entry(orig_pmd))" since if the PMD *is*
> a device private or migration entry, then it should be handled and the
> VM_BUG_ON() should be that thp_migration_supported() is true
> (or maybe remove the VM_BUG_ON?).

I disagree. A device private entry is independent of a PMD migration entry, since a device private
entry is just a swap entry, which is available whenever CONFIG_TRANSPARENT_HUGEPAGE is enabled. So for
architectures that support THP but not THP migration (like ARM64), your code should still work.

I would suggest you to check all the use of is_swap_pmd() and make sure the code
can handle is_device_private_entry().

For new device private code, you might need to guard it either statically or dynamically in case
CONFIG_DEVICE_PRIVATE is disabled. Potentially, you would like to make sure a system without
CONFIG_DEVICE_PRIVATE never sees is_device_private_entry() == true, and reports an error if it somehow does.


>
>> Do we need to support split_huge_pmd() if a page is migrated to device? Any new code
>> needed in split_huge_pmd()?
>
> I was thinking that any CPU usage of the device private page would cause it to be
> migrated back to system memory as a whole PMD/PUD page but I'll double check.
> At least there should be a check that the page isn't a device private page.

Well, that depends. If we can allocate a THP in CPU memory, we can migrate the whole page back.
But if no THP can be allocated due to low free memory or memory fragmentation, I think you
might need a fallback plan: either split the device private page and migrate smaller
pages instead, or reclaim CPU memory until you get a THP. IMHO, the former might be preferred,
since the latter could cost a lot of CPU cycles and still yield no THP in the end.




Best Regards,
Yan Zi



2020-06-22 21:35:26

by Ralph Campbell

Subject: Re: [PATCH 14/16] mm/thp: add THP allocation helper


On 6/21/20 5:15 PM, Zi Yan wrote:
> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>
>> Transparent huge page allocation policy is controlled by several sysfs
>> variables. Rather than expose these to each device driver that needs to
>> allocate THPs, provide a helper function.
>>
>> Signed-off-by: Ralph Campbell <[email protected]>
>> ---
>> include/linux/gfp.h | 10 ++++++++++
>> mm/huge_memory.c | 16 ++++++++++++++++
>> 2 files changed, 26 insertions(+)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 67a0774e080b..1c7d968a27d3 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -562,6 +562,16 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
>> alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
>> #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
>> alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
>> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> +extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
>> + unsigned long addr);
>> +#else
>> +static inline struct page *alloc_transhugepage(struct vm_area_struct *vma,
>> + unsigned long addr)
>> +{
>> + return NULL;
>> +}
>> +#endif
>>
>> extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
>> extern unsigned long get_zeroed_page(gfp_t gfp_mask);
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 25d95f7b1e98..f749633ed350 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -775,6 +775,22 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> return __do_huge_pmd_anonymous_page(vmf, page, gfp);
>> }
>>
>> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> +struct page *alloc_transhugepage(struct vm_area_struct *vma,
>> + unsigned long haddr)
>> +{
>> + gfp_t gfp;
>> + struct page *page;
>> +
>> + gfp = alloc_hugepage_direct_gfpmask(vma);
>> + page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>> + if (page)
>> + prep_transhuge_page(page);
>> + return page;
>> +}
>> +EXPORT_SYMBOL_GPL(alloc_transhugepage);
>> +#endif
>> +
>> static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
>> pgtable_t pgtable)
>> --
>> 2.20.1
>
> Why use CONFIG_ARCH_ENABLE_THP_MIGRATION to guard THP allocator helper?
> Shouldn’t CONFIG_TRANSPARENT_HUGEPAGE be used? Also the helper still allocates
> a THP even if transparent_hugepage_enabled(vma) is false, which is wrong, right?
>
>
> --
> Best Regards,
> Yan Zi
>

Oops, I'm not sure why I thought that was needed. The whole file is only compiled
if CONFIG_TRANSPARENT_HUGEPAGE is defined and the calls to alloc_hugepage_vma()
and alloc_hugepage_direct_gfpmask() are unprotected just above this in
do_huge_pmd_anonymous_page(). I'll fix that in v2.

The helper is intended to be called by a device driver to allocate a THP when
migrating device private memory back to system memory. The THP should never be
migrated to device private memory in the first place if
transparent_hugepage_enabled(vma) is false.
I suppose I could add an if (WARN_ON_ONCE()) return NULL as a sanity check.
The real checks are in migrate_vma_setup() and migrate_vma_pages().

2020-06-22 21:57:49

by Zi Yan

Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

On 22 Jun 2020, at 17:31, Ralph Campbell wrote:

> On 6/22/20 1:10 PM, Zi Yan wrote:
>> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
>>
>>> On 6/21/20 4:20 PM, Zi Yan wrote:
>>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>>>>
>>>>> Support transparent huge page migration to ZONE_DEVICE private memory.
>>>>> A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
>>>>> indicate the huge page was fully mapped by the CPU.
>>>>> Export prep_compound_page() so that device drivers can create huge
>>>>> device private pages after calling memremap_pages().
>>>>>
>>>>> Signed-off-by: Ralph Campbell <[email protected]>
>>>>> ---
>>>>> include/linux/migrate.h | 1 +
>>>>> include/linux/mm.h | 1 +
>>>>> mm/huge_memory.c | 30 ++++--
>>>>> mm/internal.h | 1 -
>>>>> mm/memory.c | 10 +-
>>>>> mm/memremap.c | 9 +-
>>>>> mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
>>>>> mm/page_alloc.c | 1 +
>>>>> 8 files changed, 226 insertions(+), 53 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>>> index 3e546cbf03dd..f6a64965c8bd 100644
>>>>> --- a/include/linux/migrate.h
>>>>> +++ b/include/linux/migrate.h
>>>>> @@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
>>>>> #define MIGRATE_PFN_LOCKED (1UL << 2)
>>>>> #define MIGRATE_PFN_WRITE (1UL << 3)
>>>>> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
>>>>> #define MIGRATE_PFN_SHIFT 6
>>>>>
>>>>> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index dc7b87310c10..020b9dd3cddb 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
>>>>> }
>>>>>
>>>>> void free_compound_page(struct page *page);
>>>>> +void prep_compound_page(struct page *page, unsigned int order);
>>>>>
>>>>> #ifdef CONFIG_MMU
>>>>> /*
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 78c84bee7e29..25d95f7b1e98 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>> } else {
>>>>> struct page *page = NULL;
>>>>> int flush_needed = 1;
>>>>> + bool is_anon = false;
>>>>>
>>>>> if (pmd_present(orig_pmd)) {
>>>>> page = pmd_page(orig_pmd);
>>>>> + is_anon = PageAnon(page);
>>>>> page_remove_rmap(page, true);
>>>>> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>>>>> VM_BUG_ON_PAGE(!PageHead(page), page);
>>>>> } else if (thp_migration_supported()) {
>>>>> swp_entry_t entry;
>>>>>
>>>>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>>>>> entry = pmd_to_swp_entry(orig_pmd);
>>>>> - page = pfn_to_page(swp_offset(entry));
>>>>> + if (is_device_private_entry(entry)) {
>>>>> + page = device_private_entry_to_page(entry);
>>>>> + is_anon = PageAnon(page);
>>>>> + page_remove_rmap(page, true);
>>>>> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>>>>> + VM_BUG_ON_PAGE(!PageHead(page), page);
>>>>> + put_page(page);
>>>>
>>>> Why do you hide this code behind thp_migration_supported()? It seems that you just need
>>>> pmd swap entry not pmd migration entry. Also the condition is not consistent with the code
>>>> in __handle_mm_fault(), in which you handle is_device_private_entry() directly without
>>>> checking thp_migration_support().
>>>
>>> Good point, I think "else if (thp_migration_supported())" should be
>>> "else if (is_pmd_migration_entry(orig_pmd))" since if the PMD *is*
>>> a device private or migration entry, then it should be handled and the
>>> VM_BUG_ON() should be that thp_migration_supported() is true
>>> (or maybe remove the VM_BUG_ON?).
>>
>> I disagree. A device private entry is independent of a PMD migration entry, since a device private
>> entry is just a swap entry, which is available when CONFIG_TRANSPARENT_HUGEPAGE. So for architectures
>> support THP but not THP migration (like ARM64), your code should still work.
>
> I'll fix this up for v2 and you can double check me.

Sure.

>
>> I would suggest you to check all the use of is_swap_pmd() and make sure the code
>> can handle is_device_private_entry().
>
> OK.
>
>> For new device private code, you might need to guard it either statically or dynamically in case
>> CONFIG_DEVICE_PRIVATE is disabled. Potentially, you would like to make sure a system without
>> CONFIG_DEVICE_PRIVATE will not see is_device_private_entry() == true and give errors when it does.
>
> I have compiled and run with CONFIG_DEVICE_PRIVATE off but I can test more combinations of
> config settings.

Thanks.

>
>>>
>>>> Do we need to support split_huge_pmd() if a page is migrated to device? Any new code
>>>> needed in split_huge_pmd()?
>>>
>>> I was thinking that any CPU usage of the device private page would cause it to be
>>> migrated back to system memory as a whole PMD/PUD page but I'll double check.
>>> At least there should be a check that the page isn't a device private page.
>>
>> Well, that depends. If we can allocate a THP on CPU memory, we can migrate the whole page back.
>> But if no THP is allocated due to low on free memory or memory fragmentation, I think you
>> might need a fallback plan, either splitting the device private page and migrating smaller
>> pages instead or reclaiming CPU memory until you get a THP. IMHO, the former might be preferred,
>> since the latter might cost a lot of CPU cycles but still gives no THP after all.
>
> Sounds reasonable. I'll work on adding the fallback path for v2.

Ying (cc'd) developed the code to swap out and swap in a THP in one piece: https://lore.kernel.org/linux-mm/[email protected]/.
I am not sure whether that patchset made it into mainline or not. It could be a good technical reference
for swapping in device private pages, although swapping in pages from disk and from device private
memory are two different scenarios.

Since device private memory swapin impacts core mm performance, we might want to discuss your patches
with more people, like the people involved in Ying's patchset, in the next version.




Best Regards,
Yan Zi



2020-06-22 22:35:22

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

On Mon, Jun 22, 2020 at 2:53 PM Zi Yan <[email protected]> wrote:
>
> On 22 Jun 2020, at 17:31, Ralph Campbell wrote:
>
> > On 6/22/20 1:10 PM, Zi Yan wrote:
> >> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
> >>
> >>> On 6/21/20 4:20 PM, Zi Yan wrote:
> >>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
> >>>>
> >>>>> Support transparent huge page migration to ZONE_DEVICE private memory.
> >>>>> A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
> >>>>> indicate the huge page was fully mapped by the CPU.
> >>>>> Export prep_compound_page() so that device drivers can create huge
> >>>>> device private pages after calling memremap_pages().
> >>>>>
> >>>>> Signed-off-by: Ralph Campbell <[email protected]>
> >>>>> ---
> >>>>> include/linux/migrate.h | 1 +
> >>>>> include/linux/mm.h | 1 +
> >>>>> mm/huge_memory.c | 30 ++++--
> >>>>> mm/internal.h | 1 -
> >>>>> mm/memory.c | 10 +-
> >>>>> mm/memremap.c | 9 +-
> >>>>> mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
> >>>>> mm/page_alloc.c | 1 +
> >>>>> 8 files changed, 226 insertions(+), 53 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >>>>> index 3e546cbf03dd..f6a64965c8bd 100644
> >>>>> --- a/include/linux/migrate.h
> >>>>> +++ b/include/linux/migrate.h
> >>>>> @@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >>>>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
> >>>>> #define MIGRATE_PFN_LOCKED (1UL << 2)
> >>>>> #define MIGRATE_PFN_WRITE (1UL << 3)
> >>>>> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
> >>>>> #define MIGRATE_PFN_SHIFT 6
> >>>>>
> >>>>> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> >>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>>>> index dc7b87310c10..020b9dd3cddb 100644
> >>>>> --- a/include/linux/mm.h
> >>>>> +++ b/include/linux/mm.h
> >>>>> @@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
> >>>>> }
> >>>>>
> >>>>> void free_compound_page(struct page *page);
> >>>>> +void prep_compound_page(struct page *page, unsigned int order);
> >>>>>
> >>>>> #ifdef CONFIG_MMU
> >>>>> /*
> >>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>>> index 78c84bee7e29..25d95f7b1e98 100644
> >>>>> --- a/mm/huge_memory.c
> >>>>> +++ b/mm/huge_memory.c
> >>>>> @@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>>>> } else {
> >>>>> struct page *page = NULL;
> >>>>> int flush_needed = 1;
> >>>>> + bool is_anon = false;
> >>>>>
> >>>>> if (pmd_present(orig_pmd)) {
> >>>>> page = pmd_page(orig_pmd);
> >>>>> + is_anon = PageAnon(page);
> >>>>> page_remove_rmap(page, true);
> >>>>> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> >>>>> VM_BUG_ON_PAGE(!PageHead(page), page);
> >>>>> } else if (thp_migration_supported()) {
> >>>>> swp_entry_t entry;
> >>>>>
> >>>>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> >>>>> entry = pmd_to_swp_entry(orig_pmd);
> >>>>> - page = pfn_to_page(swp_offset(entry));
> >>>>> + if (is_device_private_entry(entry)) {
> >>>>> + page = device_private_entry_to_page(entry);
> >>>>> + is_anon = PageAnon(page);
> >>>>> + page_remove_rmap(page, true);
> >>>>> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> >>>>> + VM_BUG_ON_PAGE(!PageHead(page), page);
> >>>>> + put_page(page);
> >>>>
> >>>> Why do you hide this code behind thp_migration_supported()? It seems that you just need
> >>>> pmd swap entry not pmd migration entry. Also the condition is not consistent with the code
> >>>> in __handle_mm_fault(), in which you handle is_device_private_entry() directly without
> >>>> checking thp_migration_support().
> >>>
> >>> Good point, I think "else if (thp_migration_supported())" should be
> >>> "else if (is_pmd_migration_entry(orig_pmd))" since if the PMD *is*
> >>> a device private or migration entry, then it should be handled and the
> >>> VM_BUG_ON() should be that thp_migration_supported() is true
> >>> (or maybe remove the VM_BUG_ON?).
> >>
> >> I disagree. A device private entry is independent of a PMD migration entry, since a device private
> >> entry is just a swap entry, which is available when CONFIG_TRANSPARENT_HUGEPAGE. So for architectures
> >> support THP but not THP migration (like ARM64), your code should still work.
> >
> > I'll fix this up for v2 and you can double check me.
>
> Sure.
>
> >
> >> I would suggest you to check all the use of is_swap_pmd() and make sure the code
> >> can handle is_device_private_entry().
> >
> > OK.
> >
> >> For new device private code, you might need to guard it either statically or dynamically in case
> >> CONFIG_DEVICE_PRIVATE is disabled. Potentially, you would like to make sure a system without
> >> CONFIG_DEVICE_PRIVATE will not see is_device_private_entry() == true and give errors when it does.
> >
> > I have compiled and run with CONFIG_DEVICE_PRIVATE off but I can test more combinations of
> > config settings.
>
> Thanks.
>
> >
> >>>
> >>>> Do we need to support split_huge_pmd() if a page is migrated to device? Any new code
> >>>> needed in split_huge_pmd()?
> >>>
> >>> I was thinking that any CPU usage of the device private page would cause it to be
> >>> migrated back to system memory as a whole PMD/PUD page but I'll double check.
> >>> At least there should be a check that the page isn't a device private page.
> >>
> >> Well, that depends. If we can allocate a THP on CPU memory, we can migrate the whole page back.
> >> But if no THP is allocated due to low on free memory or memory fragmentation, I think you
> >> might need a fallback plan, either splitting the device private page and migrating smaller
> >> pages instead or reclaiming CPU memory until you get a THP. IMHO, the former might be preferred,
> >> since the latter might cost a lot of CPU cycles but still gives no THP after all.
> >
> > Sounds reasonable. I'll work on adding the fallback path for v2.
>
> Ying(cc’d) developed the code to swapout and swapin THP in one piece: https://lore.kernel.org/linux-mm/[email protected]/.
> I am not sure whether the patchset makes into mainstream or not. It could be a good technical reference
> for swapping in device private pages, although swapping in pages from disk and from device private
> memory are two different scenarios.
>
> Since the device private memory swapin impacts core mm performance, we might want to discuss your patches
> with more people, like the ones from Ying’s patchset, in the next version.

I believe Ying will give you more insights about how THP swap works.

But, IMHO, device memory migration (migrating back to system memory) seems
more like THP CoW than swap.

When migrating in:

>
>
>
> —
> Best Regards,
> Yan Zi

2020-06-22 22:39:22

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

On Mon, Jun 22, 2020 at 3:30 PM Yang Shi <[email protected]> wrote:
>
> On Mon, Jun 22, 2020 at 2:53 PM Zi Yan <[email protected]> wrote:
> >
> > On 22 Jun 2020, at 17:31, Ralph Campbell wrote:
> >
> > > On 6/22/20 1:10 PM, Zi Yan wrote:
> > >> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
> > >>
> > >>> On 6/21/20 4:20 PM, Zi Yan wrote:
> > >>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
> > >>>>
> > >>>>> Support transparent huge page migration to ZONE_DEVICE private memory.
> > >>>>> A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
> > >>>>> indicate the huge page was fully mapped by the CPU.
> > >>>>> Export prep_compound_page() so that device drivers can create huge
> > >>>>> device private pages after calling memremap_pages().
> > >>>>>
> > >>>>> Signed-off-by: Ralph Campbell <[email protected]>
> > >>>>> ---
> > >>>>> include/linux/migrate.h | 1 +
> > >>>>> include/linux/mm.h | 1 +
> > >>>>> mm/huge_memory.c | 30 ++++--
> > >>>>> mm/internal.h | 1 -
> > >>>>> mm/memory.c | 10 +-
> > >>>>> mm/memremap.c | 9 +-
> > >>>>> mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
> > >>>>> mm/page_alloc.c | 1 +
> > >>>>> 8 files changed, 226 insertions(+), 53 deletions(-)
> > >>>>>
> > >>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > >>>>> index 3e546cbf03dd..f6a64965c8bd 100644
> > >>>>> --- a/include/linux/migrate.h
> > >>>>> +++ b/include/linux/migrate.h
> > >>>>> @@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > >>>>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
> > >>>>> #define MIGRATE_PFN_LOCKED (1UL << 2)
> > >>>>> #define MIGRATE_PFN_WRITE (1UL << 3)
> > >>>>> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
> > >>>>> #define MIGRATE_PFN_SHIFT 6
> > >>>>>
> > >>>>> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> > >>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> > >>>>> index dc7b87310c10..020b9dd3cddb 100644
> > >>>>> --- a/include/linux/mm.h
> > >>>>> +++ b/include/linux/mm.h
> > >>>>> @@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
> > >>>>> }
> > >>>>>
> > >>>>> void free_compound_page(struct page *page);
> > >>>>> +void prep_compound_page(struct page *page, unsigned int order);
> > >>>>>
> > >>>>> #ifdef CONFIG_MMU
> > >>>>> /*
> > >>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > >>>>> index 78c84bee7e29..25d95f7b1e98 100644
> > >>>>> --- a/mm/huge_memory.c
> > >>>>> +++ b/mm/huge_memory.c
> > >>>>> @@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > >>>>> } else {
> > >>>>> struct page *page = NULL;
> > >>>>> int flush_needed = 1;
> > >>>>> + bool is_anon = false;
> > >>>>>
> > >>>>> if (pmd_present(orig_pmd)) {
> > >>>>> page = pmd_page(orig_pmd);
> > >>>>> + is_anon = PageAnon(page);
> > >>>>> page_remove_rmap(page, true);
> > >>>>> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> > >>>>> VM_BUG_ON_PAGE(!PageHead(page), page);
> > >>>>> } else if (thp_migration_supported()) {
> > >>>>> swp_entry_t entry;
> > >>>>>
> > >>>>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> > >>>>> entry = pmd_to_swp_entry(orig_pmd);
> > >>>>> - page = pfn_to_page(swp_offset(entry));
> > >>>>> + if (is_device_private_entry(entry)) {
> > >>>>> + page = device_private_entry_to_page(entry);
> > >>>>> + is_anon = PageAnon(page);
> > >>>>> + page_remove_rmap(page, true);
> > >>>>> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> > >>>>> + VM_BUG_ON_PAGE(!PageHead(page), page);
> > >>>>> + put_page(page);
> > >>>>
> > >>>> Why do you hide this code behind thp_migration_supported()? It seems that you just need
> > >>>> pmd swap entry not pmd migration entry. Also the condition is not consistent with the code
> > >>>> in __handle_mm_fault(), in which you handle is_device_private_entry() directly without
> > >>>> checking thp_migration_support().
> > >>>
> > >>> Good point, I think "else if (thp_migration_supported())" should be
> > >>> "else if (is_pmd_migration_entry(orig_pmd))" since if the PMD *is*
> > >>> a device private or migration entry, then it should be handled and the
> > >>> VM_BUG_ON() should be that thp_migration_supported() is true
> > >>> (or maybe remove the VM_BUG_ON?).
> > >>
> > >> I disagree. A device private entry is independent of a PMD migration entry, since a device private
> > >> entry is just a swap entry, which is available when CONFIG_TRANSPARENT_HUGEPAGE. So for architectures
> > >> support THP but not THP migration (like ARM64), your code should still work.
> > >
> > > I'll fix this up for v2 and you can double check me.
> >
> > Sure.
> >
> > >
> > >> I would suggest you to check all the use of is_swap_pmd() and make sure the code
> > >> can handle is_device_private_entry().
> > >
> > > OK.
> > >
> > >> For new device private code, you might need to guard it either statically or dynamically in case
> > >> CONFIG_DEVICE_PRIVATE is disabled. Potentially, you would like to make sure a system without
> > >> CONFIG_DEVICE_PRIVATE will not see is_device_private_entry() == true and give errors when it does.
> > >
> > > I have compiled and run with CONFIG_DEVICE_PRIVATE off but I can test more combinations of
> > > config settings.
> >
> > Thanks.
> >
> > >
> > >>>
> > >>>> Do we need to support split_huge_pmd() if a page is migrated to device? Any new code
> > >>>> needed in split_huge_pmd()?
> > >>>
> > >>> I was thinking that any CPU usage of the device private page would cause it to be
> > >>> migrated back to system memory as a whole PMD/PUD page but I'll double check.
> > >>> At least there should be a check that the page isn't a device private page.
> > >>
> > >> Well, that depends. If we can allocate a THP on CPU memory, we can migrate the whole page back.
> > >> But if no THP is allocated due to low on free memory or memory fragmentation, I think you
> > >> might need a fallback plan, either splitting the device private page and migrating smaller
> > >> pages instead or reclaiming CPU memory until you get a THP. IMHO, the former might be preferred,
> > >> since the latter might cost a lot of CPU cycles but still gives no THP after all.
> > >
> > > Sounds reasonable. I'll work on adding the fallback path for v2.
> >
> > Ying(cc’d) developed the code to swapout and swapin THP in one piece: https://lore.kernel.org/linux-mm/[email protected]/.
> > I am not sure whether the patchset makes into mainstream or not. It could be a good technical reference
> > for swapping in device private pages, although swapping in pages from disk and from device private
> > memory are two different scenarios.
> >
> > Since the device private memory swapin impacts core mm performance, we might want to discuss your patches
> > with more people, like the ones from Ying’s patchset, in the next version.
>
> I believe Ying will give you more insights about how THP swap works.
>
> But, IMHO device memory migration (migrate to system memory) seems
> like THP CoW more than swap.
>
> When migrating in:

Sorry for my fat finger; I hit the send button inadvertently. Let me finish here.

When migrating in:

- if THP is enabled: allocate a THP, but need to handle allocation
failure by falling back to base pages
- if THP is disabled: fall back to base pages

>
> >
> >
> >
> > —
> > Best Regards,
> > Yan Zi

2020-06-22 23:04:22

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

On 2020-06-22 15:33, Yang Shi wrote:
> On Mon, Jun 22, 2020 at 3:30 PM Yang Shi <[email protected]> wrote:
>> On Mon, Jun 22, 2020 at 2:53 PM Zi Yan <[email protected]> wrote:
>>> On 22 Jun 2020, at 17:31, Ralph Campbell wrote:
>>>> On 6/22/20 1:10 PM, Zi Yan wrote:
>>>>> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
>>>>>> On 6/21/20 4:20 PM, Zi Yan wrote:
>>>>>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
...
>>> Ying(cc’d) developed the code to swapout and swapin THP in one piece: https://lore.kernel.org/linux-mm/[email protected]/.
>>> I am not sure whether the patchset makes into mainstream or not. It could be a good technical reference
>>> for swapping in device private pages, although swapping in pages from disk and from device private
>>> memory are two different scenarios.
>>>
>>> Since the device private memory swapin impacts core mm performance, we might want to discuss your patches
>>> with more people, like the ones from Ying’s patchset, in the next version.
>>
>> I believe Ying will give you more insights about how THP swap works.
>>
>> But, IMHO device memory migration (migrate to system memory) seems
>> like THP CoW more than swap.


A fine point: overall, the desired behavior is "migrate", not CoW.
That's important. Migrate means that you don't leave a page behind, even
a read-only one. And that's exactly how device private migration is
specified.

We should try to avoid any erosion of clarity here. Even if somehow
(really?) the underlying implementation calls this THP CoW, the actual
goal is to migrate pages over to the device (and back).


>>
>> When migrating in:
>
> Sorry for my fat finger, hit sent button inadvertently, let me finish here.
>
> When migrating in:
>
> - if THP is enabled: allocate THP, but need handle allocation
> failure by falling back to base page
> - if THP is disabled: fallback to base page
>

OK, but *all* page entries (base and huge/large pages) need to be cleared
when migrating to device memory, unless I'm really confused here.
So: not CoW.

thanks,
--
John Hubbard
NVIDIA

2020-06-22 23:32:37

by Ralph Campbell

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory


On 6/22/20 1:10 PM, Zi Yan wrote:
> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
>
>> On 6/21/20 4:20 PM, Zi Yan wrote:
>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>>>
>>>> Support transparent huge page migration to ZONE_DEVICE private memory.
>>>> A new flag (MIGRATE_PFN_COMPOUND) is added to the input PFN array to
>>>> indicate the huge page was fully mapped by the CPU.
>>>> Export prep_compound_page() so that device drivers can create huge
>>>> device private pages after calling memremap_pages().
>>>>
>>>> Signed-off-by: Ralph Campbell <[email protected]>
>>>> ---
>>>> include/linux/migrate.h | 1 +
>>>> include/linux/mm.h | 1 +
>>>> mm/huge_memory.c | 30 ++++--
>>>> mm/internal.h | 1 -
>>>> mm/memory.c | 10 +-
>>>> mm/memremap.c | 9 +-
>>>> mm/migrate.c | 226 ++++++++++++++++++++++++++++++++--------
>>>> mm/page_alloc.c | 1 +
>>>> 8 files changed, 226 insertions(+), 53 deletions(-)
>>>>
>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>> index 3e546cbf03dd..f6a64965c8bd 100644
>>>> --- a/include/linux/migrate.h
>>>> +++ b/include/linux/migrate.h
>>>> @@ -166,6 +166,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
>>>> #define MIGRATE_PFN_LOCKED (1UL << 2)
>>>> #define MIGRATE_PFN_WRITE (1UL << 3)
>>>> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
>>>> #define MIGRATE_PFN_SHIFT 6
>>>>
>>>> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index dc7b87310c10..020b9dd3cddb 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -932,6 +932,7 @@ static inline unsigned int page_shift(struct page *page)
>>>> }
>>>>
>>>> void free_compound_page(struct page *page);
>>>> +void prep_compound_page(struct page *page, unsigned int order);
>>>>
>>>> #ifdef CONFIG_MMU
>>>> /*
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 78c84bee7e29..25d95f7b1e98 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1663,23 +1663,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>> } else {
>>>> struct page *page = NULL;
>>>> int flush_needed = 1;
>>>> + bool is_anon = false;
>>>>
>>>> if (pmd_present(orig_pmd)) {
>>>> page = pmd_page(orig_pmd);
>>>> + is_anon = PageAnon(page);
>>>> page_remove_rmap(page, true);
>>>> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>>>> VM_BUG_ON_PAGE(!PageHead(page), page);
>>>> } else if (thp_migration_supported()) {
>>>> swp_entry_t entry;
>>>>
>>>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>>>> entry = pmd_to_swp_entry(orig_pmd);
>>>> - page = pfn_to_page(swp_offset(entry));
>>>> + if (is_device_private_entry(entry)) {
>>>> + page = device_private_entry_to_page(entry);
>>>> + is_anon = PageAnon(page);
>>>> + page_remove_rmap(page, true);
>>>> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>>>> + VM_BUG_ON_PAGE(!PageHead(page), page);
>>>> + put_page(page);
>>>
>>> Why do you hide this code behind thp_migration_supported()? It seems that you just need
>>> pmd swap entry not pmd migration entry. Also the condition is not consistent with the code
>>> in __handle_mm_fault(), in which you handle is_device_private_entry() directly without
>>> checking thp_migration_support().
>>
>> Good point, I think "else if (thp_migration_supported())" should be
>> "else if (is_pmd_migration_entry(orig_pmd))" since if the PMD *is*
>> a device private or migration entry, then it should be handled and the
>> VM_BUG_ON() should be that thp_migration_supported() is true
>> (or maybe remove the VM_BUG_ON?).
>
> I disagree. A device private entry is independent of a PMD migration entry, since a device private
> entry is just a swap entry, which is available when CONFIG_TRANSPARENT_HUGEPAGE. So for architectures
> support THP but not THP migration (like ARM64), your code should still work.

I'll fix this up for v2 and you can double check me.

> I would suggest you to check all the use of is_swap_pmd() and make sure the code
> can handle is_device_private_entry().

OK.

> For new device private code, you might need to guard it either statically or dynamically in case
> CONFIG_DEVICE_PRIVATE is disabled. Potentially, you would like to make sure a system without
> CONFIG_DEVICE_PRIVATE will not see is_device_private_entry() == true and give errors when it does.

I have compiled and run with CONFIG_DEVICE_PRIVATE off but I can test more combinations of
config settings.

>>
>>> Do we need to support split_huge_pmd() if a page is migrated to device? Any new code
>>> needed in split_huge_pmd()?
>>
>> I was thinking that any CPU usage of the device private page would cause it to be
>> migrated back to system memory as a whole PMD/PUD page but I'll double check.
>> At least there should be a check that the page isn't a device private page.
>
> Well, that depends. If we can allocate a THP on CPU memory, we can migrate the whole page back.
> But if no THP is allocated due to low on free memory or memory fragmentation, I think you
> might need a fallback plan, either splitting the device private page and migrating smaller
> pages instead or reclaiming CPU memory until you get a THP. IMHO, the former might be preferred,
> since the latter might cost a lot of CPU cycles but still gives no THP after all.

Sounds reasonable. I'll work on adding the fallback path for v2.

2020-06-23 02:01:23

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

On Mon, Jun 22, 2020 at 4:02 PM John Hubbard <[email protected]> wrote:
>
> On 2020-06-22 15:33, Yang Shi wrote:
> > On Mon, Jun 22, 2020 at 3:30 PM Yang Shi <[email protected]> wrote:
> >> On Mon, Jun 22, 2020 at 2:53 PM Zi Yan <[email protected]> wrote:
> >>> On 22 Jun 2020, at 17:31, Ralph Campbell wrote:
> >>>> On 6/22/20 1:10 PM, Zi Yan wrote:
> >>>>> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
> >>>>>> On 6/21/20 4:20 PM, Zi Yan wrote:
> >>>>>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
> ...
> >>> Ying(cc’d) developed the code to swapout and swapin THP in one piece: https://lore.kernel.org/linux-mm/[email protected]/.
> >>> I am not sure whether the patchset makes into mainstream or not. It could be a good technical reference
> >>> for swapping in device private pages, although swapping in pages from disk and from device private
> >>> memory are two different scenarios.
> >>>
> >>> Since the device private memory swapin impacts core mm performance, we might want to discuss your patches
> >>> with more people, like the ones from Ying’s patchset, in the next version.
> >>
> >> I believe Ying will give you more insights about how THP swap works.
> >>
> >> But, IMHO device memory migration (migrate to system memory) seems
> >> like THP CoW more than swap.
>
>
> A fine point: overall, the desired behavior is "migrate", not CoW.
> That's important. Migrate means that you don't leave a page behind, even
> a read-only one. And that's exactly how device private migration is
> specified.
>
> We should try to avoid any erosion of clarity here. Even if somehow
> (really?) the underlying implementation calls this THP CoW, the actual
> goal is to migrate pages over to the device (and back).
>
>
> >>
> >> When migrating in:
> >
> > Sorry for my fat finger, hit sent button inadvertently, let me finish here.
> >
> > When migrating in:
> >
> > - if THP is enabled: allocate THP, but need handle allocation
> > failure by falling back to base page
> > - if THP is disabled: fallback to base page
> >
>
> OK, but *all* page entries (base and huge/large pages) need to be cleared,
> when migrating to device memory, unless I'm really confused here.
> So: not CoW.

I realize my comment caused more confusion. I apologize for the
confusion. Yes, the trigger conditions for swap/migration and CoW are
definitely different. Here I mean the fault handling part of migrating
back into system memory.

Swap-in just needs to handle the base page case since THP swapin is
not supported upstream yet and the PMD is split in the swap-out phase
(see shrink_page_list()).

The patch adds THP migration support to device memory, but you need to
handle the migrate-in (back to system memory) case correctly. The fault
handling should look like the THP CoW fault handling behavior (before
5.8):
- if THP is enabled: allocate a THP, and fall back if the allocation fails
- if THP is disabled: fall back to a base page

Swap fault handling doesn't look like the above. So, I said it seems
more like THP CoW (the fault handling part only, before 5.8). I hope I
have articulated my point.

However, I didn't see such a fallback being handled. It looks like if the
THP allocation fails, it just returns SIGBUS, and there is no check of the
THP status, if I read the patches correctly. THP might be disabled for
the specific vma or system wide before migrating from device memory
back to system memory.

>
> thanks,
> --
> John Hubbard
> NVIDIA

2020-06-23 02:01:33

by Ralph Campbell

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory


On 6/22/20 4:54 PM, Yang Shi wrote:
> On Mon, Jun 22, 2020 at 4:02 PM John Hubbard <[email protected]> wrote:
>>
>> On 2020-06-22 15:33, Yang Shi wrote:
>>> On Mon, Jun 22, 2020 at 3:30 PM Yang Shi <[email protected]> wrote:
>>>> On Mon, Jun 22, 2020 at 2:53 PM Zi Yan <[email protected]> wrote:
>>>>> On 22 Jun 2020, at 17:31, Ralph Campbell wrote:
>>>>>> On 6/22/20 1:10 PM, Zi Yan wrote:
>>>>>>> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
>>>>>>>> On 6/21/20 4:20 PM, Zi Yan wrote:
>>>>>>>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>> ...
>>>>> Ying(cc’d) developed the code to swapout and swapin THP in one piece: https://lore.kernel.org/linux-mm/[email protected]/.
>>>>> I am not sure whether the patchset makes into mainstream or not. It could be a good technical reference
>>>>> for swapping in device private pages, although swapping in pages from disk and from device private
>>>>> memory are two different scenarios.
>>>>>
>>>>> Since the device private memory swapin impacts core mm performance, we might want to discuss your patches
>>>>> with more people, like the ones from Ying’s patchset, in the next version.
>>>>
>>>> I believe Ying will give you more insights about how THP swap works.
>>>>
>>>> But, IMHO device memory migration (migrate to system memory) seems
>>>> like THP CoW more than swap.
>>
>>
>> A fine point: overall, the desired behavior is "migrate", not CoW.
>> That's important. Migrate means that you don't leave a page behind, even
>> a read-only one. And that's exactly how device private migration is
>> specified.
>>
>> We should try to avoid any erosion of clarity here. Even if somehow
>> (really?) the underlying implementation calls this THP CoW, the actual
>> goal is to migrate pages over to the device (and back).
>>
>>
>>>>
>>>> When migrating in:
>>>
>>> Sorry for my fat finger, hit sent button inadvertently, let me finish here.
>>>
>>> When migrating in:
>>>
>>> - if THP is enabled: allocate THP, but need handle allocation
>>> failure by falling back to base page
>>> - if THP is disabled: fallback to base page
>>>
>>
>> OK, but *all* page entries (base and huge/large pages) need to be cleared,
>> when migrating to device memory, unless I'm really confused here.
>> So: not CoW.
>
> I realized the comment caused more confusion. I apologize for the
> confusion. Yes, the trigger condition for swap/migration and CoW are
> definitely different. Here I mean the fault handling part of migrating
> into system memory.
>
> Swap-in just needs to handle the base page case since THP swapin is
> not supported in upstream yet and the PMD is split in swap-out phase
> (see shrink_page_list).
>
> The patch adds THP migration support to device memory, but you need to
> handle migrate in (back to system memory) case correctly. The fault
> handling should look like THP CoW fault handling behavior (before
> 5.8):
> - if THP is enabled: allocate THP, fallback if allocation is failed
> - if THP is disabled: fallback to base page
>
> Swap fault handling doesn't look like the above. So I said it seems
> more like THP CoW (the fault handling part only, before 5.8). I hope
> that makes my point clear.
>
> However, I didn't see such a fallback being handled. It looks like if
> THP allocation fails, it just returns SIGBUS, and there is no check of
> the THP status, if I read the patches correctly. THP might be disabled
> for the specific vma or system-wide before migrating from device memory
> back to system memory.

You are correct, the patch wasn't handling the fallback case.
I'll add that in the next version.

>>
>> thanks,
>> --
>> John Hubbard
>> NVIDIA

2020-06-23 02:53:43

by Huang, Ying

Subject: Re: [PATCH 13/16] mm: support THP migration to device private memory

Ralph Campbell <[email protected]> writes:

> On 6/22/20 4:54 PM, Yang Shi wrote:
>> On Mon, Jun 22, 2020 at 4:02 PM John Hubbard <[email protected]> wrote:
>>>
>>> On 2020-06-22 15:33, Yang Shi wrote:
>>>> On Mon, Jun 22, 2020 at 3:30 PM Yang Shi <[email protected]> wrote:
>>>>> On Mon, Jun 22, 2020 at 2:53 PM Zi Yan <[email protected]> wrote:
>>>>>> On 22 Jun 2020, at 17:31, Ralph Campbell wrote:
>>>>>>> On 6/22/20 1:10 PM, Zi Yan wrote:
>>>>>>>> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
>>>>>>>>> On 6/21/20 4:20 PM, Zi Yan wrote:
>>>>>>>>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>>> ...
>>>>>> Ying(cc’d) developed the code to swapout and swapin THP in one piece: https://lore.kernel.org/linux-mm/[email protected]/.
>>>>>> I am not sure whether the patchset makes into mainstream or not. It could be a good technical reference
>>>>>> for swapping in device private pages, although swapping in pages from disk and from device private
>>>>>> memory are two different scenarios.
>>>>>>
>>>>>> Since the device private memory swapin impacts core mm performance, we might want to discuss your patches
>>>>>> with more people, like the ones from Ying’s patchset, in the next version.
>>>>>
>>>>> I believe Ying will give you more insights about how THP swap works.
>>>>>
>>>>> But, IMHO device memory migration (migrate to system memory) seems
>>>>> like THP CoW more than swap.
>>>
>>>
>>> A fine point: overall, the desired behavior is "migrate", not CoW.
>>> That's important. Migrate means that you don't leave a page behind, even
>>> a read-only one. And that's exactly how device private migration is
>>> specified.
>>>
>>> We should try to avoid any erosion of clarity here. Even if somehow
>>> (really?) the underlying implementation calls this THP CoW, the actual
>>> goal is to migrate pages over to the device (and back).
>>>
>>>
>>>>>
>>>>> When migrating in:
>>>>
>>>> Sorry for my fat finger, I hit the send button inadvertently; let me finish here.
>>>>
>>>> When migrating in:
>>>>
>>>> - if THP is enabled: allocate a THP, but handle allocation
>>>> failure by falling back to a base page
>>>> - if THP is disabled: fall back to a base page
>>>>
>>>
>>> OK, but *all* page entries (base and huge/large pages) need to be
>>> cleared when migrating to device memory, unless I'm really confused
>>> here. So: not CoW.
>>
>> I realized the comment caused more confusion. I apologize for the
>> confusion. Yes, the trigger condition for swap/migration and CoW are
>> definitely different. Here I mean the fault handling part of migrating
>> into system memory.
>>
>> Swap-in just needs to handle the base page case since THP swap-in is
>> not supported upstream yet and the PMD is split in the swap-out phase
>> (see shrink_page_list()).
>>
>> The patch adds THP migration support to device memory, but you need to
>> handle the migrate-in (back to system memory) case correctly. The fault
>> handling should look like the THP CoW fault handling behavior (before
>> 5.8):
>> - if THP is enabled: allocate a THP, falling back if allocation fails
>> - if THP is disabled: fall back to a base page
>>
>> Swap fault handling doesn't look like the above. So I said it seems
>> more like THP CoW (the fault handling part only, before 5.8). I hope
>> that makes my point clear.
>>
>> However, I didn't see such a fallback being handled. It looks like if
>> THP allocation fails, it just returns SIGBUS, and there is no check of
>> the THP status, if I read the patches correctly. THP might be disabled
>> for the specific vma or system-wide before migrating from device memory
>> back to system memory.
>
> You are correct, the patch wasn't handling the fallback case.
> I'll add that in the next version.

For the fallback, you need to split the THP in device memory first,
because you will then migrate base pages instead of a whole THP.

Best Regards,
Huang, Ying

>>>
>>> thanks,
>>> --
>>> John Hubbard
>>> NVIDIA