2019-01-08 04:52:46

by Aneesh Kumar K.V

Subject: [PATCH V6 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

ppc64 uses the CMA area for the allocation of the guest page table (hash page table). We won't
be able to start a guest if we fail to allocate the hash page table. We have observed
hash page table allocation failures because we failed to migrate pinned pages out of the
CMA region. This happens when we are using VFIO: VFIO on ppc64 pins
the entire guest RAM. If guest RAM pages get allocated out of the CMA region, we
won't be able to migrate those pages. The pages are also pinned for the lifetime of the
guest.

Currently we only support migration of non-compound pages. With THP, and with the addition of
hugetlb migration, we can end up allocating compound pages from the CMA region. This
patch series adds support for migrating compound pages. The helper
get_user_pages_cma_migrate() added by this series pins the pages while making sure they are
migrated out of the CMA region before the reference count is incremented.

Changes from V5:
* Add PF_MEMALLOC_NOCMA
* Remove __GFP_THISNODE when allocating the target page for migration

Changes from V4:
* use __GFP_NOWARN when allocating pages to avoid page allocation failure warnings.

Changes from V3:
* Move the hugetlb check before transhuge check
* Use compound head page when isolating hugetlb page


Aneesh Kumar K.V (4):
mm/cma: Add PF flag to force non cma alloc
mm: Add get_user_pages_cma_migrate
powerpc/mm/iommu: Allow migration of cma allocated pages during
mm_iommu_get
powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing

arch/powerpc/mm/mmu_context_iommu.c | 144 ++++++++-------------------
include/linux/hugetlb.h | 2 +
include/linux/migrate.h | 3 +
include/linux/sched.h | 1 +
include/linux/sched/mm.h | 36 +++++--
mm/hugetlb.c | 4 +-
mm/migrate.c | 149 ++++++++++++++++++++++++++++
7 files changed, 227 insertions(+), 112 deletions(-)

--
2.20.1



2019-01-08 04:52:54

by Aneesh Kumar K.V

Subject: [PATCH V6 1/4] mm/cma: Add PF flag to force non cma alloc

This patch adds PF_MEMALLOC_NOCMA, which makes sure any allocation in that context
is marked non-movable and hence cannot be satisfied from the CMA region.

This is useful with get_user_pages_cma_migrate, where we take a page pin after
migrating pages out of the CMA region. Marking the section PF_MEMALLOC_NOCMA ensures
that we avoid unnecessary page migration later.
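
A minimal usage sketch of the new scope API, assuming allocations inside the
section go through current_gfp_context() as described above (variable names are
illustrative only):

	unsigned int noncma_flags;
	struct page *page;

	noncma_flags = memalloc_nocma_save();
	/* intended effect: __GFP_MOVABLE is cleared, so the page should not come from CMA */
	page = alloc_page(GFP_HIGHUSER_MOVABLE);
	memalloc_nocma_restore(noncma_flags);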

Suggested-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/sched.h | 1 +
include/linux/sched/mm.h | 36 ++++++++++++++++++++++++++++--------
2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 89541d248893..047c8c469774 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,7 @@ extern struct pid *cad_pid;
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */
+#define PF_MEMALLOC_NOCMA 0x02000000 /* All allocation request will have _GFP_MOVABLE cleared */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3bfa6a0cbba4..b336e7e2ca49 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -148,17 +148,24 @@ static inline bool in_vfork(struct task_struct *tsk)
* Applies per-task gfp context to the given allocation flags.
* PF_MEMALLOC_NOIO implies GFP_NOIO
* PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
*/
static inline gfp_t current_gfp_context(gfp_t flags)
{
- /*
- * NOIO implies both NOIO and NOFS and it is a weaker context
- * so always make sure it makes precedence
- */
- if (unlikely(current->flags & PF_MEMALLOC_NOIO))
- flags &= ~(__GFP_IO | __GFP_FS);
- else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
- flags &= ~__GFP_FS;
+ if (unlikely(current->flags &
+ (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+ /*
+ * NOIO implies both NOIO and NOFS and it is a weaker context
+ * so always make sure it makes precedence
+ */
+ if (current->flags & PF_MEMALLOC_NOIO)
+ flags &= ~(__GFP_IO | __GFP_FS);
+ else if (current->flags & PF_MEMALLOC_NOFS)
+ flags &= ~__GFP_FS;
+
+ if (current->flags & PF_MEMALLOC_NOCMA)
+ flags &= ~__GFP_MOVABLE;
+ }
return flags;
}

@@ -248,6 +255,19 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC) | flags;
}

+static inline unsigned int memalloc_nocma_save(void)
+{
+ unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+
+ current->flags |= PF_MEMALLOC_NOCMA;
+ return flags;
+}
+
+static inline void memalloc_nocma_restore(unsigned int flags)
+{
+ current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+}
+
#ifdef CONFIG_MEMCG
/**
* memalloc_use_memcg - Starts the remote memcg charging scope.
--
2.20.1


2019-01-08 04:53:04

by Aneesh Kumar K.V

Subject: [PATCH V6 2/4] mm: Add get_user_pages_cma_migrate

This helper does a get_user_pages_fast, making sure we migrate pages found in the
CMA area before taking the page reference. This makes sure that we don't keep
non-movable pages (due to the page reference count) in the CMA area.

This will be used by ppc64 in a later patch to avoid pinning pages in the CMA
region. ppc64 uses the CMA region for the allocation of the hardware page table (hash
page table), and not being able to migrate pages out of the CMA region results in
page table allocation failures.

One case where we hit this easily is a guest using a VFIO passthrough device.
VFIO locks all the guest's memory, and if the guest memory is backed by the CMA
region it becomes unmovable, fragmenting the CMA area and possibly preventing
other guests from allocating a large enough hash page table.

NOTE: We allocate the new page without using __GFP_THISNODE.
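
A rough caller sketch of the helper (mirroring how the ppc64 IOMMU code in a later
patch is expected to use it; names and error handling are abbreviated):

	struct page **pages;
	int i, pinned;

	pages = vmalloc(entries * sizeof(*pages));
	if (!pages)
		return -ENOMEM;

	pinned = get_user_pages_cma_migrate(ua, entries, 1 /* write */, pages);
	if (pinned != entries) {
		/* drop whatever was pinned and bail out */
		for (i = 0; i < pinned; i++)
			put_page(pages[i]);
		vfree(pages);
		return -EFAULT;
	}
	/* pages[] now holds pinned pages backed by non-CMA memory */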

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 2 +
include/linux/migrate.h | 3 +
mm/hugetlb.c | 4 +-
mm/migrate.c | 149 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 156 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 087fd5f48c91..1eed0cdaec0e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -371,6 +371,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask);
struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
unsigned long address);
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+ int nid, nodemask_t *nmask);
int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf2f9a5..bc83e12a06e9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -285,6 +285,9 @@ static inline int migrate_vma(const struct migrate_vma_ops *ops,
}
#endif /* IS_ENABLED(CONFIG_MIGRATE_VMA_HELPER) */

+extern int get_user_pages_cma_migrate(unsigned long start, int nr_pages, int write,
+ struct page **pages);
+
#endif /* CONFIG_MIGRATION */

#endif /* _LINUX_MIGRATE_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 745088810965..fc4afaec1055 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1586,8 +1586,8 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
return page;
}

-static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
- int nid, nodemask_t *nmask)
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+ int nid, nodemask_t *nmask)
{
struct page *page;

diff --git a/mm/migrate.c b/mm/migrate.c
index ccf8966caf6f..5e21c7aee942 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2982,3 +2982,152 @@ int migrate_vma(const struct migrate_vma_ops *ops,
}
EXPORT_SYMBOL(migrate_vma);
#endif /* defined(MIGRATE_VMA_HELPER) */
+
+static struct page *new_non_cma_page(struct page *page, unsigned long private)
+{
+ /*
+ * We want to make sure we allocate the new page from the same node
+ * as the source page.
+ */
+ int nid = page_to_nid(page);
+ /*
+ * Trying to allocate a page for migration. Ignore allocation
+ * failure warnings. We don't force __GFP_THISNODE here because
+ * this node here is the node where we have CMA reservation and
+ * in some case these nodes will have really less non movable
+ * allocation memory.
+ */
+ gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
+
+ if (PageHighMem(page))
+ gfp_mask |= __GFP_HIGHMEM;
+
+#ifdef CONFIG_HUGETLB_PAGE
+ if (PageHuge(page)) {
+ struct hstate *h = page_hstate(page);
+ /*
+ * We don't want to dequeue from the pool because pool pages will
+ * mostly be from the CMA region.
+ */
+ return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+ }
+#endif
+ if (PageTransHuge(page)) {
+ struct page *thp;
+ /*
+ * ignore allocation failure warnings
+ */
+ gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
+
+ /*
+ * Remove the movable mask so that we don't allocate from
+ * CMA area again.
+ */
+ thp_gfpmask &= ~__GFP_MOVABLE;
+ thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ }
+
+ return __alloc_pages_node(nid, gfp_mask, 0);
+}
+
+/**
+ * get_user_pages_cma_migrate() - pin user pages in memory by migrating pages in CMA region
+ * @start: starting user address
+ * @nr_pages: number of pages from start to pin
+ * @write: whether pages will be written to
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * If the pinned pages are backed by CMA region, we migrate those pages out,
+ * allocating new pages from non-CMA region. This helps in avoiding keeping
+ * pages pinned in the CMA region for a long time thereby resulting in
+ * CMA allocation failures.
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno.
+ */
+
+int get_user_pages_cma_migrate(unsigned long start, int nr_pages, int write,
+ struct page **pages)
+{
+ int i, ret;
+ unsigned long flags;
+ bool drain_allow = true;
+ bool migrate_allow = true;
+ LIST_HEAD(cma_page_list);
+
+get_user_again:
+ /*
+ * If get_user_pages ends up allocating pages, make sure we don't
+ * allocate from CMA region so that we can avoid the migration below.
+ */
+ flags = memalloc_nocma_save();
+ ret = get_user_pages_fast(start, nr_pages, write, pages);
+ memalloc_nocma_restore(flags);
+ if (ret <= 0)
+ return ret;
+
+ for (i = 0; i < ret; ++i) {
+ /*
+ * If we get a page from the CMA zone, since we are going to
+ * be pinning these entries, we might as well move them out
+ * of the CMA zone if possible.
+ */
+ if (is_migrate_cma_page(pages[i]) && migrate_allow) {
+
+ struct page *head = compound_head(pages[i]);
+
+ if (PageHuge(head)) {
+ isolate_huge_page(head, &cma_page_list);
+ } else {
+ if (!PageLRU(head) && drain_allow) {
+ lru_add_drain_all();
+ drain_allow = false;
+ }
+
+ if (!isolate_lru_page(head)) {
+ list_add_tail(&head->lru, &cma_page_list);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON +
+ page_is_file_cache(head),
+ hpage_nr_pages(head));
+ }
+ }
+ }
+ }
+ if (!list_empty(&cma_page_list)) {
+ /*
+ * drop the above get_user_pages reference.
+ */
+ for (i = 0; i < ret; ++i)
+ put_page(pages[i]);
+
+ if (migrate_pages(&cma_page_list, new_non_cma_page,
+ NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+ /*
+ * some of the pages failed migration. Do get_user_pages
+ * without migration.
+ */
+ migrate_allow = false;
+
+ if (!list_empty(&cma_page_list))
+ putback_movable_pages(&cma_page_list);
+ }
+ /*
+ * We did migrate all the pages, Try to get the page references again
+ * migrating any new CMA pages which we failed to isolate earlier.
+ */
+ drain_allow = true;
+ goto get_user_again;
+ }
+ return ret;
+}
--
2.20.1


2019-01-08 04:54:19

by Aneesh Kumar K.V

Subject: [PATCH V6 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get

The current code doesn't do page migration if the allocated page is a compound page.
With hugetlb migration support, we can end up allocating hugetlb pages from the
CMA region; THP pages can be allocated from the CMA region as well. This patch updates
the code to handle compound pages correctly.

This uses the new helper get_user_pages_cma_migrate. It does a single get_user_pages
call with the right count, instead of one get_user_pages call per page. That avoids
walking the page table multiple times.

The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union.
We use the same storage location to store pointers to struct page. We cannot
switch all the code paths to struct page *, because we access hpas in real mode
and we can't do the struct page * to pfn conversion in real mode.
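
Condensed, the resulting flow in mm_iommu_do_alloc (shown in full in the hunk
below) is:

	/* pin time: hpages[] receives struct page pointers */
	ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);

	/*
	 * Later in the same function each slot is overwritten in place with
	 * the physical address, which is all that real mode reads via hpas[].
	 */
	for (i = 0; i < entries; ++i) {
		struct page *page = mem->hpages[i];

		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
	}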

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/powerpc/mm/mmu_context_iommu.c | 124 +++++++++-------------------
1 file changed, 37 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index a712a650a8b6..52ccab294b47 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -21,6 +21,7 @@
#include <linux/sizes.h>
#include <asm/mmu_context.h>
#include <asm/pte-walk.h>
+#include <linux/mm_inline.h>

static DEFINE_MUTEX(mem_list_mutex);

@@ -34,8 +35,18 @@ struct mm_iommu_table_group_mem_t {
atomic64_t mapped;
unsigned int pageshift;
u64 ua; /* userspace address */
- u64 entries; /* number of entries in hpas[] */
- u64 *hpas; /* vmalloc'ed */
+ u64 entries; /* number of entries in hpas/hpages[] */
+ /*
+ * in mm_iommu_get we temporarily use this to store
+ * struct page address.
+ *
+ * We need to convert ua to hpa in real mode. Make it
+ * simpler by storing physical address.
+ */
+ union {
+ struct page **hpages; /* vmalloc'ed */
+ phys_addr_t *hpas;
+ };
#define MM_IOMMU_TABLE_INVALID_HPA ((uint64_t)-1)
u64 dev_hpa; /* Device memory base address */
};
@@ -80,64 +91,15 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(mm_iommu_preregistered);

-/*
- * Taken from alloc_migrate_target with changes to remove CMA allocations
- */
-struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
-{
- gfp_t gfp_mask = GFP_USER;
- struct page *new_page;
-
- if (PageCompound(page))
- return NULL;
-
- if (PageHighMem(page))
- gfp_mask |= __GFP_HIGHMEM;
-
- /*
- * We don't want the allocation to force an OOM if possibe
- */
- new_page = alloc_page(gfp_mask | __GFP_NORETRY | __GFP_NOWARN);
- return new_page;
-}
-
-static int mm_iommu_move_page_from_cma(struct page *page)
-{
- int ret = 0;
- LIST_HEAD(cma_migrate_pages);
-
- /* Ignore huge pages for now */
- if (PageCompound(page))
- return -EBUSY;
-
- lru_add_drain();
- ret = isolate_lru_page(page);
- if (ret)
- return ret;
-
- list_add(&page->lru, &cma_migrate_pages);
- put_page(page); /* Drop the gup reference */
-
- ret = migrate_pages(&cma_migrate_pages, new_iommu_non_cma_page,
- NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE);
- if (ret) {
- if (!list_empty(&cma_migrate_pages))
- putback_movable_pages(&cma_migrate_pages);
- }
-
- return 0;
-}
-
static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
- unsigned long entries, unsigned long dev_hpa,
- struct mm_iommu_table_group_mem_t **pmem)
+ unsigned long entries, unsigned long dev_hpa,
+ struct mm_iommu_table_group_mem_t **pmem)
{
struct mm_iommu_table_group_mem_t *mem;
- long i, j, ret = 0, locked_entries = 0;
+ long i, ret = 0, locked_entries = 0;
unsigned int pageshift;
unsigned long flags;
unsigned long cur_ua;
- struct page *page = NULL;

mutex_lock(&mem_list_mutex);

@@ -187,41 +149,25 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
goto unlock_exit;
}

+ ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);
+ if (ret != entries) {
+ /* free the reference taken */
+ for (i = 0; i < ret; i++)
+ put_page(mem->hpages[i]);
+
+ vfree(mem->hpas);
+ kfree(mem);
+ ret = -EFAULT;
+ goto unlock_exit;
+ } else {
+ ret = 0;
+ }
+
+ pageshift = PAGE_SHIFT;
for (i = 0; i < entries; ++i) {
+ struct page *page = mem->hpages[i];
+
cur_ua = ua + (i << PAGE_SHIFT);
- if (1 != get_user_pages_fast(cur_ua,
- 1/* pages */, 1/* iswrite */, &page)) {
- ret = -EFAULT;
- for (j = 0; j < i; ++j)
- put_page(pfn_to_page(mem->hpas[j] >>
- PAGE_SHIFT));
- vfree(mem->hpas);
- kfree(mem);
- goto unlock_exit;
- }
- /*
- * If we get a page from the CMA zone, since we are going to
- * be pinning these entries, we might as well move them out
- * of the CMA zone if possible. NOTE: faulting in + migration
- * can be expensive. Batching can be considered later
- */
- if (is_migrate_cma_page(page)) {
- if (mm_iommu_move_page_from_cma(page))
- goto populate;
- if (1 != get_user_pages_fast(cur_ua,
- 1/* pages */, 1/* iswrite */,
- &page)) {
- ret = -EFAULT;
- for (j = 0; j < i; ++j)
- put_page(pfn_to_page(mem->hpas[j] >>
- PAGE_SHIFT));
- vfree(mem->hpas);
- kfree(mem);
- goto unlock_exit;
- }
- }
-populate:
- pageshift = PAGE_SHIFT;
if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
pte_t *pte;
struct page *head = compound_head(page);
@@ -239,6 +185,10 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
local_irq_restore(flags);
}
mem->pageshift = min(mem->pageshift, pageshift);
+ /*
+ * We don't need struct page reference any more, switch
+ * to physical address.
+ */
mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
}

--
2.20.1


2019-01-08 04:59:51

by Aneesh Kumar K.V

Subject: [PATCH V6 4/4] powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing

THP pages can get split along different code paths. An incremented reference
count does imply we will not split the compound page itself, but the pmd entry can
still be converted to level 4 (base page size) pte entries. Keep the code simpler by
allowing a large IOMMU page size only if the guest RAM is backed by hugetlb pages.
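
Condensed, the per-entry page-shift selection after this patch looks like the
following; only hugetlb-backed pages, whose mapping geometry cannot change under
the pin, are allowed to raise the IOMMU page shift:

	if (mem->pageshift > PAGE_SHIFT && PageHuge(page))
		pageshift = compound_order(compound_head(page)) + PAGE_SHIFT;

	mem->pageshift = min(mem->pageshift, pageshift);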

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/powerpc/mm/mmu_context_iommu.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 52ccab294b47..62c7590378d4 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -98,8 +98,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
struct mm_iommu_table_group_mem_t *mem;
long i, ret = 0, locked_entries = 0;
unsigned int pageshift;
- unsigned long flags;
- unsigned long cur_ua;

mutex_lock(&mem_list_mutex);

@@ -167,22 +165,14 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
for (i = 0; i < entries; ++i) {
struct page *page = mem->hpages[i];

- cur_ua = ua + (i << PAGE_SHIFT);
- if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
- pte_t *pte;
+ /*
+ * Allow to use larger than 64k IOMMU pages. Only do that
+ * if we are backed by hugetlb.
+ */
+ if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page)) {
struct page *head = compound_head(page);
- unsigned int compshift = compound_order(head);
- unsigned int pteshift;
-
- local_irq_save(flags); /* disables as well */
- pte = find_linux_pte(mm->pgd, cur_ua, NULL, &pteshift);
-
- /* Double check it is still the same pinned page */
- if (pte && pte_page(*pte) == head &&
- pteshift == compshift + PAGE_SHIFT)
- pageshift = max_t(unsigned int, pteshift,
- PAGE_SHIFT);
- local_irq_restore(flags);
+
+ pageshift = compound_order(head) + PAGE_SHIFT;
}
mem->pageshift = min(mem->pageshift, pageshift);
/*
--
2.20.1


2019-01-08 23:28:35

by Andrew Morton

Subject: Re: [PATCH V6 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

On Tue, 8 Jan 2019 10:21:06 +0530 "Aneesh Kumar K.V" <[email protected]> wrote:

> ppc64 uses the CMA area for the allocation of the guest page table (hash page table). We won't
> be able to start a guest if we fail to allocate the hash page table. We have observed
> hash page table allocation failures because we failed to migrate pinned pages out of the
> CMA region. This happens when we are using VFIO: VFIO on ppc64 pins
> the entire guest RAM. If guest RAM pages get allocated out of the CMA region, we
> won't be able to migrate those pages. The pages are also pinned for the lifetime of the
> guest.
>
> Currently we only support migration of non-compound pages. With THP, and with the addition of
> hugetlb migration, we can end up allocating compound pages from the CMA region. This
> patch series adds support for migrating compound pages. The helper
> get_user_pages_cma_migrate() added by this series pins the pages while making sure they are
> migrated out of the CMA region before the reference count is incremented.

Does this code do anything for architectures other than powerpc? If
not, should we be adding the ifdefs to avoid burdening other
architectures with unused code?


2019-01-09 01:56:18

by Andrea Arcangeli

Subject: Re: [PATCH V6 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get

Hello,

On Tue, Jan 08, 2019 at 10:21:09AM +0530, Aneesh Kumar K.V wrote:
> @@ -187,41 +149,25 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> goto unlock_exit;
> }
>
> + ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);

In terms of gup APIs, I've been wondering whether this should become
get_user_pages_longterm(FOLL_CMA_MIGRATE), i.e. moving this
CMA migrate logic inside get_user_pages_longterm.

It depends on whether powerpc will ever need to bail on DAX, and/or whether other
non-powerpc VFIO drivers, which are already bailing on DAX, may also
later optionally need to avoid interfering with CMA.
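
Something like the following hypothetical call shape (FOLL_CMA_MIGRATE does not
exist yet; this is only a sketch of the API I have in mind):

	pinned = get_user_pages_longterm(start, nr_pages,
					 gup_flags | FOLL_CMA_MIGRATE,
					 pages, vmas);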

Aside from the API detail above, this CMA page migration logic seems a
good solution for the problem.

Thanks,
Andrea

2019-01-09 02:40:28

by Andrea Arcangeli

Subject: Re: [PATCH V6 1/4] mm/cma: Add PF flag to force non cma alloc

On Tue, Jan 08, 2019 at 10:21:07AM +0530, Aneesh Kumar K.V wrote:
> This patch adds PF_MEMALLOC_NOCMA, which makes sure any allocation in that context
> is marked non-movable and hence cannot be satisfied from the CMA region.
>
> This is useful with get_user_pages_cma_migrate, where we take a page pin after
> migrating pages out of the CMA region. Marking the section PF_MEMALLOC_NOCMA ensures
> that we avoid unnecessary page migration later.
>
> Suggested-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Andrea Arcangeli <[email protected]>

2019-01-09 08:43:17

by Aneesh Kumar K.V

Subject: Re: [PATCH V6 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

Andrew Morton <[email protected]> writes:

> On Tue, 8 Jan 2019 10:21:06 +0530 "Aneesh Kumar K.V" <[email protected]> wrote:
>
>> ppc64 uses the CMA area for the allocation of the guest page table (hash page table). We won't
>> be able to start a guest if we fail to allocate the hash page table. We have observed
>> hash page table allocation failures because we failed to migrate pinned pages out of the
>> CMA region. This happens when we are using VFIO: VFIO on ppc64 pins
>> the entire guest RAM. If guest RAM pages get allocated out of the CMA region, we
>> won't be able to migrate those pages. The pages are also pinned for the lifetime of the
>> guest.
>>
>> Currently we only support migration of non-compound pages. With THP, and with the addition of
>> hugetlb migration, we can end up allocating compound pages from the CMA region. This
>> patch series adds support for migrating compound pages. The helper
>> get_user_pages_cma_migrate() added by this series pins the pages while making sure they are
>> migrated out of the CMA region before the reference count is incremented.
>
> Does this code do anything for architectures other than powerpc? If
> not, should we be adding the ifdefs to avoid burdening other
> architectures with unused code?

Any architecture enabling CMA may need this. I will move most of this
under CONFIG_CMA.

-aneesh


2019-01-09 08:43:42

by Aneesh Kumar K.V

Subject: Re: [PATCH V6 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get

Andrea Arcangeli <[email protected]> writes:

> Hello,
>
> On Tue, Jan 08, 2019 at 10:21:09AM +0530, Aneesh Kumar K.V wrote:
>> @@ -187,41 +149,25 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>> goto unlock_exit;
>> }
>>
>> + ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);
>
> In terms of gup APIs, I've been wondering whether this should become
> get_user_pages_longterm(FOLL_CMA_MIGRATE), i.e. moving this
> CMA migrate logic inside get_user_pages_longterm.

Do we need the FOLL_CMA_MIGRATE flag? I'm wondering whether a long-term pin
shouldn't already imply a CMA migrate. What is the benefit of that FOLL_CMA_MIGRATE
flag? We can do better by taking a list of pages for migration, and I
guess it is much simpler if we limit that migration logic to
get_user_pages_longterm().

I ended up with something like the diff below. Are you suggesting we should add the
isolate_lru and other details behind a FOLL_CMA_MIGRATE flag and do that when
we take the page reference, instead of iterating the page array in
get_user_pages_longterm as in the diff below?

diff --git a/mm/gup.c b/mm/gup.c
index 05acd7e2eb22..6e8152594e83 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
#include <linux/sched/signal.h>
#include <linux/rwsem.h>
#include <linux/hugetlb.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
+#include <linux/sched/mm.h>

#include <asm/mmu_context.h>
#include <asm/pgtable.h>
@@ -1126,7 +1129,167 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
}
EXPORT_SYMBOL(get_user_pages);

+#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
+
#ifdef CONFIG_FS_DAX
+static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
+{
+ long i;
+ struct vm_area_struct *vma_prev = NULL;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct vm_area_struct *vma = vmas[i];
+
+ if (vma == vma_prev)
+ continue;
+
+ vma_prev = vma;
+
+ if (vma_is_fsdax(vma))
+ return true;
+ }
+ return false;
+}
+#else
+static inline bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
+{
+ return false;
+}
+#endif
+
+#ifdef CONFIG_CMA
+static struct page *new_non_cma_page(struct page *page, unsigned long private)
+{
+ /*
+ * We want to make sure we allocate the new page from the same node
+ * as the source page.
+ */
+ int nid = page_to_nid(page);
+ /*
+ * Trying to allocate a page for migration. Ignore allocation
+ * failure warnings. We don't force __GFP_THISNODE here because
+ * this node here is the node where we have CMA reservation and
+ * in some case these nodes will have really less non movable
+ * allocation memory.
+ */
+ gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
+
+ if (PageHighMem(page))
+ gfp_mask |= __GFP_HIGHMEM;
+
+#ifdef CONFIG_HUGETLB_PAGE
+ if (PageHuge(page)) {
+ struct hstate *h = page_hstate(page);
+ /*
+ * We don't want to dequeue from the pool because pool pages will
+ * mostly be from the CMA region.
+ */
+ return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+ }
+#endif
+ if (PageTransHuge(page)) {
+ struct page *thp;
+ /*
+ * ignore allocation failure warnings
+ */
+ gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
+
+ /*
+ * Remove the movable mask so that we don't allocate from
+ * CMA area again.
+ */
+ thp_gfpmask &= ~__GFP_MOVABLE;
+ thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ }
+
+ return __alloc_pages_node(nid, gfp_mask, 0);
+}
+
+static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
+ unsigned int gup_flags,
+ struct page **pages,
+ struct vm_area_struct **vmas)
+{
+ long i;
+ bool drain_allow = true;
+ bool migrate_allow = true;
+ LIST_HEAD(cma_page_list);
+
+check_again:
+ for (i = 0; i < nr_pages; i++) {
+ /*
+ * If we get a page from the CMA zone, since we are going to
+ * be pinning these entries, we might as well move them out
+ * of the CMA zone if possible.
+ */
+ if (is_migrate_cma_page(pages[i])) {
+
+ struct page *head = compound_head(pages[i]);
+
+ if (PageHuge(head)) {
+ isolate_huge_page(head, &cma_page_list);
+ } else {
+ if (!PageLRU(head) && drain_allow) {
+ lru_add_drain_all();
+ drain_allow = false;
+ }
+
+ if (!isolate_lru_page(head)) {
+ list_add_tail(&head->lru, &cma_page_list);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON +
+ page_is_file_cache(head),
+ hpage_nr_pages(head));
+ }
+ }
+ }
+ }
+
+ if (!list_empty(&cma_page_list)) {
+ /*
+ * drop the above get_user_pages reference.
+ */
+ for (i = 0; i < nr_pages; i++)
+ put_page(pages[i]);
+
+ if (migrate_pages(&cma_page_list, new_non_cma_page,
+ NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+ /*
+ * some of the pages failed migration. Do get_user_pages
+ * without migration.
+ */
+ migrate_allow = false;
+
+ if (!list_empty(&cma_page_list))
+ putback_movable_pages(&cma_page_list);
+ }
+ /*
+ * We did migrate all the pages, Try to get the page references again
+ * migrating any new CMA pages which we failed to isolate earlier.
+ */
+ nr_pages = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+ if ((nr_pages > 0) && migrate_allow) {
+ drain_allow = true;
+ goto check_again;
+ }
+ }
+
+ return nr_pages;
+}
+#else
+static inline long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
+ unsigned int gup_flags,
+ struct page **pages,
+ struct vm_area_struct **vmas)
+{
+ return nr_pages;
+}
+#endif
+
/*
* This is the same as get_user_pages() in that it assumes we are
* operating on the current task's mm, but it goes further to validate
@@ -1140,11 +1303,11 @@ EXPORT_SYMBOL(get_user_pages);
* Contrast this to iov_iter_get_pages() usages which are transient.
*/
long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
- unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas_arg)
+ unsigned int gup_flags, struct page **pages,
+ struct vm_area_struct **vmas_arg)
{
struct vm_area_struct **vmas = vmas_arg;
- struct vm_area_struct *vma_prev = NULL;
+ unsigned long flags;
long rc, i;

if (!pages)
@@ -1157,31 +1320,20 @@ long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
return -ENOMEM;
}

+ flags = memalloc_nocma_save();
rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+ memalloc_nocma_restore(flags);
+ if (rc < 0)
+ goto out;

- for (i = 0; i < rc; i++) {
- struct vm_area_struct *vma = vmas[i];
-
- if (vma == vma_prev)
- continue;
-
- vma_prev = vma;
-
- if (vma_is_fsdax(vma))
- break;
- }
-
- /*
- * Either get_user_pages() failed, or the vma validation
- * succeeded, in either case we don't need to put_page() before
- * returning.
- */
- if (i >= rc)
+ if (check_dax_vmas(vmas, rc)) {
+ for (i = 0; i < rc; i++)
+ put_page(pages[i]);
+ rc = -EOPNOTSUPP;
goto out;
+ }

- for (i = 0; i < rc; i++)
- put_page(pages[i]);
- rc = -EOPNOTSUPP;
+ rc = check_and_migrate_cma_pages(start, rc, gup_flags, pages, vmas);
out:
if (vmas != vmas_arg)
kfree(vmas);


2019-01-11 00:26:27

by David Gibson

Subject: Re: [PATCH V6 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

On Wed, Jan 09, 2019 at 02:11:25PM +0530, Aneesh Kumar K.V wrote:
> Andrew Morton <[email protected]> writes:
>
> > On Tue, 8 Jan 2019 10:21:06 +0530 "Aneesh Kumar K.V" <[email protected]> wrote:
> >
> >> ppc64 uses the CMA area for the allocation of the guest page table (hash page table). We won't
> >> be able to start a guest if we fail to allocate the hash page table. We have observed
> >> hash page table allocation failures because we failed to migrate pinned pages out of the
> >> CMA region. This happens when we are using VFIO: VFIO on ppc64 pins
> >> the entire guest RAM. If guest RAM pages get allocated out of the CMA region, we
> >> won't be able to migrate those pages. The pages are also pinned for the lifetime of the
> >> guest.
> >>
> >> Currently we only support migration of non-compound pages. With THP, and with the addition of
> >> hugetlb migration, we can end up allocating compound pages from the CMA region. This
> >> patch series adds support for migrating compound pages. The helper
> >> get_user_pages_cma_migrate() added by this series pins the pages while making sure they are
> >> migrated out of the CMA region before the reference count is incremented.
> >
> > Does this code do anything for architectures other than powerpc? If
> > not, should we be adding the ifdefs to avoid burdening other
> > architectures with unused code?
>
> Any architecture enabling CMA may need this. I will move most of this below
> CONFIG_CMA.

In theory it could affect any architecture using CMA. I suspect it's
much less likely to bite in practice on architectures other than ppc.
IIUC the main use of CMA there is to allocate things like framebuffers
or other large contiguous blocks used by hardware devices. That's
usually going to happen rarely and during boot-up. What makes ppc
different is that we need a substantial CMA allocation every time we
start a (POWER8) guest, for the HPT. It's the fact that running guests
on a system both means we need the CMA region unfragmented and (with VFIO
added in) can cause CMA fragmentation that makes this particularly
problematic.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

