2020-09-08 15:37:05

by Ming Mao

[permalink] [raw]
Subject: [PATCH V4 0/2] vfio: optimized for hugetlbf pages when dma map/unmap

This series deletes the for loop in dma_map/unmap for hugetlb pages.
In the original process, the for loop could spend much time to check all
normal pages.If we use hugetlb pages, it is not necessary to do this.

Changes from v3
- add a new API unpin_user_hugetlb_pages_dirty_lock()
- use the new API to unpin hugetlb pages

Ming Mao (2):
vfio dma_map/unmap: optimized for hugetlbfs pages
vfio: optimized for unpinning pages

drivers/vfio/vfio_iommu_type1.c | 373 ++++++++++++++++++++++++++++++--
include/linux/mm.h | 3 +
mm/gup.c | 91 ++++++++
3 files changed, 450 insertions(+), 17 deletions(-)

--
2.23.0



2020-09-08 15:38:09

by Ming Mao

[permalink] [raw]
Subject: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

In the original process of dma_map/unmap pages for VFIO-devices,
to make sure the pages are contiguous, we have to check them one by one.
As a result, dma_map/unmap could spend a long time.
Using the hugetlb pages, we can avoid this problem.
All pages in hugetlb pages are contiguous.And the hugetlb
page should not be split.So we can delete the for loops.

According to the suggestions of Peter Xu,
we should use the API unpin_user_pages_dirty_lock() to unpin hugetlb pages.
And the pages are unpinned one by one in this API.
So it is better to optimize the API.
In this patch, we do not optimize the process of unpinning.
We will do this in another patch.

Signed-off-by: Ming Mao <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 289 +++++++++++++++++++++++++++++++-
1 file changed, 281 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5e556ac91..8c1dc5136 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -479,6 +479,222 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
return ret;
}

+static bool is_hugetlb_page(unsigned long pfn)
+{
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ return false;
+
+ page = pfn_to_page(pfn);
+ /* only check for hugetlb pages */
+ return page && PageHuge(page);
+}
+
+static bool vaddr_is_hugetlb_page(unsigned long vaddr, int prot)
+{
+ unsigned long pfn;
+ int ret;
+ bool result;
+
+ if (!current->mm)
+ return false;
+
+ ret = vaddr_get_pfn(current->mm, vaddr, prot, &pfn);
+ if (ret)
+ return false;
+
+ result = is_hugetlb_page(pfn);
+
+ put_pfn(pfn, prot);
+
+ return result;
+}
+
+/*
+ * get the number of residual PAGE_SIZE-pages in a hugetlb page
+ * (including the page which pointed by this address)
+ * @address: we count residual pages from this address to the end of
+ * a hugetlb page
+ * @order: the order of the same hugetlb page
+ */
+static long
+hugetlb_get_residual_pages(unsigned long address, unsigned int order)
+{
+ unsigned long hugetlb_npage;
+ unsigned long hugetlb_mask;
+
+ if (!order)
+ return -EINVAL;
+
+ hugetlb_npage = 1UL << order;
+ hugetlb_mask = hugetlb_npage - 1;
+ address = address >> PAGE_SHIFT;
+
+ /*
+ * Since we count the page pointed by this address, the number of
+ * residual PAGE_SIZE-pages is greater than or equal to 1.
+ */
+ return hugetlb_npage - (address & hugetlb_mask);
+}
+
+static unsigned int
+hugetlb_page_get_externally_pinned_num(struct vfio_dma *dma,
+ unsigned long start,
+ unsigned long npage)
+{
+ struct vfio_pfn *vpfn;
+ struct rb_node *node;
+ unsigned long end;
+ unsigned int num = 0;
+
+ if (!dma || !npage)
+ return 0;
+
+ end = start + npage - 1;
+ /* If we find a page in dma->pfn_list, this page has been pinned externally */
+ for (node = rb_first(&dma->pfn_list); node; node = rb_next(node)) {
+ vpfn = rb_entry(node, struct vfio_pfn, node);
+ if ((vpfn->pfn >= start) && (vpfn->pfn <= end))
+ num++;
+ }
+
+ return num;
+}
+
+static long hugetlb_page_vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+ int prot, long npage, unsigned long pfn)
+{
+ long hugetlb_residual_npage;
+ struct page *head;
+ int ret = 0;
+ unsigned int contiguous_npage;
+ struct page **pages = NULL;
+ unsigned int flags = 0;
+
+ if ((npage < 0) || !pfn_valid(pfn))
+ return -EINVAL;
+
+ /* all pages are done? */
+ if (!npage)
+ goto out;
+ /*
+ * Since pfn is valid,
+ * hugetlb_residual_npage is greater than or equal to 1.
+ */
+ head = compound_head(pfn_to_page(pfn));
+ hugetlb_residual_npage = hugetlb_get_residual_pages(vaddr,
+ compound_order(head));
+ /* The page of vaddr has been gotten by vaddr_get_pfn */
+ contiguous_npage = min_t(long, (hugetlb_residual_npage - 1), npage);
+ /* There is on page left in this hugetlb page. */
+ if (!contiguous_npage)
+ goto out;
+
+ pages = kvmalloc_array(contiguous_npage, sizeof(struct page *), GFP_KERNEL);
+ if (!pages)
+ return -ENOMEM;
+
+ if (prot & IOMMU_WRITE)
+ flags |= FOLL_WRITE;
+
+ mmap_read_lock(mm);
+ /* The number of pages pinned may be less than contiguous_npage */
+ ret = pin_user_pages_remote(NULL, mm, vaddr + PAGE_SIZE, contiguous_npage,
+ flags | FOLL_LONGTERM, pages, NULL, NULL);
+ mmap_read_unlock(mm);
+out:
+ if (pages)
+ kvfree(pages);
+ return ret;
+}
+
+static long vfio_pin_hugetlb_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
+ long npage, unsigned long *pfn_base,
+ unsigned long limit)
+{
+ unsigned long pfn = 0;
+ long ret, pinned = 0, lock_acct = 0;
+ dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
+ long pinned_loop, i;
+
+ /* This code path is only user initiated */
+ if (!current->mm)
+ return -ENODEV;
+
+ ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
+ if (ret)
+ return ret;
+
+ pinned++;
+ /*
+ * Since PG_reserved is not relevant for compound pages
+ * and the pfn of PAGE_SIZE-page which in hugetlb pages is valid,
+ * it is not necessary to check rsvd for hugetlb pages.
+ */
+ if (!vfio_find_vpfn(dma, iova)) {
+ if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
+ put_pfn(*pfn_base, dma->prot);
+ pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+ limit << PAGE_SHIFT);
+ return -ENOMEM;
+ }
+ lock_acct++;
+ }
+
+ /* Lock all the consecutive pages from pfn_base */
+ for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
+ pinned += pinned_loop, vaddr += pinned_loop * PAGE_SIZE,
+ iova += pinned_loop * PAGE_SIZE) {
+ ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
+ if (ret)
+ break;
+
+ if (pfn != *pfn_base + pinned ||
+ !is_hugetlb_page(pfn)) {
+ put_pfn(pfn, dma->prot);
+ break;
+ }
+
+ pinned_loop = 1;
+ /*
+ * It is possible that the page of vaddr is the last PAGE_SIZE-page.
+ * In this case, vaddr + PAGE_SIZE might be another hugetlb page.
+ */
+ ret = hugetlb_page_vaddr_get_pfn(current->mm, vaddr, dma->prot,
+ npage - pinned - pinned_loop, pfn);
+ if (ret < 0) {
+ put_pfn(pfn, dma->prot);
+ break;
+ }
+
+ pinned_loop += ret;
+ lock_acct += pinned_loop - hugetlb_page_get_externally_pinned_num(dma,
+ pfn, pinned_loop);
+
+ if (!dma->lock_cap &&
+ current->mm->locked_vm + lock_acct > limit) {
+ for (i = 0; i < pinned_loop; i++)
+ put_pfn(pfn++, dma->prot);
+ pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
+ __func__, limit << PAGE_SHIFT);
+ ret = -ENOMEM;
+ goto unpin_out;
+ }
+ }
+
+ ret = vfio_lock_acct(dma, lock_acct, false);
+
+unpin_out:
+ if (ret) {
+ for (pfn = *pfn_base ; pinned ; pfn++, pinned--)
+ put_pfn(pfn, dma->prot);
+ return ret;
+ }
+
+ return pinned;
+}
+
/*
* Attempt to pin pages. We really don't want to track all the pfns and
* the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -858,6 +1074,57 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
return unmapped;
}

+static size_t get_contiguous_pages(struct vfio_domain *domain, dma_addr_t start,
+ dma_addr_t end, phys_addr_t phys_base)
+{
+ size_t len;
+ phys_addr_t next;
+
+ if (!domain)
+ return 0;
+
+ for (len = PAGE_SIZE;
+ !domain->fgsp && start + len < end; len += PAGE_SIZE) {
+ next = iommu_iova_to_phys(domain->domain, start + len);
+ if (next != phys_base + len)
+ break;
+ }
+
+ return len;
+}
+
+static size_t hugetlb_get_contiguous_pages(struct vfio_domain *domain, dma_addr_t start,
+ dma_addr_t end, phys_addr_t phys_base)
+{
+ size_t len;
+ phys_addr_t next;
+ unsigned long contiguous_npage;
+ dma_addr_t max_len;
+ unsigned long hugetlb_residual_npage;
+ struct page *head;
+ unsigned long limit;
+
+ if (!domain)
+ return 0;
+
+ max_len = end - start;
+ for (len = PAGE_SIZE;
+ !domain->fgsp && start + len < end; len += contiguous_npage * PAGE_SIZE) {
+ next = iommu_iova_to_phys(domain->domain, start + len);
+ if ((next != phys_base + len) ||
+ !is_hugetlb_page(next >> PAGE_SHIFT))
+ break;
+
+ head = compound_head(pfn_to_page(next >> PAGE_SHIFT));
+ hugetlb_residual_npage = hugetlb_get_residual_pages(start + len,
+ compound_order(head));
+ limit = ALIGN((max_len - len), PAGE_SIZE) >> PAGE_SHIFT;
+ contiguous_npage = min_t(unsigned long, hugetlb_residual_npage, limit);
+ }
+
+ return len;
+}
+
static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
bool do_accounting)
{
@@ -892,7 +1159,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
iommu_iotlb_gather_init(&iotlb_gather);
while (iova < end) {
size_t unmapped, len;
- phys_addr_t phys, next;
+ phys_addr_t phys;

phys = iommu_iova_to_phys(domain->domain, iova);
if (WARN_ON(!phys)) {
@@ -905,12 +1172,10 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
* may require hardware cache flushing, try to find the
* largest contiguous physical memory chunk to unmap.
*/
- for (len = PAGE_SIZE;
- !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
- next = iommu_iova_to_phys(domain->domain, iova + len);
- if (next != phys + len)
- break;
- }
+ if (is_hugetlb_page(phys >> PAGE_SHIFT))
+ len = hugetlb_get_contiguous_pages(domain, iova, end, phys);
+ else
+ len = get_contiguous_pages(domain, iova, end, phys);

/*
* First, try to use fast unmap/unpin. In case of failure,
@@ -1243,7 +1508,15 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,

while (size) {
/* Pin a contiguous chunk of memory */
- npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
+ if (vaddr_is_hugetlb_page(vaddr + dma->size, dma->prot)) {
+ npage = vfio_pin_hugetlb_pages_remote(dma, vaddr + dma->size,
+ size >> PAGE_SHIFT, &pfn, limit);
+ /* try the normal page if failed */
+ if (npage <= 0)
+ npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
+ size >> PAGE_SHIFT, &pfn, limit);
+ } else
+ npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
size >> PAGE_SHIFT, &pfn, limit);
if (npage <= 0) {
WARN_ON(!npage);
--
2.23.0


2020-09-08 20:01:21

by Ming Mao

[permalink] [raw]
Subject: [PATCH V4 2/2] vfio: optimized for unpinning pages

The pages are unpinned one by one in unpin_user_pages_dirty_lock().
We add a new API unpin_user_hugetlb_pages_dirty_lock() which deletes
the for loop to optimize this.
If we want to unpin the hugetlb pages, all work can be done by a single
operation to the head page in this API.

Signed-off-by: Ming Mao <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 90 +++++++++++++++++++++++++++-----
include/linux/mm.h | 3 ++
mm/gup.c | 91 +++++++++++++++++++++++++++++++++
3 files changed, 172 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 8c1dc5136..44fc5f16c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -609,6 +609,26 @@ static long hugetlb_page_vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr
return ret;
}

+/*
+ * put pfns for a hugetlb page
+ * @start: the PAGE_SIZE-page we start to put,can be any page in this hugetlb page
+ * @npages: the number of PAGE_SIZE-pages to put
+ * @prot: IOMMU_READ/WRITE
+ */
+static int hugetlb_put_pfn(unsigned long start, unsigned long npages, int prot)
+{
+ struct page *page;
+
+ if (!pfn_valid(start))
+ return -EFAULT;
+
+ page = pfn_to_page(start);
+ if (!page || !PageHuge(page))
+ return -EINVAL;
+
+ return unpin_user_hugetlb_pages_dirty_lock(page, npages, prot & IOMMU_WRITE);
+}
+
static long vfio_pin_hugetlb_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
long npage, unsigned long *pfn_base,
unsigned long limit)
@@ -616,7 +636,7 @@ static long vfio_pin_hugetlb_pages_remote(struct vfio_dma *dma, unsigned long va
unsigned long pfn = 0;
long ret, pinned = 0, lock_acct = 0;
dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
- long pinned_loop, i;
+ long pinned_loop;

/* This code path is only user initiated */
if (!current->mm)
@@ -674,8 +694,7 @@ static long vfio_pin_hugetlb_pages_remote(struct vfio_dma *dma, unsigned long va

if (!dma->lock_cap &&
current->mm->locked_vm + lock_acct > limit) {
- for (i = 0; i < pinned_loop; i++)
- put_pfn(pfn++, dma->prot);
+ hugetlb_put_pfn(pfn, pinned_loop, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
ret = -ENOMEM;
@@ -695,6 +714,40 @@ static long vfio_pin_hugetlb_pages_remote(struct vfio_dma *dma, unsigned long va
return pinned;
}

+static long vfio_unpin_hugetlb_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
+ unsigned long pfn, long npage,
+ bool do_accounting)
+{
+ long unlocked = 0, locked = 0;
+ long i, unpinned;
+
+ for (i = 0; i < npage; i += unpinned, iova += unpinned * PAGE_SIZE) {
+ if (!is_hugetlb_page(pfn))
+ goto slow_path;
+
+ unpinned = hugetlb_put_pfn(pfn, npage - i, dma->prot);
+ if (unpinned > 0) {
+ pfn += unpinned;
+ unlocked += unpinned;
+ locked += hugetlb_page_get_externally_pinned_num(dma, pfn, unpinned);
+ } else
+ goto slow_path;
+ }
+slow_path:
+ for (; i < npage; i++, iova += PAGE_SIZE) {
+ if (put_pfn(pfn++, dma->prot)) {
+ unlocked++;
+ if (vfio_find_vpfn(dma, iova))
+ locked++;
+ }
+ }
+
+ if (do_accounting)
+ vfio_lock_acct(dma, locked - unlocked, true);
+
+ return unlocked;
+}
+
/*
* Attempt to pin pages. We really don't want to track all the pfns and
* the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -993,11 +1046,18 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
iommu_tlb_sync(domain->domain, iotlb_gather);

list_for_each_entry_safe(entry, next, regions, list) {
- unlocked += vfio_unpin_pages_remote(dma,
- entry->iova,
- entry->phys >> PAGE_SHIFT,
- entry->len >> PAGE_SHIFT,
- false);
+ if (is_hugetlb_page(entry->phys >> PAGE_SHIFT))
+ unlocked += vfio_unpin_hugetlb_pages_remote(dma,
+ entry->iova,
+ entry->phys >> PAGE_SHIFT,
+ entry->len >> PAGE_SHIFT,
+ false);
+ else
+ unlocked += vfio_unpin_pages_remote(dma,
+ entry->iova,
+ entry->phys >> PAGE_SHIFT,
+ entry->len >> PAGE_SHIFT,
+ false);
list_del(&entry->list);
kfree(entry);
}
@@ -1064,10 +1124,16 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
size_t unmapped = iommu_unmap(domain->domain, *iova, len);

if (unmapped) {
- *unlocked += vfio_unpin_pages_remote(dma, *iova,
- phys >> PAGE_SHIFT,
- unmapped >> PAGE_SHIFT,
- false);
+ if (is_hugetlb_page(phys >> PAGE_SHIFT))
+ *unlocked += vfio_unpin_hugetlb_pages_remote(dma, *iova,
+ phys >> PAGE_SHIFT,
+ unmapped >> PAGE_SHIFT,
+ false);
+ else
+ *unlocked += vfio_unpin_pages_remote(dma, *iova,
+ phys >> PAGE_SHIFT,
+ unmapped >> PAGE_SHIFT,
+ false);
*iova += unmapped;
cond_resched();
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310..a425135d0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1202,8 +1202,11 @@ static inline void put_page(struct page *page)
#define GUP_PIN_COUNTING_BIAS (1U << 10)

void unpin_user_page(struct page *page);
+int unpin_user_hugetlb_page(struct page *page, unsigned long npages);
void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
bool make_dirty);
+int unpin_user_hugetlb_pages_dirty_lock(struct page *pages, unsigned long npages,
+ bool make_dirty);
void unpin_user_pages(struct page **pages, unsigned long npages);

/**
diff --git a/mm/gup.c b/mm/gup.c
index 6f47697f8..14ee321eb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -205,11 +205,42 @@ static bool __unpin_devmap_managed_user_page(struct page *page)

return true;
}
+
+static bool __unpin_devmap_managed_user_hugetlb_page(struct page *page, unsigned long npages)
+{
+ int count;
+ struct page *head = compound_head(page);
+
+ if (!page_is_devmap_managed(head))
+ return false;
+
+ hpage_pincount_sub(head, npages);
+
+ count = page_ref_sub_return(head, npages);
+
+ mod_node_page_state(page_pgdat(head), NR_FOLL_PIN_RELEASED, npages);
+ /*
+ * devmap page refcounts are 1-based, rather than 0-based: if
+ * refcount is 1, then the page is free and the refcount is
+ * stable because nobody holds a reference on the page.
+ */
+ if (count == 1)
+ free_devmap_managed_page(head);
+ else if (!count)
+ __put_page(head);
+
+ return true;
+}
#else
static bool __unpin_devmap_managed_user_page(struct page *page)
{
return false;
}
+
+static bool __unpin_devmap_managed_user_hugetlb_page(struct page *page, unsigned long npages)
+{
+ return false;
+}
#endif /* CONFIG_DEV_PAGEMAP_OPS */

/**
@@ -248,6 +279,66 @@ void unpin_user_page(struct page *page)
}
EXPORT_SYMBOL(unpin_user_page);

+int unpin_user_hugetlb_page(struct page *page, unsigned long npages)
+{
+ struct page *head;
+ long page_offset, unpinned, hugetlb_npages;
+
+ if (!page || !PageHuge(page))
+ return -EINVAL;
+
+ if (!npages)
+ return 0;
+
+ head = compound_head(page);
+ hugetlb_npages = 1UL << compound_order(head);
+ /* the offset of this subpage in the hugetlb page */
+ page_offset = page_to_pfn(page) & (hugetlb_npages - 1);
+ /* unpinned > 0, because page_offset is always less than hugetlb_npages */
+ unpinned = min_t(long, npages, (hugetlb_npages - page_offset));
+
+ if (__unpin_devmap_managed_user_hugetlb_page(page, unpinned))
+ return unpinned;
+
+ hpage_pincount_sub(head, unpinned);
+
+ if (page_ref_sub_and_test(head, unpinned))
+ __put_page(head);
+ mod_node_page_state(page_pgdat(head), NR_FOLL_PIN_RELEASED, unpinned);
+
+ return unpinned;
+}
+EXPORT_SYMBOL(unpin_user_hugetlb_page);
+
+/*
+ * @page:the first subpage to unpin in a hugetlb page
+ * @npages: number of pages to unpin
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * Nearly the same as unpin_user_pages_dirty_lock
+ * If npages is 0, returns 0.
+ * If npages is >0, returns the number of
+ * pages unpinned. And this may be less than npages.
+ */
+int unpin_user_hugetlb_pages_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+ struct page *head;
+
+ if (!page || !PageHuge(page))
+ return -EINVAL;
+
+ if (!npages)
+ return 0;
+
+ head = compound_head(page);
+ if (make_dirty && !PageDirty(head))
+ set_page_dirty_lock(head);
+
+ return unpin_user_hugetlb_page(page, npages);
+}
+EXPORT_SYMBOL(unpin_user_hugetlb_pages_dirty_lock);
+
/**
* unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
* @pages: array of pages to be maybe marked dirty, and definitely released.
--
2.23.0


2020-09-09 08:03:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

I really don't think this approach is any good. You workaround
a deficiency in the pin_user_pages API in one particular caller for
one particular use case.

I think you'd rather want either:

(1) a FOLL_HUGEPAGE flag for the pin_user_pages API family that returns
a single struct page for any kind of huge page, which would also
benefit all kinds of other users rather than adding these kinds of
hacks to vfio.
(2) add a bvec version of the API that returns a variable size
"extent"

I had started on (2) a while ago, and here is branch with my code (which
is broken and fails test, but might be a start):

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec

But for now I wonder if (1) is the better start, which could still be
reused to (2) later.

2020-09-09 14:21:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

On Wed, Sep 09, 2020 at 09:01:14AM +0100, Christoph Hellwig wrote:
> I really don't think this approach is any good. You workaround
> a deficiency in the pin_user_pages API in one particular caller for
> one particular use case.

RDMA has the same basic issues, this should should not be solved with
workarounds in VFIO - a common API would be good

> I think you'd rather want either:
>
> (1) a FOLL_HUGEPAGE flag for the pin_user_pages API family that returns
> a single struct page for any kind of huge page, which would also
> benefit all kinds of other users rather than adding these kinds of
> hacks to vfio.

How to use? The VMAs can have mixed page sizes so the caller would
have to somehow switch and call twice? Not sure this is faster.

> (2) add a bvec version of the API that returns a variable size
> "extent"

This is the best one, I think.. The IOMMU setup can have multiple page
sizes, so having largest contiguous blocks pre-computed should speed
that up.

vfio should be a win to use a sgl rather than a page list?

Especially if we can also reduce the number of pages pinned by only
pinning head pages..

> I had started on (2) a while ago, and here is branch with my code (which
> is broken and fails test, but might be a start):
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec
>
> But for now I wonder if (1) is the better start, which could still be
> reused to (2) later.

What about some 'pin_user_page_sgl' as a stepping stone?

Switching from that point to bvec seems like a smaller step?

Jason

2020-09-09 16:04:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

On Wed, Sep 09, 2020 at 09:01:14AM +0100, Christoph Hellwig wrote:
> I really don't think this approach is any good. You workaround
> a deficiency in the pin_user_pages API in one particular caller for
> one particular use case.
>
> I think you'd rather want either:
>
> (1) a FOLL_HUGEPAGE flag for the pin_user_pages API family that returns
> a single struct page for any kind of huge page, which would also
> benefit all kinds of other users rather than adding these kinds of
> hacks to vfio.

This seems to be similar to a flag I added last week to
pagecache_get_page() called FGP_HEAD:

+ * * %FGP_HEAD - If the page is present and a THP, return the head page
+ * rather than the exact page specified by the index.

I think "return the head page" is probably what we want from what I
understand of this patch. The caller can figure out the appropriate
bv_offset / bv_len for a bio_vec, if that's what they want to do with it.

http://git.infradead.org/users/willy/pagecache.git/commitdiff/ee88eeeb6b0f35e95ef82b11dfc24dc04c3dcad8 is the exact commit where I added that, but it depends on a number of other patches in this series:
http://git.infradead.org/users/willy/pagecache.git/shortlog

I'm going to send out a subset of patches later today which will include
that one and some others. I haven't touched the GUP paths at all in
that series, but it's certainly going to make THPs (of various sizes)
much more present in the system.

2020-09-09 17:05:16

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

On Wed, Sep 09, 2020 at 09:01:14AM +0100, Christoph Hellwig wrote:
> (1) a FOLL_HUGEPAGE flag for the pin_user_pages API family that returns
> a single struct page for any kind of huge page, which would also
> benefit all kinds of other users rather than adding these kinds of
> hacks to vfio.
> (2) add a bvec version of the API that returns a variable size
> "extent"
>
> I had started on (2) a while ago, and here is branch with my code (which
> is broken and fails test, but might be a start):
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec
>
> But for now I wonder if (1) is the better start, which could still be
> reused to (2) later.

(1) doesn't work. Suppose you have a 16kB MAP_PRIVATE of a file which happens
to have a THP in the page cache. When you write a byte, that base page gets
copied and your bvec for that 16kB needs to look something like:

{ page A, offset 0, length 4096 },
{ page B, offset 0, length 4096 },
{ page A, offset 8192, length 8192 },

You can come up with other scenarios, like partial mappings of THPs with
an adjacent VMA of a different THP, but I think we just need to make
(2) work.

2020-09-09 17:10:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

On Wed, Sep 09, 2020 at 10:05:18AM -0300, Jason Gunthorpe wrote:
> How to use? The VMAs can have mixed page sizes so the caller would
> have to somehow switch and call twice? Not sure this is faster.

We can find out the page size based on the page. Right now it is
rather cumbersome, but one of willys pending series has a nicer helper
for that.

> > (2) add a bvec version of the API that returns a variable size
> > "extent"
>
> This is the best one, I think.. The IOMMU setup can have multiple page
> sizes, so having largest contiguous blocks pre-computed should speed
> that up.
>
> vfio should be a win to use a sgl rather than a page list?
>
> Especially if we can also reduce the number of pages pinned by only
> pinning head pages..

Especially for raw use of the iommu API as in vfio we don't even need
the scatterlist now, we can call ->map on each chunk. For the DMA API
we'd kinda need it, but not for long.

> What about some 'pin_user_page_sgl' as a stepping stone?

I'd rather avoid adding new scatterlist based APIs.

2020-09-09 17:14:02

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

On Wed, Sep 09, 2020 at 03:29:41PM +0100, Christoph Hellwig wrote:
> On Wed, Sep 09, 2020 at 10:05:18AM -0300, Jason Gunthorpe wrote:
> > How to use? The VMAs can have mixed page sizes so the caller would
> > have to somehow switch and call twice? Not sure this is faster.
>
> We can find out the page size based on the page. Right now it is
> rather cumbersome, but one of willys pending series has a nicer helper
> for that.

So, returns a packed array of pinned head page pointers where each
element's length is determined by page_size()?

Some maths from VA figure out the initial page offset and final page
length?

Could this representation work effectively for general DMA mapping
somehow?

Put the array in chained pages like SGL so it can efficiently cover
very large amounts of VA?

Jason

2020-09-09 17:15:23

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages

On Wed, Sep 09, 2020 at 03:29:41PM +0100, Christoph Hellwig wrote:
> On Wed, Sep 09, 2020 at 10:05:18AM -0300, Jason Gunthorpe wrote:
> > How to use? The VMAs can have mixed page sizes so the caller would
> > have to somehow switch and call twice? Not sure this is faster.
>
> We can find out the page size based on the page. Right now it is
> rather cumbersome, but one of willys pending series has a nicer helper
> for that.

Actually already merged. There's page_size() which went into 5.4, and
is the one you'd want to use (also page_shift() and compound_nr()).
The thp_* equivalents (merged in 5.9) compile away to nothing if you
don't have CONFIG_TRANSPARENT_HUGEPAGE enabled, but since there are
many ways of getting a compound page mapped into userspace, page_size()
is the helper to use for VFIO.