Here are two more patches for things I found to clean up.
I assume this will go into Jason's tree since there will likely be
more HMM changes in this cycle.
Ralph Campbell (2):
mm/hmm: a few more C style and comment clean ups
mm/hmm: make full use of walk_page_range()
mm/hmm.c | 231 +++++++++++++++++++++----------------------------------
1 file changed, 86 insertions(+), 145 deletions(-)
--
2.20.1
hmm_range_snapshot() and hmm_range_fault() both call find_vma() and
walk_page_range() in a loop. This is unnecessary duplication since
walk_page_range() calls find_vma() in a loop already.
Simplify hmm_range_snapshot() and hmm_range_fault() by defining a
test_walk() callback function to filter unhandled vmas.
Signed-off-by: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
mm/hmm.c | 197 ++++++++++++++++++++-----------------------------------
1 file changed, 70 insertions(+), 127 deletions(-)
diff --git a/mm/hmm.c b/mm/hmm.c
index 8271f110c243..218557451070 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -839,13 +839,40 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
#endif
}
-static void hmm_pfns_clear(struct hmm_range *range,
- uint64_t *pfns,
- unsigned long addr,
- unsigned long end)
+static int hmm_vma_walk_test(unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
{
- for (; addr < end; addr += PAGE_SIZE, pfns++)
- *pfns = range->values[HMM_PFN_NONE];
+ const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
+ struct hmm_vma_walk *hmm_vma_walk = walk->private;
+ struct hmm_range *range = hmm_vma_walk->range;
+ struct vm_area_struct *vma = walk->vma;
+
+ /* If range is no longer valid, force retry. */
+ if (!range->valid)
+ return -EBUSY;
+
+ if (vma->vm_flags & device_vma)
+ return -EFAULT;
+
+ if (is_vm_hugetlb_page(vma)) {
+ if (huge_page_shift(hstate_vma(vma)) != range->page_shift &&
+ range->page_shift != PAGE_SHIFT)
+ return -EINVAL;
+ } else {
+ if (range->page_shift != PAGE_SHIFT)
+ return -EINVAL;
+ }
+
+ /*
+ * If vma does not allow read access, then assume that it does not
+ * allow write access, either. HMM does not support architectures
+ * that allow write without read.
+ */
+ if (!(vma->vm_flags & VM_READ))
+ return -EPERM;
+
+ return 0;
}
/*
@@ -949,65 +976,28 @@ EXPORT_SYMBOL(hmm_range_unregister);
*/
long hmm_range_snapshot(struct hmm_range *range)
{
- const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
- unsigned long start = range->start, end;
- struct hmm_vma_walk hmm_vma_walk;
+ unsigned long start = range->start;
+ struct hmm_vma_walk hmm_vma_walk = {};
struct hmm *hmm = range->hmm;
- struct vm_area_struct *vma;
- struct mm_walk mm_walk;
+ struct mm_walk mm_walk = {};
+ int ret;
lockdep_assert_held(&hmm->mm->mmap_sem);
- do {
- /* If range is no longer valid force retry. */
- if (!range->valid)
- return -EBUSY;
- vma = find_vma(hmm->mm, start);
- if (vma == NULL || (vma->vm_flags & device_vma))
- return -EFAULT;
-
- if (is_vm_hugetlb_page(vma)) {
- if (huge_page_shift(hstate_vma(vma)) !=
- range->page_shift &&
- range->page_shift != PAGE_SHIFT)
- return -EINVAL;
- } else {
- if (range->page_shift != PAGE_SHIFT)
- return -EINVAL;
- }
+ hmm_vma_walk.range = range;
+ hmm_vma_walk.last = start;
+ mm_walk.private = &hmm_vma_walk;
- if (!(vma->vm_flags & VM_READ)) {
- /*
- * If vma do not allow read access, then assume that it
- * does not allow write access, either. HMM does not
- * support architecture that allow write without read.
- */
- hmm_pfns_clear(range, range->pfns,
- range->start, range->end);
- return -EPERM;
- }
+ mm_walk.mm = hmm->mm;
+ mm_walk.pud_entry = hmm_vma_walk_pud;
+ mm_walk.pmd_entry = hmm_vma_walk_pmd;
+ mm_walk.pte_hole = hmm_vma_walk_hole;
+ mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
+ mm_walk.test_walk = hmm_vma_walk_test;
- range->vma = vma;
- hmm_vma_walk.pgmap = NULL;
- hmm_vma_walk.last = start;
- hmm_vma_walk.fault = false;
- hmm_vma_walk.range = range;
- mm_walk.private = &hmm_vma_walk;
- end = min(range->end, vma->vm_end);
-
- mm_walk.vma = vma;
- mm_walk.mm = vma->vm_mm;
- mm_walk.pte_entry = NULL;
- mm_walk.test_walk = NULL;
- mm_walk.hugetlb_entry = NULL;
- mm_walk.pud_entry = hmm_vma_walk_pud;
- mm_walk.pmd_entry = hmm_vma_walk_pmd;
- mm_walk.pte_hole = hmm_vma_walk_hole;
- mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
-
- walk_page_range(start, end, &mm_walk);
- start = end;
- } while (start < range->end);
+ ret = walk_page_range(start, range->end, &mm_walk);
+ if (ret)
+ return ret;
return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
}
@@ -1043,83 +1033,36 @@ EXPORT_SYMBOL(hmm_range_snapshot);
*/
long hmm_range_fault(struct hmm_range *range, bool block)
{
- const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
- unsigned long start = range->start, end;
- struct hmm_vma_walk hmm_vma_walk;
+ unsigned long start = range->start;
+ struct hmm_vma_walk hmm_vma_walk = {};
struct hmm *hmm = range->hmm;
- struct vm_area_struct *vma;
- struct mm_walk mm_walk;
+ struct mm_walk mm_walk = {};
int ret;
lockdep_assert_held(&hmm->mm->mmap_sem);
- do {
- /* If range is no longer valid force retry. */
- if (!range->valid)
- return -EBUSY;
+ hmm_vma_walk.range = range;
+ hmm_vma_walk.last = start;
+ hmm_vma_walk.fault = true;
+ hmm_vma_walk.block = block;
+ mm_walk.private = &hmm_vma_walk;
- vma = find_vma(hmm->mm, start);
- if (vma == NULL || (vma->vm_flags & device_vma))
- return -EFAULT;
-
- if (is_vm_hugetlb_page(vma)) {
- if (huge_page_shift(hstate_vma(vma)) !=
- range->page_shift &&
- range->page_shift != PAGE_SHIFT)
- return -EINVAL;
- } else {
- if (range->page_shift != PAGE_SHIFT)
- return -EINVAL;
- }
+ mm_walk.mm = hmm->mm;
+ mm_walk.pud_entry = hmm_vma_walk_pud;
+ mm_walk.pmd_entry = hmm_vma_walk_pmd;
+ mm_walk.pte_hole = hmm_vma_walk_hole;
+ mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
+ mm_walk.test_walk = hmm_vma_walk_test;
- if (!(vma->vm_flags & VM_READ)) {
- /*
- * If vma do not allow read access, then assume that it
- * does not allow write access, either. HMM does not
- * support architecture that allow write without read.
- */
- hmm_pfns_clear(range, range->pfns,
- range->start, range->end);
- return -EPERM;
- }
+ do {
+ ret = walk_page_range(start, range->end, &mm_walk);
+ start = hmm_vma_walk.last;
- range->vma = vma;
- hmm_vma_walk.pgmap = NULL;
- hmm_vma_walk.last = start;
- hmm_vma_walk.fault = true;
- hmm_vma_walk.block = block;
- hmm_vma_walk.range = range;
- mm_walk.private = &hmm_vma_walk;
- end = min(range->end, vma->vm_end);
-
- mm_walk.vma = vma;
- mm_walk.mm = vma->vm_mm;
- mm_walk.pte_entry = NULL;
- mm_walk.test_walk = NULL;
- mm_walk.hugetlb_entry = NULL;
- mm_walk.pud_entry = hmm_vma_walk_pud;
- mm_walk.pmd_entry = hmm_vma_walk_pmd;
- mm_walk.pte_hole = hmm_vma_walk_hole;
- mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
-
- do {
- ret = walk_page_range(start, end, &mm_walk);
- start = hmm_vma_walk.last;
-
- /* Keep trying while the range is valid. */
- } while (ret == -EBUSY && range->valid);
-
- if (ret) {
- unsigned long i;
-
- i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
- hmm_pfns_clear(range, &range->pfns[i],
- hmm_vma_walk.last, range->end);
- return ret;
- }
- start = end;
+ /* Keep trying while the range is valid. */
+ } while (ret == -EBUSY && range->valid);
- } while (start < range->end);
+ if (ret)
+ return ret;
return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
}
--
2.20.1
A few more comment fixes and minor coding style cleanups.
There should be no functional changes.
Signed-off-by: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
mm/hmm.c | 34 ++++++++++++++++------------------
1 file changed, 16 insertions(+), 18 deletions(-)
diff --git a/mm/hmm.c b/mm/hmm.c
index b810a4fa3de9..8271f110c243 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -32,7 +32,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
* hmm_get_or_create - register HMM against an mm (HMM internal)
*
* @mm: mm struct to attach to
- * Returns: returns an HMM object, either by referencing the existing
+ * Return: an HMM object, either by referencing the existing
* (per-process) object, or by creating a new one.
*
* This is not intended to be used directly by device drivers. If mm already
@@ -323,8 +323,8 @@ static int hmm_pfns_bad(unsigned long addr,
}
/*
- * hmm_vma_walk_hole() - handle a range lacking valid pmd or pte(s)
- * @start: range virtual start address (inclusive)
+ * hmm_vma_walk_hole_() - handle a range lacking valid pmd or pte(s)
+ * @addr: range virtual start address (inclusive)
* @end: range virtual end address (exclusive)
* @fault: should we fault or not ?
* @write_fault: write fault ?
@@ -374,9 +374,9 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
/*
* So we not only consider the individual per page request we also
* consider the default flags requested for the range. The API can
- * be use in 2 fashions. The first one where the HMM user coalesce
- * multiple page fault into one request and set flags per pfns for
- * of those faults. The second one where the HMM user want to pre-
+ * be used 2 ways. The first one where the HMM user coalesces
+ * multiple page faults into one request and sets flags per pfn for
+ * those faults. The second one where the HMM user wants to pre-
* fault a range with specific flags. For the latter one it is a
* waste to have the user pre-fill the pfn arrays with a default
* flags value.
@@ -386,7 +386,7 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
/* We aren't ask to do anything ... */
if (!(pfns & range->flags[HMM_PFN_VALID]))
return;
- /* If this is device memory than only fault if explicitly requested */
+ /* If this is device memory then only fault if explicitly requested */
if ((cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
/* Do we fault on device memory ? */
if (pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) {
@@ -500,7 +500,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
hmm_vma_walk->last = end;
return 0;
#else
- /* If THP is not enabled then we should never reach that code ! */
+ /* If THP is not enabled then we should never reach this code ! */
return -EINVAL;
#endif
}
@@ -624,13 +624,12 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
pte_t *ptep;
pmd_t pmd;
-
again:
pmd = READ_ONCE(*pmdp);
if (pmd_none(pmd))
return hmm_vma_walk_hole(start, end, walk);
- if (pmd_huge(pmd) && (range->vma->vm_flags & VM_HUGETLB))
+ if (pmd_huge(pmd) && is_vm_hugetlb_page(vma))
return hmm_pfns_bad(start, end, walk);
if (thp_migration_supported() && is_pmd_migration_entry(pmd)) {
@@ -655,11 +654,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
/*
- * No need to take pmd_lock here, even if some other threads
+ * No need to take pmd_lock here, even if some other thread
* is splitting the huge pmd we will get that event through
* mmu_notifier callback.
*
- * So just read pmd value and check again its a transparent
+ * So just read pmd value and check again it's a transparent
* huge or device mapping one and compute corresponding pfn
* values.
*/
@@ -673,7 +672,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
}
/*
- * We have handled all the valid case above ie either none, migration,
+ * We have handled all the valid cases above ie either none, migration,
* huge or transparent huge. At this point either it is a valid pmd
* entry pointing to pte directory or it is a bad pmd that will not
* recover.
@@ -793,10 +792,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
pte_t entry;
int ret = 0;
- size = 1UL << huge_page_shift(h);
+ size = huge_page_size(h);
mask = size - 1;
if (range->page_shift != PAGE_SHIFT) {
- /* Make sure we are looking at full page. */
+ /* Make sure we are looking at a full page. */
if (start & mask)
return -EINVAL;
if (end < (start + size))
@@ -807,8 +806,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
size = PAGE_SIZE;
}
-
- ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
+ ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
entry = huge_ptep_get(pte);
i = (start - range->start) >> range->page_shift;
@@ -857,7 +855,7 @@ static void hmm_pfns_clear(struct hmm_range *range,
* @start: start virtual address (inclusive)
* @end: end virtual address (exclusive)
* @page_shift: expect page shift for the range
- * Returns 0 on success, -EFAULT if the address space is no longer valid
+ * Return: 0 on success, -EFAULT if the address space is no longer valid
*
* Track updates to the CPU page table see include/linux/hmm.h
*/
--
2.20.1
On Tue, Jul 23, 2019 at 04:30:16PM -0700, Ralph Campbell wrote:
> hmm_range_snapshot() and hmm_range_fault() both call find_vma() and
> walk_page_range() in a loop. This is unnecessary duplication since
> walk_page_range() calls find_vma() in a loop already.
> Simplify hmm_range_snapshot() and hmm_range_fault() by defining a
> test_walk() callback function to filter unhandled vmas.
I like the approach a lot!
But we really need to sort out the duplication between hmm_range_fault
and hmm_range_snapshot first, as they are basically the same code. I
have patches here:
http://git.infradead.org/users/hch/misc.git/commitdiff/a34ccd30ee8a8a3111d9e91711c12901ed7dea74
http://git.infradead.org/users/hch/misc.git/commitdiff/81f442ebac7170815af7770a1efa9c4ab662137e
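The rough idea in both is to fold the two entry points into one common
walk helper, something along these lines (hand-waved sketch with a made-up
helper name, not necessarily what the commits above end up with):

static long hmm_range_do_walk(struct hmm_range *range, bool fault, bool block)
{
        struct hmm_vma_walk hmm_vma_walk = {
                .range = range,
                .last = range->start,
                .fault = fault,
                .block = block,
        };
        struct mm_walk mm_walk = {
                .mm = range->hmm->mm,
                .private = &hmm_vma_walk,
                .pud_entry = hmm_vma_walk_pud,
                .pmd_entry = hmm_vma_walk_pmd,
                .pte_hole = hmm_vma_walk_hole,
                .hugetlb_entry = hmm_vma_walk_hugetlb_entry,
                .test_walk = hmm_vma_walk_test,
        };
        int ret;

        lockdep_assert_held(&range->hmm->mm->mmap_sem);

        do {
                ret = walk_page_range(hmm_vma_walk.last, range->end, &mm_walk);
                /* retrying on -EBUSY only really matters when faulting */
        } while (ret == -EBUSY && range->valid);

        if (ret)
                return ret;
        return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
}

with hmm_range_snapshot() calling hmm_range_do_walk(range, false, false) and
hmm_range_fault() calling hmm_range_do_walk(range, true, block).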
That being said we don't really have any users for the snapshot mode
or non-blocking faults, and I don't see any in the immediate pipeline
either. It might actually be a better idea to just kill that stuff off
for now until we have a user, as code without users is by definition
untested and will just bitrot and break.
> + const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
> + struct hmm_vma_walk *hmm_vma_walk = walk->private;
> + struct hmm_range *range = hmm_vma_walk->range;
> + struct vm_area_struct *vma = walk->vma;
> +
> + /* If range is no longer valid, force retry. */
> + if (!range->valid)
> + return -EBUSY;
> +
> + if (vma->vm_flags & device_vma)
Can we just kill off this odd device_vma variable?
if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))
and maybe add a comment on why we are skipping them (because they
don't have struct page backing I guess..)
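i.e. something like this (untested, and the comment wording is just my
guess from above):

        /*
         * Skip vmas we can't handle - these generally have no struct page
         * backing the mapping.
         */
        if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))
                return -EFAULT;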
On Wed, Jul 24, 2019 at 08:51:46AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 23, 2019 at 04:30:16PM -0700, Ralph Campbell wrote:
> > hmm_range_snapshot() and hmm_range_fault() both call find_vma() and
> > walk_page_range() in a loop. This is unnecessary duplication since
> > walk_page_range() calls find_vma() in a loop already.
> > Simplify hmm_range_snapshot() and hmm_range_fault() by defining a
> > test_walk() callback function to filter unhandled vmas.
>
> I like the approach a lot!
>
> But we really need to sort out the duplication between hmm_range_fault
> and hmm_range_snapshot first, as they are basically the same code. I
> have patches here:
>
> http://git.infradead.org/users/hch/misc.git/commitdiff/a34ccd30ee8a8a3111d9e91711c12901ed7dea74
>
> http://git.infradead.org/users/hch/misc.git/commitdiff/81f442ebac7170815af7770a1efa9c4ab662137e
Yeah, that is a straightforward improvement, maybe Ralph should grab
these two as part of his series?
> That being said we don't really have any users for the snapshot mode
> or non-blocking faults, and I don't see any in the immediate pipeline
> either.
If this code was production ready I'd use it in ODP right away.
When we first create an ODP MR we'd want to snapshot to pre-load the
NIC tables with something other than page fault, but not fault
anything.
This would be a big performance improvement for ODP.
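Very roughly, the MR creation path would do something like the sketch
below (hand-waving the hmm_range setup/registration and all the NIC
specific bits, just using hmm_range_snapshot() as in this series, with mm
being the mirrored process mm):

        long ret;

        down_read(&mm->mmap_sem);
        ret = hmm_range_snapshot(&range);
        up_read(&mm->mmap_sem);
        if (ret < 0)
                return ret;     /* e.g. -EBUSY if the range got invalidated */

        /*
         * range.pfns[] now holds one entry per page in the range; translate
         * those into NIC page table entries instead of starting every page
         * in the faulting state. No CPU page faults were triggered.
         */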
Jason
On 7/24/19 4:53 AM, Jason Gunthorpe wrote:
> On Wed, Jul 24, 2019 at 08:51:46AM +0200, Christoph Hellwig wrote:
>> On Tue, Jul 23, 2019 at 04:30:16PM -0700, Ralph Campbell wrote:
>>> hmm_range_snapshot() and hmm_range_fault() both call find_vma() and
>>> walk_page_range() in a loop. This is unnecessary duplication since
>>> walk_page_range() calls find_vma() in a loop already.
>>> Simplify hmm_range_snapshot() and hmm_range_fault() by defining a
>>> test_walk() callback function to filter unhandled vmas.
>>
>> I like the approach a lot!
>>
>> But we really need to sort out the duplication between hmm_range_fault
>> and hmm_range_snapshot first, as they are basically the same code. I
>> have patches here:
>>
>> http://git.infradead.org/users/hch/misc.git/commitdiff/a34ccd30ee8a8a3111d9e91711c12901ed7dea74
>>
>> http://git.infradead.org/users/hch/misc.git/commitdiff/81f442ebac7170815af7770a1efa9c4ab662137e
>
> Yeah, that is a straightforward improvement, maybe Ralph should grab
> these two as part of his series?
Sure, no problem.
I'll add them in v2 when I fix the other issues in the series.
>> That being said we don't really have any users for the snapshot mode
>> or non-blocking faults, and I don't see any in the immediate pipeline
>> either.
>
> If this code was production ready I'd use it in ODP right away.
>
> When we first create an ODP MR we'd want to snapshot to pre-load the
> NIC tables with something other than page fault, but not fault
> anything.
>
> This would be a big performance improvement for ODP.
>
> Jason
>