When mbind() is called with MPOL_MF_{MOVE|MOVE_ALL} | MPOL_MF_STRICT, the
kernel should attempt to migrate all existing pages and return -EIO if
there is a misplaced or unmovable page. Commit 6f4576e3687b
("mempolicy: apply page table walker on queue_pages_range()") messed up
the return value and no longer broke out of the VMA scan early when
MPOL_MF_STRICT alone was specified. The return value problem was fixed by
commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified"), but that fix made the VMA walk stop early
when an unmovable page is met, so some pages may not be migrated as
expected.
The code should conceptually do:

 if (MPOL_MF_MOVE* alone)
     scan all vmas
     try to migrate the existing pages
     return success
 else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
     scan all vmas
     try to migrate the existing pages
     return -EIO if unmovable or migration failed
 else if (MPOL_MF_STRICT alone)
     break early if an unmovable page is met, and don't call mbind_range() at all
 else /* none of those flags */
     check the ranges in test_walk, return -EFAULT without calling
     mbind_range() if the range is discontiguous.

Fixed the behavior.
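
The flag handling described above can be modelled in a few lines of
userspace C. This is only a sketch of the intended outcomes, not kernel
code: the SK_* flag bits, struct sk_outcome and sk_expected() are
hypothetical names invented for illustration, and the "none of those
flags" range check (-EFAULT on a discontiguous range) is not modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits for this sketch only; not the kernel's definitions. */
#define SK_MF_STRICT   (1u << 0)
#define SK_MF_MOVE     (1u << 1)
#define SK_MF_MOVE_ALL (1u << 2)

struct sk_outcome {
	bool scan_all_vmas;	/* walk continues past a problem page */
	bool try_migrate;	/* existing pages are queued for migration */
	int ret;		/* modelled return value */
};

/*
 * problem: a misplaced page was found that is unmovable or failed to migrate.
 */
static struct sk_outcome sk_expected(unsigned int flags, bool problem)
{
	bool move = flags & (SK_MF_MOVE | SK_MF_MOVE_ALL);
	bool strict = flags & SK_MF_STRICT;
	struct sk_outcome o = { true, false, 0 };

	if (move) {
		/* MOVE*: always walk everything and try to migrate. */
		o.try_migrate = true;
		/* Only report the problem when STRICT was also requested. */
		if (strict && problem)
			o.ret = -EIO;
	} else if (strict && problem) {
		/* STRICT alone: stop at the first problem page. */
		o.scan_all_vmas = false;
		o.ret = -EIO;
	}
	return o;
}
```

For example, sk_expected(SK_MF_MOVE | SK_MF_STRICT, true) models a full
scan that still ends in -EIO, while sk_expected(SK_MF_STRICT, true)
models the early break.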
Cc: Hugh Dickins <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: <[email protected]> v4.9+
Signed-off-by: Yang Shi <[email protected]>
---
mm/mempolicy.c | 39 +++++++++++++++++++--------------------
1 file changed, 19 insertions(+), 20 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f1b00d6ac7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -426,6 +426,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ bool has_unmovable;
};
/*
@@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
/*
* queue_folios_pmd() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. huge zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
* existing folio was already on a node that does not follow the
* policy.
@@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma) ||
migrate_folio_add(folio, qp->pagelist, flags)) {
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
} else
@@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
*
* queue_folios_pte_range() has three possible return values:
* 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page.
- * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * special page is met, i.e. zero page, or unmovable page is found
+ * but continue walking (indicated by queue_pages.has_unmovable).
* -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
* on a node that does not follow the policy.
*/
@@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- bool has_unmovable = false;
pte_t *pte, *mapped_pte;
pte_t ptent;
spinlock_t *ptl;
@@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_folio_required(folio, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /* MPOL_MF_STRICT must be specified if we get here */
- if (!vma_migratable(vma)) {
- has_unmovable = true;
- break;
- }
+ /*
+ * MPOL_MF_STRICT must be specified if we get here.
+ * Continue walking vmas due to MPOL_MF_MOVE* flags.
+ */
+ if (!vma_migratable(vma))
+ qp->has_unmovable = true;
/*
* Do not abort immediately since there may be
@@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
* need migrate other LRU pages.
*/
if (migrate_folio_add(folio, qp->pagelist, flags))
- has_unmovable = true;
+ qp->has_unmovable = true;
} else
break;
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
- if (has_unmovable)
- return 1;
-
return addr != end ? -EIO : 0;
}
@@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Detecting misplaced folio but allow migrating folios which
* have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
goto unlock;
}
@@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Failed to isolate folio but allow migrating pages
* which have been queued.
*/
- ret = 1;
+ qp->has_unmovable = true;
}
unlock:
spin_unlock(ptl);
@@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .has_unmovable = false,
};
const struct mm_walk_ops *ops = lock_vma ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
+ if (qp.has_unmovable)
+ err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
@@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+ if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
err = -EIO;
} else {
up_out:
--
2.39.0
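
The core idea of the patch, recording the unmovable case in the walk's
private state instead of returning a positive value that ends the walk,
can be illustrated with a toy walker. This is a simplified userspace
sketch, not the kernel implementation: struct sk_walk_state,
sk_visit_old(), sk_visit_new() and sk_walk() are hypothetical names, and
the per-page granularity is an assumption made to keep the example small.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the walk's private state (cf. struct queue_pages). */
struct sk_walk_state {
	size_t queued;		/* pages put on the migration list */
	bool has_unmovable;	/* sticky: set once, never cleared */
};

/* Old style: a positive return value reports the problem and ends the walk. */
static int sk_visit_old(bool movable, struct sk_walk_state *st)
{
	if (!movable)
		return 1;
	st->queued++;
	return 0;
}

/* New style: record the problem in the shared state and keep walking. */
static int sk_visit_new(bool movable, struct sk_walk_state *st)
{
	if (!movable) {
		st->has_unmovable = true;
		return 0;
	}
	st->queued++;
	return 0;
}

/* Toy walker: stops at the first non-zero return from the callback. */
static int sk_walk(const bool *movable, size_t n,
		   int (*visit)(bool, struct sk_walk_state *),
		   struct sk_walk_state *st)
{
	for (size_t i = 0; i < n; i++) {
		int ret = visit(movable[i], st);

		if (ret)
			return ret;
	}
	/* Like the patched caller: fold the sticky flag into the result. */
	return st->has_unmovable ? 1 : 0;
}
```

With the page sequence { movable, unmovable, movable, movable }, both
visitors make sk_walk() return 1, but the old one queues only the first
page while the new one queues all three movable pages.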
On 9/20/23 3:32 PM, Yang Shi wrote:
> When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT,
> kernel should attempt to migrate all existing pages, and return -EIO if
> there is misplaced or unmovable page. Then commit 6f4576e3687b
> ("mempolicy: apply page table walker on queue_pages_range()") messed up
> the return value and didn't break VMA scan early anymore when MPOL_MF_STRICT
> alone. The return value problem was fixed by commit a7f40cfe3b7a
> ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"),
> but it broke the VMA walk early if unmovable page is met, it may cause some
> pages are not migrated as expected.
>
> The code should conceptually do:
>
> if (MPOL_MF_MOVE|MOVEALL)
> scan all vmas
> try to migrate the existing pages
> return success
> else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
> scan all vmas
> try to migrate the existing pages
> return -EIO if unmovable or migration failed
> else /* MPOL_MF_STRICT alone */
> break early if meets unmovable and don't call mbind_range() at all
> else /* none of those flags */
> check the ranges in test_walk, EFAULT without mbind_range() if discontig.
>
> Fixed the behavior.
I forgot the Fixes tag:
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
>
> Cc: Hugh Dickins <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Rafael Aquini <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: <[email protected]> v4.9+
> Signed-off-by: Yang Shi <[email protected]>
> ---
> mm/mempolicy.c | 39 +++++++++++++++++++--------------------
> 1 file changed, 19 insertions(+), 20 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 42b5567e3773..f1b00d6ac7ee 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -426,6 +426,7 @@ struct queue_pages {
> unsigned long start;
> unsigned long end;
> struct vm_area_struct *first;
> + bool has_unmovable;
> };
>
> /*
> @@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
> /*
> * queue_folios_pmd() has three possible return values:
> * 0 - folios are placed on the right node or queued successfully, or
> - * special page is met, i.e. huge zero page.
> - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> - * specified.
> + * special page is met, i.e. zero page, or unmovable page is found
> + * but continue walking (indicated by queue_pages.has_unmovable).
> * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
> * existing folio was already on a node that does not follow the
> * policy.
> @@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> if (!vma_migratable(walk->vma) ||
> migrate_folio_add(folio, qp->pagelist, flags)) {
> - ret = 1;
> + qp->has_unmovable = true;
> goto unlock;
> }
> } else
> @@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> *
> * queue_folios_pte_range() has three possible return values:
> * 0 - folios are placed on the right node or queued successfully, or
> - * special page is met, i.e. zero page.
> - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> - * specified.
> + * special page is met, i.e. zero page, or unmovable page is found
> + * but continue walking (indicated by queue_pages.has_unmovable).
> * -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
> * on a node that does not follow the policy.
> */
> @@ -508,7 +507,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
> struct folio *folio;
> struct queue_pages *qp = walk->private;
> unsigned long flags = qp->flags;
> - bool has_unmovable = false;
> pte_t *pte, *mapped_pte;
> pte_t ptent;
> spinlock_t *ptl;
> @@ -538,11 +536,12 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
> if (!queue_folio_required(folio, qp))
> continue;
> if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> - /* MPOL_MF_STRICT must be specified if we get here */
> - if (!vma_migratable(vma)) {
> - has_unmovable = true;
> - break;
> - }
> + /*
> + * MPOL_MF_STRICT must be specified if we get here.
> + * Continue walking vmas due to MPOL_MF_MOVE* flags.
> + */
> + if (!vma_migratable(vma))
> + qp->has_unmovable = true;
>
> /*
> * Do not abort immediately since there may be
> @@ -550,16 +549,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
> * need migrate other LRU pages.
> */
> if (migrate_folio_add(folio, qp->pagelist, flags))
> - has_unmovable = true;
> + qp->has_unmovable = true;
> } else
> break;
> }
> pte_unmap_unlock(mapped_pte, ptl);
> cond_resched();
>
> - if (has_unmovable)
> - return 1;
> -
> return addr != end ? -EIO : 0;
> }
>
> @@ -599,7 +595,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
> * Detecting misplaced folio but allow migrating folios which
> * have been queued.
> */
> - ret = 1;
> + qp->has_unmovable = true;
> goto unlock;
> }
>
> @@ -620,7 +616,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
> * Failed to isolate folio but allow migrating pages
> * which have been queued.
> */
> - ret = 1;
> + qp->has_unmovable = true;
> }
> unlock:
> spin_unlock(ptl);
> @@ -756,12 +752,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
> .start = start,
> .end = end,
> .first = NULL,
> + .has_unmovable = false,
> };
> const struct mm_walk_ops *ops = lock_vma ?
> &queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
>
> err = walk_page_range(mm, start, end, ops, &qp);
>
> + if (qp.has_unmovable)
> + err = 1;
> if (!qp.first)
> /* whole range in hole */
> err = -EFAULT;
> @@ -1358,7 +1357,7 @@ static long do_mbind(unsigned long start, unsigned long len,
> putback_movable_pages(&pagelist);
> }
>
> - if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
> + if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
> err = -EIO;
> } else {
> up_out:
On 9/25/23 8:48 AM, Andrew Morton wrote:
> On Wed, 20 Sep 2023 15:32:42 -0700 Yang Shi <[email protected]> wrote:
>
>> When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT,
>> kernel should attempt to migrate all existing pages, and return -EIO if
>> there is misplaced or unmovable page. Then commit 6f4576e3687b
>> ("mempolicy: apply page table walker on queue_pages_range()") messed up
>> the return value and didn't break VMA scan early anymore when MPOL_MF_STRICT
>> alone. The return value problem was fixed by commit a7f40cfe3b7a
>> ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"),
>> but it broke the VMA walk early if unmovable page is met, it may cause some
>> pages are not migrated as expected.
> So I'm thinking that a7f40cfe3b7a is the suitable Fixes: target?
Yes, thanks. My follow-up email also added this.
>
>> The code should conceptually do:
>>
>> if (MPOL_MF_MOVE|MOVEALL)
>> scan all vmas
>> try to migrate the existing pages
>> return success
>> else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
>> scan all vmas
>> try to migrate the existing pages
>> return -EIO if unmovable or migration failed
>> else /* MPOL_MF_STRICT alone */
>> break early if meets unmovable and don't call mbind_range() at all
>> else /* none of those flags */
>> check the ranges in test_walk, EFAULT without mbind_range() if discontig.
>>
>> Fixed the behavior.
>>
On Wed, 20 Sep 2023 15:32:42 -0700 Yang Shi <[email protected]> wrote:
> When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT,
> kernel should attempt to migrate all existing pages, and return -EIO if
> there is misplaced or unmovable page. Then commit 6f4576e3687b
> ("mempolicy: apply page table walker on queue_pages_range()") messed up
> the return value and didn't break VMA scan early anymore when MPOL_MF_STRICT
> alone. The return value problem was fixed by commit a7f40cfe3b7a
> ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"),
> but it broke the VMA walk early if unmovable page is met, it may cause some
> pages are not migrated as expected.
So I'm thinking that a7f40cfe3b7a is the suitable Fixes: target?
> The code should conceptually do:
>
> if (MPOL_MF_MOVE|MOVEALL)
> scan all vmas
> try to migrate the existing pages
> return success
> else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
> scan all vmas
> try to migrate the existing pages
> return -EIO if unmovable or migration failed
> else /* MPOL_MF_STRICT alone */
> break early if meets unmovable and don't call mbind_range() at all
> else /* none of those flags */
> check the ranges in test_walk, EFAULT without mbind_range() if discontig.
>
> Fixed the behavior.
>
On Mon, Sep 25, 2023 at 10:16 AM Yang Shi <[email protected]> wrote:
>
>
>
> On 9/25/23 8:48 AM, Andrew Morton wrote:
> > On Wed, 20 Sep 2023 15:32:42 -0700 Yang Shi <[email protected]> wrote:
> >
> >> When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT,
> >> kernel should attempt to migrate all existing pages, and return -EIO if
> >> there is misplaced or unmovable page. Then commit 6f4576e3687b
> >> ("mempolicy: apply page table walker on queue_pages_range()") messed up
> >> the return value and didn't break VMA scan early anymore when MPOL_MF_STRICT
> >> alone. The return value problem was fixed by commit a7f40cfe3b7a
> >> ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"),
> >> but it broke the VMA walk early if unmovable page is met, it may cause some
> >> pages are not migrated as expected.
> > So I'm thinking that a7f40cfe3b7a is the suitable Fixes: target?
>
> Yes, thanks. My follow-up email also added this.
>
> >
> >> The code should conceptually do:
> >>
> >> if (MPOL_MF_MOVE|MOVEALL)
> >> scan all vmas
> >> try to migrate the existing pages
> >> return success
> >> else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
> >> scan all vmas
> >> try to migrate the existing pages
> >> return -EIO if unmovable or migration failed
> >> else /* MPOL_MF_STRICT alone */
> >> break early if meets unmovable and don't call mbind_range() at all
> >> else /* none of those flags */
> >> check the ranges in test_walk, EFAULT without mbind_range() if discontig.
With this change I think my temporary fix at
https://lore.kernel.org/all/[email protected]/
can be removed because we either scan all vmas (which means we locked
them all) or we break early and do not call mbind_range() at all (in
which case we don't need vmas to be locked).
> >>
> >> Fixed the behavior.
> >>
>
On 9/28/23 9:38 AM, Andrew Morton wrote:
> On Wed, 27 Sep 2023 14:39:21 -0700 Suren Baghdasaryan <[email protected]> wrote:
>
>>>>> The code should conceptually do:
>>>>>
>>>>> if (MPOL_MF_MOVE|MOVEALL)
>>>>> scan all vmas
>>>>> try to migrate the existing pages
>>>>> return success
>>>>> else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
>>>>> scan all vmas
>>>>> try to migrate the existing pages
>>>>> return -EIO if unmovable or migration failed
>>>>> else /* MPOL_MF_STRICT alone */
>>>>> break early if meets unmovable and don't call mbind_range() at all
>>>>> else /* none of those flags */
>>>>> check the ranges in test_walk, EFAULT without mbind_range() if discontig.
>> With this change I think my temporary fix at
>> https://lore.kernel.org/all/[email protected]/
>> can be removed because we either scan all vmas (which means we locked
>> them all) or we break early and do not call mbind_range() at all (in
>> which case we don't need vmas to be locked).
Yes, we could just drop it. Keeping it, so that the code does not depend
on the subtle behavior of queue_pages_range(), is also OK with me. I
don't have a strong preference.
> Thanks, I dropped "mm: lock VMAs skipped by a failed queue_pages_range()"
Thanks, Andrew.
On Wed, 27 Sep 2023 14:39:21 -0700 Suren Baghdasaryan <[email protected]> wrote:
> > >
> > >> The code should conceptually do:
> > >>
> > >> if (MPOL_MF_MOVE|MOVEALL)
> > >> scan all vmas
> > >> try to migrate the existing pages
> > >> return success
> > >> else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
> > >> scan all vmas
> > >> try to migrate the existing pages
> > >> return -EIO if unmovable or migration failed
> > >> else /* MPOL_MF_STRICT alone */
> > >> break early if meets unmovable and don't call mbind_range() at all
> > >> else /* none of those flags */
> > >> check the ranges in test_walk, EFAULT without mbind_range() if discontig.
>
> With this change I think my temporary fix at
> https://lore.kernel.org/all/[email protected]/
> can be removed because we either scan all vmas (which means we locked
> them all) or we break early and do not call mbind_range() at all (in
> which case we don't need vmas to be locked).
Thanks, I dropped "mm: lock VMAs skipped by a failed queue_pages_range()"