Waiting on a page migration entry has used wait_on_page_locked() all
along since 2006: but you cannot safely wait_on_page_locked() without
holding a reference to the page, and that extra reference is enough to
make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
faults on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when
trying to migrate a popular page. With a single persistent faulter,
migration sometimes succeeds; with two or three concurrent faulters,
success becomes much less likely (and the more the page was mapped,
the worse the overhead of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer
level retries forever (or until terminated from userspace), because
a heavy refault workload can trigger an endless loop of migration
failures. wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2
where Kan Liang went on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2
and the thread ended up with the v4.14 commits:
2554db916586 ("sched/wait: Break up long wake list walk")
11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(),
and is careful not to rely on struct page contents thereafter, there is
no need to hold a reference to the page while waiting on it. That does
mean that this case cannot go back through the loop: but that's fine for
the page migration case, and even if used more widely, is limited by the
"Stop walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using negative
value of the lock arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow:
I have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state,
so arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the
page, prior to calling put_and_wait_on_page_locked(): but now that it
is dropped before waiting, the chance of impeding page migration is
very much reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference:
in such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days;
but we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable()
and __remove_mapping() now treat a PageWaiters page as if an extra
reference were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked()
is only used for page migration.
Reported-and-tested-by: Baoquan He <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
Linus, I'm addressing this patch to you because I see from Tim Chen's
thread that it would interest you, and you were disappointed not to
root cause the issue back then. I'm not pushing for you to fast-track
this into 4.20-rc, but I expect Andrew will pick it up for mmotm, and
thence linux-next. Or you may spot a terrible defect, but I hope not.
Hugh
include/linux/pagemap.h | 2 ++
mm/filemap.c | 60 ++++++++++++++++++++++++++++++++---------
mm/huge_memory.c | 6 ++---
mm/migrate.c | 12 +++------
mm/vmscan.c | 10 ++-----
5 files changed, 57 insertions(+), 33 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..e2d7039af6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
return wait_on_page_bit_killable(compound_head(page), PG_locked);
}
+extern void put_and_wait_on_page_locked(struct page *page);
+
/*
* Wait for a page to complete writeback
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..d645fbbf34d7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
if (wait_page->bit_nr != key->bit_nr)
return 0;
- /* Stop walking if it's locked */
+ /*
+ * Stop walking if it's locked.
+ * Is this safe if put_and_wait_on_page_locked() is in use?
+ * Yes: the waker must hold a reference to this page, and if PG_locked
+ * has now already been set by another task, that task must also hold
+ * a reference to the *same usage* of this page; so there is no need
+ * to walk on to wake even the put_and_wait_on_page_locked() callers.
+ */
if (test_bit(key->bit_nr, &key->page->flags))
return -1;
@@ -1050,24 +1057,28 @@ static void wake_up_page(struct page *page, int bit)
}
static inline int wait_on_page_bit_common(wait_queue_head_t *q,
- struct page *page, int bit_nr, int state, bool lock)
+ struct page *page, int bit_nr, int state, int lock)
{
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
+ bool bit_is_set;
bool thrashing = false;
+ bool delayacct = false;
unsigned long pflags;
int ret = 0;
if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
- if (!PageSwapBacked(page))
+ if (!PageSwapBacked(page)) {
delayacct_thrashing_start();
+ delayacct = true;
+ }
psi_memstall_enter(&pflags);
thrashing = true;
}
init_wait(wait);
- wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
+ wait->flags = lock > 0 ? WQ_FLAG_EXCLUSIVE : 0;
wait->func = wake_page_function;
wait_page.page = page;
wait_page.bit_nr = bit_nr;
@@ -1084,14 +1095,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
spin_unlock_irq(&q->lock);
- if (likely(test_bit(bit_nr, &page->flags))) {
+ bit_is_set = test_bit(bit_nr, &page->flags);
+ if (lock < 0)
+ put_page(page);
+
+ if (likely(bit_is_set))
io_schedule();
- }
- if (lock) {
+ if (lock > 0) {
if (!test_and_set_bit_lock(bit_nr, &page->flags))
break;
- } else {
+ } else if (lock == 0) {
if (!test_bit(bit_nr, &page->flags))
break;
}
@@ -1100,12 +1114,23 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
ret = -EINTR;
break;
}
+
+ if (lock < 0) {
+ /*
+ * We can no longer safely access page->flags:
+ * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
+ * there is a risk of waiting forever on a page reused
+ * for something that keeps it locked indefinitely.
+ * But best check for -EINTR above before breaking.
+ */
+ break;
+ }
}
finish_wait(q, wait);
if (thrashing) {
- if (!PageSwapBacked(page))
+ if (delayacct)
delayacct_thrashing_end();
psi_memstall_leave(&pflags);
}
@@ -1124,17 +1149,26 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
void wait_on_page_bit(struct page *page, int bit_nr)
{
wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false);
+ wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, 0);
}
EXPORT_SYMBOL(wait_on_page_bit);
int wait_on_page_bit_killable(struct page *page, int bit_nr)
{
wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false);
+ return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, 0);
}
EXPORT_SYMBOL(wait_on_page_bit_killable);
+void put_and_wait_on_page_locked(struct page *page)
+{
+ wait_queue_head_t *q;
+
+ page = compound_head(page);
+ q = page_waitqueue(page);
+ wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, -1);
+}
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
@@ -1264,7 +1298,7 @@ void __lock_page(struct page *__page)
{
struct page *page = compound_head(__page);
wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true);
+ wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, 1);
}
EXPORT_SYMBOL(__lock_page);
@@ -1272,7 +1306,7 @@ int __lock_page_killable(struct page *__page)
{
struct page *page = compound_head(__page);
wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
+ return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, 1);
}
EXPORT_SYMBOL_GPL(__lock_page_killable);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 622cced74fd9..832ab11badc2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1501,8 +1501,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
if (!get_page_unless_zero(page))
goto out_unlock;
spin_unlock(vmf->ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
goto out;
}
@@ -1538,8 +1537,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
if (!get_page_unless_zero(page))
goto out_unlock;
spin_unlock(vmf->ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..acda06f99754 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -327,16 +327,13 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
/*
* Once page cache replacement of page migration started, page_count
- * *must* be zero. And, we don't want to call wait_on_page_locked()
- * against a page without get_page().
- * So, we use get_page_unless_zero(), here. Even failed, page fault
- * will occur again.
+ * is zero; but we must not call put_and_wait_on_page_locked() without
+ * a ref. Use get_page_unless_zero(), and just fault again if it fails.
*/
if (!get_page_unless_zero(page))
goto out;
pte_unmap_unlock(ptep, ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
return;
out:
pte_unmap_unlock(ptep, ptl);
@@ -370,8 +367,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
if (!get_page_unless_zero(page))
goto unlock;
spin_unlock(ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
return;
unlock:
spin_unlock(ptl);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..9c50d90b9bc5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1456,14 +1456,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
count_memcg_page_event(page, PGLAZYFREED);
} else if (!mapping || !__remove_mapping(mapping, page, true))
goto keep_locked;
- /*
- * At this point, we have no other references and there is
- * no way to pick any more up (removed from LRU, removed
- * from pagecache). Can use non-atomic bitops now (and
- * we obviously don't have to worry about waking up a process
- * waiting on the page lock, because there are no references.
- */
- __ClearPageLocked(page);
+
+ unlock_page(page);
free_it:
nr_reclaimed++;
--
2.20.0.rc0.387.gc7a69e6b6c-goog
On Sat, Nov 24, 2018 at 07:21:07PM -0800, Hugh Dickins wrote:
> Waiting on a page migration entry has used wait_on_page_locked() all
> along since 2006: but you cannot safely wait_on_page_locked() without
> holding a reference to the page, and that extra reference is enough to
> make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
> faults on the entry before migrate_page_move_mapping() gets there.
>
> And that failure is retried nine times, amplifying the pain when
> trying to migrate a popular page. With a single persistent faulter,
> migration sometimes succeeds; with two or three concurrent faulters,
> success becomes much less likely (and the more the page was mapped,
> the worse the overhead of unmapping and remapping it on each try).
>
> This is especially a problem for memory offlining, where the outer
> level retries forever (or until terminated from userspace), because
> a heavy refault workload can trigger an endless loop of migration
> failures. wait_on_page_locked() is the wrong tool for the job.
>
> David Herrmann (but was he the first?) noticed this issue in 2014:
> https://marc.info/?l=linux-mm&m=140110465608116&w=2
>
> Tim Chen started a thread in August 2017 which appears relevant:
> https://marc.info/?l=linux-mm&m=150275941014915&w=2
> where Kan Liang went on to implicate __migration_entry_wait():
> https://marc.info/?l=linux-mm&m=150300268411980&w=2
> and the thread ended up with the v4.14 commits:
> 2554db916586 ("sched/wait: Break up long wake list walk")
> 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
>
> Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
> https://marc.info/?l=linux-mm&m=154217936431300&w=2
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.
>
> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it. That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().
>
> Add interface put_and_wait_on_page_locked() to do this, using negative
> value of the lock arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
>
> __migration_entry_wait() still has to take a brief reference to the
> page, prior to calling put_and_wait_on_page_locked(): but now that it
> is dropped before waiting, the chance of impeding page migration is
> very much reduced. Should we perhaps disable preemption across this?
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
> survived a lot of testing before that showed up. PageWaiters may have
> been set by wait_on_page_bit_common(), and the reference dropped, just
> before shrink_page_list() succeeds in freezing its last page reference:
> in such a case, unlock_page() must be used. Follow the suggestion from
> Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
> that optimization predates PageWaiters, and won't buy much these days;
> but we can reinstate it for the !PageWaiters case if anyone notices.
>
> It does raise the question: should vmscan.c's is_page_cache_freeable()
> and __remove_mapping() now treat a PageWaiters page as if an extra
> reference were held? Perhaps, but I don't think it matters much, since
> shrink_page_list() already had to win its trylock_page(), so waiters are
> not very common there: I noticed no difference when trying the bigger
> change, and it's surely not needed while put_and_wait_on_page_locked()
> is only used for page migration.
>
> Reported-and-tested-by: Baoquan He <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
On Sat, Nov 24, 2018 at 7:21 PM Hugh Dickins <[email protected]> wrote:
>
> Linus, I'm addressing this patch to you because I see from Tim Chen's
> thread that it would interest you, and you were disappointed not to
> root cause the issue back then. I'm not pushing for you to fast-track
> this into 4.20-rc, but I expect Andrew will pick it up for mmotm, and
> thence linux-next. Or you may spot a terrible defect, but I hope not.
The only terrible defect I spot is that I wish the change to the
'lock' argument in wait_on_page_bit_common() came with a comment
explaining the new semantics.
The old semantics were somewhat obvious (even if not documented): if
'lock' was set, we'd make the wait exclusive, and we'd lock the page
before returning. That kind of matches the intuitive meaning for the
function prototype, and it's pretty obvious in the callers too.
The new semantics don't have the same kind of really intuitive
meaning, I feel. That "-1" doesn't mean "unlock", it means "drop page
reference", so there is no longer a fairly intuitive and direct
mapping between the argument name and type and the behavior of the
function.
So I don't hate the concept of the patch at all, but I do ask to:
- better documentation.
This might not be "documentation" at all, maybe that "lock"
variable should just be renamed (because it's not about just locking
any more), and would be better off as a tristate enum called
"behavior" that has "LOCK, DROP, WAIT" values?
- while it sounds likely that this is indeed the same issue that
plagues us with the insanely long wait-queues, it would be *really*
nice to have that actually confirmed.
Does somebody still have access to the customer load that triggered
the horrible scaling issues before?
In particular, on that second issue: the "fixes" that went in for the
wait-queues didn't really fix any real scalability problem, it really
just fixed the excessive irq latency issues due to the long traversal
holding a lock.
If this really fixes the fundamental issue, that should show up as an
actual performance difference, I'd expect..
End result: I like and approve of the patch, but I'd like it a lot
more if the code behavior was clarified a bit, and I'd really like to
close the loop on that old nasty page wait queue issue...
Linus
On Sun, 25 Nov 2018, Linus Torvalds wrote:
> On Sat, Nov 24, 2018 at 7:21 PM Hugh Dickins <[email protected]> wrote:
> >
> > Linus, I'm addressing this patch to you because I see from Tim Chen's
> > thread that it would interest you, and you were disappointed not to
> > root cause the issue back then. I'm not pushing for you to fast-track
> > this into 4.20-rc, but I expect Andrew will pick it up for mmotm, and
> > thence linux-next. Or you may spot a terrible defect, but I hope not.
>
> The only terrible defect I spot is that I wish the change to the
> 'lock' argument in wait_on_page_bit_common() came with a comment
> explaining the new semantics.o
Thanks a lot for looking through it.
>
> The old semantics were somewhat obvious (even if not documented): if
> 'lock' was set, we'd make the wait exclusive, and we'd lock the page
> before returning. That kind of matches the intuitive meaning for the
> function prototype, and it's pretty obvious in the callers too.
>
> The new semantics don't have the same kind of really intuitive
> meaning, I feel. That "-1" doesn't mean "unlock", it means "drop page
> reference", so there is no longer a fairly intuitive and direct
> mapping between the argument name and type and the behavior of the
> function.
>
> So I don't hate the concept of the patch at all, but I do ask to:
>
> - better documentation.
>
> This might not be "documentation" at all, maybe that "lock"
> variable should just be renamed (because it's not about just locking
> any more), and would be better off as a tristate enum called
> "behavior" that has "LOCK, DROP, WAIT" values?
Agreed, an enum should be best. I'll try it out now, and see what
naming fits - I'm not all that keen on "LOCK", since (like many of the
existing comments) it forgets that PG_locked is only one of the flags
that comes here. Admittedly, the only other is PG_writeback, and
nobody wants exclusive behavior on that one, but...
>
> - while it sounds likely that this is indeed the same issue that
> plagues us with the insanely long wait-queues, it would be *really*
> nice to have that actually confirmed.
I echo your words: it would be *really* nice. We do already know
that this patch is good for many problem loads, but it would be very
satisfying if it could also wrap that discussion from last year.
>
> Does somebody still have access to the customer load that triggered
> the horrible scaling issues before?
Kan? Tim?
>
> In particular, on that second issue: the "fixes" that went in for the
> wait-queues didn't really fix any real scalability problem, it really
> just fixed the excessive irq latency issues due to the long traversal
> holding a lock.
>
> If this really fixes the fundamental issue, that should show up as an
> actual performance difference, I'd expect..
I guess so, though it might be more convincing to add a hack to suppress
the bookmarking (e.g. #define WAITQUEUE_WALK_BREAK_CNT (INT_MAX - 1))
when trying out the put_and_wait patch - if they can persuade the
customer to go back in time on this, which is asking a lot.
Not that I have any ambitions to do away with the bookmarking myself;
though I do have several reservations about the way it works out (that
I'd rather go into some other time).
>
> End result: I like and approve of the patch, but I'd like it a lot
> more if the code behavior was clarified a bit, and I'd really like to
> close the loop on that old nasty page wait queue issue...
Thanks!
Hugh
Waiting on a page migration entry has used wait_on_page_locked() all
along since 2006: but you cannot safely wait_on_page_locked() without
holding a reference to the page, and that extra reference is enough to
make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
faults on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when
trying to migrate a popular page. With a single persistent faulter,
migration sometimes succeeds; with two or three concurrent faulters,
success becomes much less likely (and the more the page was mapped,
the worse the overhead of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer
level retries forever (or until terminated from userspace), because
a heavy refault workload can trigger an endless loop of migration
failures. wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2
where Kan Liang went on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2
and the thread ended up with the v4.14 commits:
2554db916586 ("sched/wait: Break up long wake list walk")
11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(),
and is careful not to rely on struct page contents thereafter, there is
no need to hold a reference to the page while waiting on it. That does
mean that this case cannot go back through the loop: but that's fine for
the page migration case, and even if used more widely, is limited by the
"Stop walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow:
I have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state,
so arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the
page, prior to calling put_and_wait_on_page_locked(): but now that it
is dropped before waiting, the chance of impeding page migration is
very much reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference:
in such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days;
but we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable()
and __remove_mapping() now treat a PageWaiters page as if an extra
reference were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked()
is only used for page migration.
Reported-and-tested-by: Baoquan He <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
include/linux/pagemap.h | 2 ++
mm/filemap.c | 77 ++++++++++++++++++++++++++++++++++-------
mm/huge_memory.c | 6 ++--
mm/migrate.c | 12 +++----
mm/vmscan.c | 10 ++----
5 files changed, 74 insertions(+), 33 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..e2d7039af6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
return wait_on_page_bit_killable(compound_head(page), PG_locked);
}
+extern void put_and_wait_on_page_locked(struct page *page);
+
/*
* Wait for a page to complete writeback
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..575e16c037ca 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
if (wait_page->bit_nr != key->bit_nr)
return 0;
- /* Stop walking if it's locked */
+ /*
+ * Stop walking if it's locked.
+ * Is this safe if put_and_wait_on_page_locked() is in use?
+ * Yes: the waker must hold a reference to this page, and if PG_locked
+ * has now already been set by another task, that task must also hold
+ * a reference to the *same usage* of this page; so there is no need
+ * to walk on to wake even the put_and_wait_on_page_locked() callers.
+ */
if (test_bit(key->bit_nr, &key->page->flags))
return -1;
@@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
wake_up_page_bit(page, bit);
}
+/*
+ * A choice of three behaviors for wait_on_page_bit_common():
+ */
+enum behavior {
+ EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
+ * __lock_page() waiting on then setting PG_locked.
+ */
+ SHARED, /* Hold ref to page and check the bit when woken, like
+ * wait_on_page_writeback() waiting on PG_writeback.
+ */
+ DROP, /* Drop ref to page before wait, no check when woken,
+ * like put_and_wait_on_page_locked() on PG_locked.
+ */
+};
+
static inline int wait_on_page_bit_common(wait_queue_head_t *q,
- struct page *page, int bit_nr, int state, bool lock)
+ struct page *page, int bit_nr, int state, enum behavior behavior)
{
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
+ bool bit_is_set;
bool thrashing = false;
+ bool delayacct = false;
unsigned long pflags;
int ret = 0;
if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
- if (!PageSwapBacked(page))
+ if (!PageSwapBacked(page)) {
delayacct_thrashing_start();
+ delayacct = true;
+ }
psi_memstall_enter(&pflags);
thrashing = true;
}
init_wait(wait);
- wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
+ wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0;
wait->func = wake_page_function;
wait_page.page = page;
wait_page.bit_nr = bit_nr;
@@ -1084,14 +1110,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
spin_unlock_irq(&q->lock);
- if (likely(test_bit(bit_nr, &page->flags))) {
+ bit_is_set = test_bit(bit_nr, &page->flags);
+ if (behavior == DROP)
+ put_page(page);
+
+ if (likely(bit_is_set))
io_schedule();
- }
- if (lock) {
+ if (behavior == EXCLUSIVE) {
if (!test_and_set_bit_lock(bit_nr, &page->flags))
break;
- } else {
+ } else if (behavior == SHARED) {
if (!test_bit(bit_nr, &page->flags))
break;
}
@@ -1100,12 +1129,23 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
ret = -EINTR;
break;
}
+
+ if (behavior == DROP) {
+ /*
+ * We can no longer safely access page->flags:
+ * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
+ * there is a risk of waiting forever on a page reused
+ * for something that keeps it locked indefinitely.
+ * But best check for -EINTR above before breaking.
+ */
+ break;
+ }
}
finish_wait(q, wait);
if (thrashing) {
- if (!PageSwapBacked(page))
+ if (delayacct)
delayacct_thrashing_end();
psi_memstall_leave(&pflags);
}
@@ -1124,17 +1164,26 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
void wait_on_page_bit(struct page *page, int bit_nr)
{
wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false);
+ wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
}
EXPORT_SYMBOL(wait_on_page_bit);
int wait_on_page_bit_killable(struct page *page, int bit_nr)
{
wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false);
+ return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED);
}
EXPORT_SYMBOL(wait_on_page_bit_killable);
+void put_and_wait_on_page_locked(struct page *page)
+{
+ wait_queue_head_t *q;
+
+ page = compound_head(page);
+ q = page_waitqueue(page);
+ wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP);
+}
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
@@ -1264,7 +1313,8 @@ void __lock_page(struct page *__page)
{
struct page *page = compound_head(__page);
wait_queue_head_t *q = page_waitqueue(page);
- wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true);
+ wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE,
+ EXCLUSIVE);
}
EXPORT_SYMBOL(__lock_page);
@@ -1272,7 +1322,8 @@ int __lock_page_killable(struct page *__page)
{
struct page *page = compound_head(__page);
wait_queue_head_t *q = page_waitqueue(page);
- return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
+ return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE,
+ EXCLUSIVE);
}
EXPORT_SYMBOL_GPL(__lock_page_killable);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 622cced74fd9..832ab11badc2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1501,8 +1501,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
if (!get_page_unless_zero(page))
goto out_unlock;
spin_unlock(vmf->ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
goto out;
}
@@ -1538,8 +1537,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
if (!get_page_unless_zero(page))
goto out_unlock;
spin_unlock(vmf->ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..acda06f99754 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -327,16 +327,13 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
/*
* Once page cache replacement of page migration started, page_count
- * *must* be zero. And, we don't want to call wait_on_page_locked()
- * against a page without get_page().
- * So, we use get_page_unless_zero(), here. Even failed, page fault
- * will occur again.
+ * is zero; but we must not call put_and_wait_on_page_locked() without
+ * a ref. Use get_page_unless_zero(), and just fault again if it fails.
*/
if (!get_page_unless_zero(page))
goto out;
pte_unmap_unlock(ptep, ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
return;
out:
pte_unmap_unlock(ptep, ptl);
@@ -370,8 +367,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
if (!get_page_unless_zero(page))
goto unlock;
spin_unlock(ptl);
- wait_on_page_locked(page);
- put_page(page);
+ put_and_wait_on_page_locked(page);
return;
unlock:
spin_unlock(ptl);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..9c50d90b9bc5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1456,14 +1456,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
count_memcg_page_event(page, PGLAZYFREED);
} else if (!mapping || !__remove_mapping(mapping, page, true))
goto keep_locked;
- /*
- * At this point, we have no other references and there is
- * no way to pick any more up (removed from LRU, removed
- * from pagecache). Can use non-atomic bitops now (and
- * we obviously don't have to worry about waking up a process
- * waiting on the page lock, because there are no references.
- */
- __ClearPageLocked(page);
+
+ unlock_page(page);
free_it:
nr_reclaimed++;
--
2.20.0.rc0.387.gc7a69e6b6c-goog
On 11/25/2018 07:29 PM, Hugh Dickins wrote:
>> Does somebody still have access to the customer load that triggered
>> the horrible scaling issues before?
>
> Kan? Tim?
>
We don't have access to the workload know. Will ask the customer to
see if they can test it.
Thanks.
Tim
On Mon, Nov 26, 2018 at 11:27 AM Hugh Dickins <[email protected]> wrote:
>
> +enum behavior {
> + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
> + * __lock_page() waiting on then setting PG_locked.
> + */
> + SHARED, /* Hold ref to page and check the bit when woken, like
> + * wait_on_page_writeback() waiting on PG_writeback.
> + */
> + DROP, /* Drop ref to page before wait, no check when woken,
> + * like put_and_wait_on_page_locked() on PG_locked.
> + */
> +};
Ack, thanks.
Linus
On Mon 26-11-18 11:27:07, Hugh Dickins wrote:
[...]
> @@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
> wake_up_page_bit(page, bit);
> }
>
> +/*
> + * A choice of three behaviors for wait_on_page_bit_common():
> + */
> +enum behavior {
> + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
> + * __lock_page() waiting on then setting PG_locked.
> + */
> + SHARED, /* Hold ref to page and check the bit when woken, like
> + * wait_on_page_writeback() waiting on PG_writeback.
> + */
> + DROP, /* Drop ref to page before wait, no check when woken,
> + * like put_and_wait_on_page_locked() on PG_locked.
> + */
> +};
I like this. It makes to semantic much more clear.
Thanks!
--
Michal Hocko
SUSE Labs
On Mon, Nov 26, 2018 at 11:27:07AM -0800, Hugh Dickins wrote:
> Waiting on a page migration entry has used wait_on_page_locked() all
> along since 2006: but you cannot safely wait_on_page_locked() without
> holding a reference to the page, and that extra reference is enough to
> make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
> faults on the entry before migrate_page_move_mapping() gets there.
>
> And that failure is retried nine times, amplifying the pain when
> trying to migrate a popular page. With a single persistent faulter,
> migration sometimes succeeds; with two or three concurrent faulters,
> success becomes much less likely (and the more the page was mapped,
> the worse the overhead of unmapping and remapping it on each try).
>
> This is especially a problem for memory offlining, where the outer
> level retries forever (or until terminated from userspace), because
> a heavy refault workload can trigger an endless loop of migration
> failures. wait_on_page_locked() is the wrong tool for the job.
>
> David Herrmann (but was he the first?) noticed this issue in 2014:
> https://marc.info/?l=linux-mm&m=140110465608116&w=2
>
> Tim Chen started a thread in August 2017 which appears relevant:
> https://marc.info/?l=linux-mm&m=150275941014915&w=2
> where Kan Liang went on to implicate __migration_entry_wait():
> https://marc.info/?l=linux-mm&m=150300268411980&w=2
> and the thread ended up with the v4.14 commits:
> 2554db916586 ("sched/wait: Break up long wake list walk")
> 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
>
> Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
> https://marc.info/?l=linux-mm&m=154217936431300&w=2
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.
>
> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it. That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().
>
> Add interface put_and_wait_on_page_locked() to do this, using "behavior"
> enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
>
> __migration_entry_wait() still has to take a brief reference to the
> page, prior to calling put_and_wait_on_page_locked(): but now that it
> is dropped before waiting, the chance of impeding page migration is
> very much reduced. Should we perhaps disable preemption across this?
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
> survived a lot of testing before that showed up. PageWaiters may have
> been set by wait_on_page_bit_common(), and the reference dropped, just
> before shrink_page_list() succeeds in freezing its last page reference:
> in such a case, unlock_page() must be used. Follow the suggestion from
> Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
> that optimization predates PageWaiters, and won't buy much these days;
> but we can reinstate it for the !PageWaiters case if anyone notices.
>
> It does raise the question: should vmscan.c's is_page_cache_freeable()
> and __remove_mapping() now treat a PageWaiters page as if an extra
> reference were held? Perhaps, but I don't think it matters much, since
> shrink_page_list() already had to win its trylock_page(), so waiters are
> not very common there: I noticed no difference when trying the bigger
> change, and it's surely not needed while put_and_wait_on_page_locked()
> is only used for page migration.
>
> Reported-and-tested-by: Baoquan He <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/pagemap.h | 2 ++
> mm/filemap.c | 77 ++++++++++++++++++++++++++++++++++-------
> mm/huge_memory.c | 6 ++--
> mm/migrate.c | 12 +++----
> mm/vmscan.c | 10 ++----
> 5 files changed, 74 insertions(+), 33 deletions(-)
>
/**
* put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
* @page: The page to wait for.
*
* The caller should hold a reference on @page. They expect the page to
* become unlocked relatively soon, but do not wish to hold up migration
* (for example) by holding the reference while waiting for the page to
* come unlocked. After this function returns, the caller should not
* dereference @page.
*/
(improvements gratefully received)
> +void put_and_wait_on_page_locked(struct page *page)
> +{
> + wait_queue_head_t *q;
> +
> + page = compound_head(page);
> + q = page_waitqueue(page);
> + wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP);
> +}
> +
> /**
> * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
> * @page: Page defining the wait queue of interest
On 11/26/18 8:27 PM, Hugh Dickins wrote:
> Waiting on a page migration entry has used wait_on_page_locked() all
> along since 2006: but you cannot safely wait_on_page_locked() without
> holding a reference to the page, and that extra reference is enough to
> make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
> faults on the entry before migrate_page_move_mapping() gets there.
>
> And that failure is retried nine times, amplifying the pain when
> trying to migrate a popular page. With a single persistent faulter,
> migration sometimes succeeds; with two or three concurrent faulters,
> success becomes much less likely (and the more the page was mapped,
> the worse the overhead of unmapping and remapping it on each try).
>
> This is especially a problem for memory offlining, where the outer
> level retries forever (or until terminated from userspace), because
> a heavy refault workload can trigger an endless loop of migration
> failures. wait_on_page_locked() is the wrong tool for the job.
>
> David Herrmann (but was he the first?) noticed this issue in 2014:
> https://marc.info/?l=linux-mm&m=140110465608116&w=2
>
> Tim Chen started a thread in August 2017 which appears relevant:
> https://marc.info/?l=linux-mm&m=150275941014915&w=2
> where Kan Liang went on to implicate __migration_entry_wait():
> https://marc.info/?l=linux-mm&m=150300268411980&w=2
> and the thread ended up with the v4.14 commits:
> 2554db916586 ("sched/wait: Break up long wake list walk")
> 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
>
> Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
> https://marc.info/?l=linux-mm&m=154217936431300&w=2
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.
>
> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it. That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().
>
> Add interface put_and_wait_on_page_locked() to do this, using "behavior"
> enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
>
> __migration_entry_wait() still has to take a brief reference to the
> page, prior to calling put_and_wait_on_page_locked(): but now that it
> is dropped before waiting, the chance of impeding page migration is
> very much reduced. Should we perhaps disable preemption across this?
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
> survived a lot of testing before that showed up. PageWaiters may have
> been set by wait_on_page_bit_common(), and the reference dropped, just
> before shrink_page_list() succeeds in freezing its last page reference:
> in such a case, unlock_page() must be used. Follow the suggestion from
> Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
> that optimization predates PageWaiters, and won't buy much these days;
> but we can reinstate it for the !PageWaiters case if anyone notices.
>
> It does raise the question: should vmscan.c's is_page_cache_freeable()
> and __remove_mapping() now treat a PageWaiters page as if an extra
> reference were held? Perhaps, but I don't think it matters much, since
> shrink_page_list() already had to win its trylock_page(), so waiters are
> not very common there: I noticed no difference when trying the bigger
> change, and it's surely not needed while put_and_wait_on_page_locked()
> is only used for page migration.
>
> Reported-and-tested-by: Baoquan He <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: Andrea Arcangeli <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Thanks!
On Mon, Nov 26, 2018 at 11:27:07AM -0800, Hugh Dickins wrote:
> Waiting on a page migration entry has used wait_on_page_locked() all
> along since 2006: but you cannot safely wait_on_page_locked() without
> holding a reference to the page, and that extra reference is enough to
> make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
> faults on the entry before migrate_page_move_mapping() gets there.
>
> And that failure is retried nine times, amplifying the pain when
> trying to migrate a popular page. With a single persistent faulter,
> migration sometimes succeeds; with two or three concurrent faulters,
> success becomes much less likely (and the more the page was mapped,
> the worse the overhead of unmapping and remapping it on each try).
>
> This is especially a problem for memory offlining, where the outer
> level retries forever (or until terminated from userspace), because
> a heavy refault workload can trigger an endless loop of migration
> failures. wait_on_page_locked() is the wrong tool for the job.
>
> David Herrmann (but was he the first?) noticed this issue in 2014:
> https://marc.info/?l=linux-mm&m=140110465608116&w=2
>
> Tim Chen started a thread in August 2017 which appears relevant:
> https://marc.info/?l=linux-mm&m=150275941014915&w=2
> where Kan Liang went on to implicate __migration_entry_wait():
> https://marc.info/?l=linux-mm&m=150300268411980&w=2
> and the thread ended up with the v4.14 commits:
> 2554db916586 ("sched/wait: Break up long wake list walk")
> 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
>
> Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
> https://marc.info/?l=linux-mm&m=154217936431300&w=2
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.
>
> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it. That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().
>
> Add interface put_and_wait_on_page_locked() to do this, using "behavior"
> enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
>
> __migration_entry_wait() still has to take a brief reference to the
> page, prior to calling put_and_wait_on_page_locked(): but now that it
> is dropped before waiting, the chance of impeding page migration is
> very much reduced. Should we perhaps disable preemption across this?
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
> survived a lot of testing before that showed up. PageWaiters may have
> been set by wait_on_page_bit_common(), and the reference dropped, just
> before shrink_page_list() succeeds in freezing its last page reference:
> in such a case, unlock_page() must be used. Follow the suggestion from
> Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
> that optimization predates PageWaiters, and won't buy much these days;
> but we can reinstate it for the !PageWaiters case if anyone notices.
>
> It does raise the question: should vmscan.c's is_page_cache_freeable()
> and __remove_mapping() now treat a PageWaiters page as if an extra
> reference were held? Perhaps, but I don't think it matters much, since
> shrink_page_list() already had to win its trylock_page(), so waiters are
> not very common there: I noticed no difference when trying the bigger
> change, and it's surely not needed while put_and_wait_on_page_locked()
> is only used for page migration.
>
> Reported-and-tested-by: Baoquan He <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/pagemap.h | 2 ++
> mm/filemap.c | 77 ++++++++++++++++++++++++++++++++++-------
> mm/huge_memory.c | 6 ++--
> mm/migrate.c | 12 +++----
> mm/vmscan.c | 10 ++----
> 5 files changed, 74 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 226f96f0dee0..e2d7039af6a3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
> return wait_on_page_bit_killable(compound_head(page), PG_locked);
> }
>
> +extern void put_and_wait_on_page_locked(struct page *page);
> +
> /*
> * Wait for a page to complete writeback
> */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 81adec8ee02c..575e16c037ca 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
> if (wait_page->bit_nr != key->bit_nr)
> return 0;
>
> - /* Stop walking if it's locked */
> + /*
> + * Stop walking if it's locked.
> + * Is this safe if put_and_wait_on_page_locked() is in use?
> + * Yes: the waker must hold a reference to this page, and if PG_locked
> + * has now already been set by another task, that task must also hold
> + * a reference to the *same usage* of this page; so there is no need
> + * to walk on to wake even the put_and_wait_on_page_locked() callers.
> + */
> if (test_bit(key->bit_nr, &key->page->flags))
> return -1;
>
> @@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
> wake_up_page_bit(page, bit);
> }
>
> +/*
> + * A choice of three behaviors for wait_on_page_bit_common():
> + */
> +enum behavior {
> + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
> + * __lock_page() waiting on then setting PG_locked.
> + */
> + SHARED, /* Hold ref to page and check the bit when woken, like
> + * wait_on_page_writeback() waiting on PG_writeback.
> + */
> + DROP, /* Drop ref to page before wait, no check when woken,
> + * like put_and_wait_on_page_locked() on PG_locked.
> + */
> +};
Can we please make it:
/**
* enum behavior - a choice of three behaviors for wait_on_page_bit_common()
*/
enum behavior {
/**
* @EXCLUSIVE: Hold ref to page and take the bit when woken,
* like __lock_page() waiting on then setting %PG_locked.
*/
EXCLUSIVE,
/**
* @SHARED: Hold ref to page and check the bit when woken,
* like wait_on_page_writeback() waiting on %PG_writeback.
*/
SHARED,
/**
* @DROP: Drop ref to page before wait, no check when woken,
* like put_and_wait_on_page_locked() on %PG_locked.
*/
DROP,
};
--
Sincerely yours,
Mike.
On Mon, Nov 26, 2018 at 12:53:51PM -0800, Matthew Wilcox wrote:
> On Mon, Nov 26, 2018 at 11:27:07AM -0800, Hugh Dickins wrote:
> > Waiting on a page migration entry has used wait_on_page_locked() all
> > along since 2006: but you cannot safely wait_on_page_locked() without
> > holding a reference to the page, and that extra reference is enough to
> > make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
> > faults on the entry before migrate_page_move_mapping() gets there.
> >
> > And that failure is retried nine times, amplifying the pain when
> > trying to migrate a popular page. With a single persistent faulter,
> > migration sometimes succeeds; with two or three concurrent faulters,
> > success becomes much less likely (and the more the page was mapped,
> > the worse the overhead of unmapping and remapping it on each try).
> >
> > This is especially a problem for memory offlining, where the outer
> > level retries forever (or until terminated from userspace), because
> > a heavy refault workload can trigger an endless loop of migration
> > failures. wait_on_page_locked() is the wrong tool for the job.
> >
> > David Herrmann (but was he the first?) noticed this issue in 2014:
> > https://marc.info/?l=linux-mm&m=140110465608116&w=2
> >
> > Tim Chen started a thread in August 2017 which appears relevant:
> > https://marc.info/?l=linux-mm&m=150275941014915&w=2
> > where Kan Liang went on to implicate __migration_entry_wait():
> > https://marc.info/?l=linux-mm&m=150300268411980&w=2
> > and the thread ended up with the v4.14 commits:
> > 2554db916586 ("sched/wait: Break up long wake list walk")
> > 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
> >
> > Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
> > https://marc.info/?l=linux-mm&m=154217936431300&w=2
> >
> > We have all assumed that it is essential to hold a page reference while
> > waiting on a page lock: partly to guarantee that there is still a struct
> > page when MEMORY_HOTREMOVE is configured, but also to protect against
> > reuse of the struct page going to someone who then holds the page locked
> > indefinitely, when the waiter can reasonably expect timely unlocking.
> >
> > But in fact, so long as wait_on_page_bit_common() does the put_page(),
> > and is careful not to rely on struct page contents thereafter, there is
> > no need to hold a reference to the page while waiting on it. That does
> > mean that this case cannot go back through the loop: but that's fine for
> > the page migration case, and even if used more widely, is limited by the
> > "Stop walking if it's locked" optimization in wake_page_function().
> >
> > Add interface put_and_wait_on_page_locked() to do this, using "behavior"
> > enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
> > No interruptible or killable variant needed yet, but they might follow:
> > I have a vague notion that reporting -EINTR should take precedence over
> > return from wait_on_page_bit_common() without knowing the page state,
> > so arrange it accordingly - but that may be nothing but pedantic.
> >
> > __migration_entry_wait() still has to take a brief reference to the
> > page, prior to calling put_and_wait_on_page_locked(): but now that it
> > is dropped before waiting, the chance of impeding page migration is
> > very much reduced. Should we perhaps disable preemption across this?
> >
> > shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
> > survived a lot of testing before that showed up. PageWaiters may have
> > been set by wait_on_page_bit_common(), and the reference dropped, just
> > before shrink_page_list() succeeds in freezing its last page reference:
> > in such a case, unlock_page() must be used. Follow the suggestion from
> > Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
> > that optimization predates PageWaiters, and won't buy much these days;
> > but we can reinstate it for the !PageWaiters case if anyone notices.
> >
> > It does raise the question: should vmscan.c's is_page_cache_freeable()
> > and __remove_mapping() now treat a PageWaiters page as if an extra
> > reference were held? Perhaps, but I don't think it matters much, since
> > shrink_page_list() already had to win its trylock_page(), so waiters are
> > not very common there: I noticed no difference when trying the bigger
> > change, and it's surely not needed while put_and_wait_on_page_locked()
> > is only used for page migration.
> >
> > Reported-and-tested-by: Baoquan He <[email protected]>
> > Signed-off-by: Hugh Dickins <[email protected]>
> > Acked-by: Michal Hocko <[email protected]>
> > Reviewed-by: Andrea Arcangeli <[email protected]>
> > ---
> > include/linux/pagemap.h | 2 ++
> > mm/filemap.c | 77 ++++++++++++++++++++++++++++++++++-------
> > mm/huge_memory.c | 6 ++--
> > mm/migrate.c | 12 +++----
> > mm/vmscan.c | 10 ++----
> > 5 files changed, 74 insertions(+), 33 deletions(-)
> >
>
> /**
> * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
wait for page ?
> * @page: The page to wait for.
> *
> * The caller should hold a reference on @page. They expect the page to
> * become unlocked relatively soon, but do not wish to hold up migration
> * (for example) by holding the reference while waiting for the page to
> * come unlocked. After this function returns, the caller should not
> * dereference @page.
> */
How about:
They expect the page to become unlocked relatively soon, but they can wait
for the page to come unlocked without holding the reference, to allow
other users of the @page (for example migration) to continue.
> (improvements gratefully received)
>
> > +void put_and_wait_on_page_locked(struct page *page)
> > +{
> > + wait_queue_head_t *q;
> > +
> > + page = compound_head(page);
> > + q = page_waitqueue(page);
> > + wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP);
> > +}
> > +
> > /**
> > * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
> > * @page: Page defining the wait queue of interest
>
--
Sincerely yours,
Mike.
On Tue 27-11-18 16:49:47, Cristopher Lameter wrote:
> On Tue, 27 Nov 2018, Mike Rapoport wrote:
>
> > > * @page: The page to wait for.
> > > *
> > > * The caller should hold a reference on @page. They expect the page to
> > > * become unlocked relatively soon, but do not wish to hold up migration
> > > * (for example) by holding the reference while waiting for the page to
> > > * come unlocked. After this function returns, the caller should not
> > > * dereference @page.
> > > */
> >
> > How about:
> >
> > They expect the page to become unlocked relatively soon, but they can wait
> > for the page to come unlocked without holding the reference, to allow
> > other users of the @page (for example migration) to continue.
>
> All of this seems a bit strange and it seems unnecessary? Maybe we need a
> better explanation?
>
> A process has no refcount on a page struct and is waiting for it to become
> unlocked? Why? Should it not simply ignore that page and continue? It
> cannot possibly do anything with the page since it does not hold a
> refcount.
So do you suggest busy waiting on the page under migration?
--
Michal Hocko
SUSE Labs
On Tue, Nov 27, 2018 at 8:49 AM Christopher Lameter <[email protected]> wrote:
>
> A process has no refcount on a page struct and is waiting for it to become
> unlocked? Why? Should it not simply ignore that page and continue?
The problem isn't that you can just "continue".
You need to *retry*.
And you can't just busy-loop. You want to wait until the page state
has changed, and _then_ retry.
Linus
On Tue, Nov 27, 2018 at 12:58:48PM +0200, Mike Rapoport wrote:
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 81adec8ee02c..575e16c037ca 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
> > wake_up_page_bit(page, bit);
> > }
> >
> > +/*
> > + * A choice of three behaviors for wait_on_page_bit_common():
> > + */
> > +enum behavior {
> > + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
> > + * __lock_page() waiting on then setting PG_locked.
> > + */
> > + SHARED, /* Hold ref to page and check the bit when woken, like
> > + * wait_on_page_writeback() waiting on PG_writeback.
> > + */
> > + DROP, /* Drop ref to page before wait, no check when woken,
> > + * like put_and_wait_on_page_locked() on PG_locked.
> > + */
> > +};
>
> Can we please make it:
>
> /**
> * enum behavior - a choice of three behaviors for wait_on_page_bit_common()
> */
The enum isn't used outside mm/filemap.c, so I'm not entirely sure that
including kernel-doc for it is a good idea.
On Tue, 27 Nov 2018, Mike Rapoport wrote:
> > * @page: The page to wait for.
> > *
> > * The caller should hold a reference on @page. They expect the page to
> > * become unlocked relatively soon, but do not wish to hold up migration
> > * (for example) by holding the reference while waiting for the page to
> > * come unlocked. After this function returns, the caller should not
> > * dereference @page.
> > */
>
> How about:
>
> They expect the page to become unlocked relatively soon, but they can wait
> for the page to come unlocked without holding the reference, to allow
> other users of the @page (for example migration) to continue.
All of this seems a bit strange and it seems unnecessary? Maybe we need a
better explanation?
A process has no refcount on a page struct and is waiting for it to become
unlocked? Why? Should it not simply ignore that page and continue? It
cannot possibly do anything with the page since it does not hold a
refcount.
In order to do anything with the page struct it would have to obtain a
refcount and possibly lock the page. That would already wait for the
page to become unlocked.
On Tue, 27 Nov 2018, Mike Rapoport wrote:
> On Mon, Nov 26, 2018 at 11:27:07AM -0800, Hugh Dickins wrote:
> >
> > +/*
> > + * A choice of three behaviors for wait_on_page_bit_common():
> > + */
> > +enum behavior {
> > + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
> > + * __lock_page() waiting on then setting PG_locked.
> > + */
> > + SHARED, /* Hold ref to page and check the bit when woken, like
> > + * wait_on_page_writeback() waiting on PG_writeback.
> > + */
> > + DROP, /* Drop ref to page before wait, no check when woken,
> > + * like put_and_wait_on_page_locked() on PG_locked.
> > + */
> > +};
>
> Can we please make it:
>
> /**
> * enum behavior - a choice of three behaviors for wait_on_page_bit_common()
> */
> enum behavior {
> /**
> * @EXCLUSIVE: Hold ref to page and take the bit when woken,
> * like __lock_page() waiting on then setting %PG_locked.
> */
> EXCLUSIVE,
> /**
> * @SHARED: Hold ref to page and check the bit when woken,
> * like wait_on_page_writeback() waiting on %PG_writeback.
> */
> SHARED,
> /**
> * @DROP: Drop ref to page before wait, no check when woken,
> * like put_and_wait_on_page_locked() on %PG_locked.
> */
> DROP,
> };
I'm with Matthew, I'd prefer not: the first looks a more readable,
less cluttered comment to me than the second: this is just an arg
to an internal helper in mm/filemap.c, itself not kernel-doc'ed.
But the comment is not there for me: if consensus is that the
second is preferable, then sure, we can change it over.
Hugh
On November 27, 2018 11:08:50 PM GMT+02:00, Hugh Dickins <[email protected]> wrote:
>On Tue, 27 Nov 2018, Mike Rapoport wrote:
>> On Mon, Nov 26, 2018 at 11:27:07AM -0800, Hugh Dickins wrote:
>> >
>> > +/*
>> > + * A choice of three behaviors for wait_on_page_bit_common():
>> > + */
>> > +enum behavior {
>> > + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
>> > + * __lock_page() waiting on then setting PG_locked.
>> > + */
>> > + SHARED, /* Hold ref to page and check the bit when woken, like
>> > + * wait_on_page_writeback() waiting on PG_writeback.
>> > + */
>> > + DROP, /* Drop ref to page before wait, no check when woken,
>> > + * like put_and_wait_on_page_locked() on PG_locked.
>> > + */
>> > +};
>>
>> Can we please make it:
>>
>> /**
>> * enum behavior - a choice of three behaviors for
>wait_on_page_bit_common()
>> */
>> enum behavior {
>> /**
>> * @EXCLUSIVE: Hold ref to page and take the bit when woken,
>> * like __lock_page() waiting on then setting %PG_locked.
>> */
>> EXCLUSIVE,
>> /**
>> * @SHARED: Hold ref to page and check the bit when woken,
>> * like wait_on_page_writeback() waiting on %PG_writeback.
>> */
>> SHARED,
>> /**
>> * @DROP: Drop ref to page before wait, no check when woken,
>> * like put_and_wait_on_page_locked() on %PG_locked.
>> */
>> DROP,
>> };
>
>I'm with Matthew, I'd prefer not: the first looks a more readable,
>less cluttered comment to me than the second: this is just an arg
>to an internal helper in mm/filemap.c, itself not kernel-doc'ed.
Hmm, indeed, making this kernel-doc would be premature.
I was thinking about including this in a future description of the filemap internals, but until that would get written lot of things may change.
>But the comment is not there for me: if consensus is that the
>second is preferable, then sure, we can change it over.
>
>Hugh
--
Sincerely yours,
Mike.
On Tue, Nov 27, 2018 at 01:08:50PM -0800, Hugh Dickins wrote:
> On Tue, 27 Nov 2018, Mike Rapoport wrote:
> > On Mon, Nov 26, 2018 at 11:27:07AM -0800, Hugh Dickins wrote:
> > >
> > > +/*
> > > + * A choice of three behaviors for wait_on_page_bit_common():
> > > + */
> > > +enum behavior {
> > > + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
> > > + * __lock_page() waiting on then setting PG_locked.
> > > + */
> > > + SHARED, /* Hold ref to page and check the bit when woken, like
> > > + * wait_on_page_writeback() waiting on PG_writeback.
> > > + */
> > > + DROP, /* Drop ref to page before wait, no check when woken,
> > > + * like put_and_wait_on_page_locked() on PG_locked.
> > > + */
> > > +};
> >
> > Can we please make it:
> >
> > /**
> > * enum behavior - a choice of three behaviors for wait_on_page_bit_common()
> > */
> > enum behavior {
> > /**
> > * @EXCLUSIVE: Hold ref to page and take the bit when woken,
> > * like __lock_page() waiting on then setting %PG_locked.
> > */
> > EXCLUSIVE,
> > /**
> > * @SHARED: Hold ref to page and check the bit when woken,
> > * like wait_on_page_writeback() waiting on %PG_writeback.
> > */
> > SHARED,
> > /**
> > * @DROP: Drop ref to page before wait, no check when woken,
> > * like put_and_wait_on_page_locked() on %PG_locked.
> > */
> > DROP,
> > };
>
> I'm with Matthew, I'd prefer not: the first looks a more readable,
> less cluttered comment to me than the second: this is just an arg
> to an internal helper in mm/filemap.c, itself not kernel-doc'ed.
>
> But the comment is not there for me: if consensus is that the
> second is preferable, then sure, we can change it over.
For something which is internal to a single file I strongly prefer
the first as well.
--
Cheers,
Joey Pabalinas
On 11/26/18 8:27 PM, Hugh Dickins wrote:
> Waiting on a page migration entry has used wait_on_page_locked() all
> along since 2006: but you cannot safely wait_on_page_locked() without
> holding a reference to the page, and that extra reference is enough to
> make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
> faults on the entry before migrate_page_move_mapping() gets there.
>
> And that failure is retried nine times, amplifying the pain when
> trying to migrate a popular page. With a single persistent faulter,
> migration sometimes succeeds; with two or three concurrent faulters,
> success becomes much less likely (and the more the page was mapped,
> the worse the overhead of unmapping and remapping it on each try).
>
> This is especially a problem for memory offlining, where the outer
> level retries forever (or until terminated from userspace), because
> a heavy refault workload can trigger an endless loop of migration
> failures. wait_on_page_locked() is the wrong tool for the job.
>
> David Herrmann (but was he the first?) noticed this issue in 2014:
> https://marc.info/?l=linux-mm&m=140110465608116&w=2
>
> Tim Chen started a thread in August 2017 which appears relevant:
> https://marc.info/?l=linux-mm&m=150275941014915&w=2
> where Kan Liang went on to implicate __migration_entry_wait():
> https://marc.info/?l=linux-mm&m=150300268411980&w=2
> and the thread ended up with the v4.14 commits:
> 2554db916586 ("sched/wait: Break up long wake list walk")
> 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
>
> Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
> https://marc.info/?l=linux-mm&m=154217936431300&w=2
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.
>
> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it. That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().
>
> Add interface put_and_wait_on_page_locked() to do this, using "behavior"
> enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
>
> __migration_entry_wait() still has to take a brief reference to the
> page, prior to calling put_and_wait_on_page_locked(): but now that it
> is dropped before waiting, the chance of impeding page migration is
> very much reduced. Should we perhaps disable preemption across this?
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
> survived a lot of testing before that showed up. PageWaiters may have
> been set by wait_on_page_bit_common(), and the reference dropped, just
> before shrink_page_list() succeeds in freezing its last page reference:
> in such a case, unlock_page() must be used. Follow the suggestion from
> Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
> that optimization predates PageWaiters, and won't buy much these days;
> but we can reinstate it for the !PageWaiters case if anyone notices.
>
> It does raise the question: should vmscan.c's is_page_cache_freeable()
> and __remove_mapping() now treat a PageWaiters page as if an extra
> reference were held? Perhaps, but I don't think it matters much, since
> shrink_page_list() already had to win its trylock_page(), so waiters are
> not very common there: I noticed no difference when trying the bigger
> change, and it's surely not needed while put_and_wait_on_page_locked()
> is only used for page migration.
>
> Reported-and-tested-by: Baoquan He <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: Andrea Arcangeli <[email protected]>
For the record, anyone backporting this to older kernels should make
sure to also include 605ca5ede764 ("mm/huge_memory.c: reorder operations
in __split_huge_page_tail()") or they are in for a lot of fun, like me.
Long story [1] short, Konstantin was correct in 605ca5ede764 changelog,
although it wasn't the main known issue he was fixing:
clear_compound_head() also must be called before unfreezing page
reference because after successful get_page_unless_zero() might follow
put_page() which needs correct compound_head().
Which is exactly what happens in __migration_entry_wait():
if (!get_page_unless_zero(page))
goto out;
pte_unmap_unlock(ptep, ptl);
put_and_wait_on_page_locked(page); -> does put_page(page)
while waiting on the THP split (which inserts those migration entries)
to finish. Before put_and_wait_on_page_locked() it would wait first, and
only then do put_page() on a page that's no longer tail page, so it
would work out despite the dangerous get_page_unless_zero() on a tail
page. Now it doesn't :)
Now if only 605ca5ede764 had a CC:stable and a Fixes: tag... Machine
Learning won this round though, because 605ca5ede764 was added to 4.14
stable by Sasha...
[1] https://bugzilla.suse.com/show_bug.cgi?id=1119962#c16
On Thu, 10 Jan 2019, Vlastimil Babka wrote:
>
> For the record, anyone backporting this to older kernels should make
> sure to also include 605ca5ede764 ("mm/huge_memory.c: reorder operations
> in __split_huge_page_tail()") or they are in for a lot of fun, like me.
Thanks a lot for alerting us all to this, Vlastimil. Yes, I consider
Konstantin's 605ca5ede764 a must-have, and so had it already in all
the trees on which I was testing put_and_wait_on_page_locked(),
without being aware of the critical role it was playing.
But you do enjoy fun, don't you? So I shouldn't apologize :)
>
> Long story [1] short, Konstantin was correct in 605ca5ede764 changelog,
> although it wasn't the main known issue he was fixing:
>
> clear_compound_head() also must be called before unfreezing page
> reference because after successful get_page_unless_zero() might follow
> put_page() which needs correct compound_head().
>
> Which is exactly what happens in __migration_entry_wait():
>
> if (!get_page_unless_zero(page))
> goto out;
> pte_unmap_unlock(ptep, ptl);
> put_and_wait_on_page_locked(page); -> does put_page(page)
>
> while waiting on the THP split (which inserts those migration entries)
> to finish. Before put_and_wait_on_page_locked() it would wait first, and
> only then do put_page() on a page that's no longer tail page, so it
> would work out despite the dangerous get_page_unless_zero() on a tail
> page. Now it doesn't :)
It took me a while to follow there, but yes, agreed.
>
> Now if only 605ca5ede764 had a CC:stable and a Fixes: tag... Machine
> Learning won this round though, because 605ca5ede764 was added to 4.14
> stable by Sasha...
I'm proud to have passed the Turing test in reverse, but actually
that was me, not ML. My 173d9d9fd3dd ("mm/huge_memory: splitting set
mapping+index before unfreeze") in 4.20 built upon Konstantin's, so I
included his as a precursor when sending the stable guys pre-XArray
backports. So Konstantin's is even in 4.9 stable now.
Hugh