2017-08-15 01:10:08

by Tim Chen

[permalink] [raw]
Subject: [PATCH 1/2] sched/wait: Break up long wake list walk

We encountered workloads that have very long wake up lists on large
systems. A waker takes a long time to traverse the entire wake list and
execute all the wake functions.

We saw page wait lists that are up to 3700+ entries long in tests of large
4 and 8 socket systems. It took 0.8 sec to traverse such a list during
wake up. Any other CPU that contends for the list spin lock will spin
for a long time. As a page wait list is shared by many pages, it can
get very long on systems with large memory.

Multiple waking CPUs are queued up behind the lock, and the last one
queued has to wait until all the CPUs ahead of it have finished their
wake ups.

The page wait list is traversed with interrupts disabled, which causes
various problems. This was the original cause of the NMI watchdog
triggering in https://patchwork.kernel.org/patch/9800303/ ; only
extending the NMI watchdog timeout helped there.

This patch bookmarks the waker's scan position in the wake list and
breaks up the wake up walk, allowing other CPUs access to the list
before the waker resumes its walk down the rest of the wait list. It
lowers the interrupt and rescheduling latency.

This patch also provides a performance boost when combined with the next
patch, which breaks up the page wakeup list walk. We saw a 22% improvement
in the will-it-scale file pread2 test on a Xeon Phi system running 256
threads.

Thanks.

Tim

Reported-by: Kan Liang <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
include/linux/wait.h | 9 +++++++
kernel/sched/wait.c | 76 ++++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 71 insertions(+), 14 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 5b74e36..588a5d2 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -18,6 +18,14 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
/* wait_queue_entry::flags */
#define WQ_FLAG_EXCLUSIVE 0x01
#define WQ_FLAG_WOKEN 0x02
+#define WQ_FLAG_BOOKMARK 0x04
+
+/*
+ * Scan threshold to break wait queue walk.
+ * This allows a waker to take a break from holding the
+ * wait queue lock during the wait queue walk.
+ */
+#define WAITQUEUE_WALK_BREAK_CNT 64

/*
* A single wait-queue entry structure:
@@ -947,6 +955,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout);
int woken_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
+int bookmark_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);

#define DEFINE_WAIT_FUNC(name, function) \
struct wait_queue_entry name = { \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 17f11c6..d02e6c6 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -63,17 +63,64 @@ EXPORT_SYMBOL(remove_wait_queue);
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static void __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
- int nr_exclusive, int wake_flags, void *key)
+static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
+ int nr_exclusive, int wake_flags, void *key,
+ wait_queue_entry_t *bookmark)
{
wait_queue_entry_t *curr, *next;
+ int cnt = 0;
+
+ if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
+ curr = list_next_entry(bookmark, entry);
+
+ list_del(&bookmark->entry);
+ bookmark->flags = 0;
+ } else
+ curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
+
+ if (&curr->entry == &wq_head->head)
+ return nr_exclusive;

- list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
+ list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
unsigned flags = curr->flags;

+ if (curr->flags & WQ_FLAG_BOOKMARK)
+ continue;
+
if (curr->func(curr, mode, wake_flags, key) &&
(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
+
+ if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
+ (&next->entry != &wq_head->head)) {
+ bookmark->flags = WQ_FLAG_BOOKMARK;
+ list_add_tail(&bookmark->entry, &next->entry);
+ break;
+ }
+ }
+ return nr_exclusive;
+}
+
+static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
+ int nr_exclusive, int wake_flags, void *key)
+{
+ unsigned long flags;
+ wait_queue_entry_t bookmark;
+
+ bookmark.flags = 0;
+ bookmark.private = NULL;
+ bookmark.func = bookmark_wake_function;
+ INIT_LIST_HEAD(&bookmark.entry);
+
+ spin_lock_irqsave(&wq_head->lock, flags);
+ nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive, wake_flags, key, &bookmark);
+ spin_unlock_irqrestore(&wq_head->lock, flags);
+
+ while (bookmark.flags & WQ_FLAG_BOOKMARK) {
+ spin_lock_irqsave(&wq_head->lock, flags);
+ nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
+ wake_flags, key, &bookmark);
+ spin_unlock_irqrestore(&wq_head->lock, flags);
}
}

@@ -90,11 +137,7 @@ static void __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, void *key)
{
- unsigned long flags;
-
- spin_lock_irqsave(&wq_head->lock, flags);
- __wake_up_common(wq_head, mode, nr_exclusive, 0, key);
- spin_unlock_irqrestore(&wq_head->lock, flags);
+ __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
EXPORT_SYMBOL(__wake_up);

@@ -103,13 +146,13 @@ EXPORT_SYMBOL(__wake_up);
*/
void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr)
{
- __wake_up_common(wq_head, mode, nr, 0, NULL);
+ __wake_up_common(wq_head, mode, nr, 0, NULL, NULL);
}
EXPORT_SYMBOL_GPL(__wake_up_locked);

void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
- __wake_up_common(wq_head, mode, 1, 0, key);
+ __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
EXPORT_SYMBOL_GPL(__wake_up_locked_key);

@@ -133,7 +176,6 @@ EXPORT_SYMBOL_GPL(__wake_up_locked_key);
void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, void *key)
{
- unsigned long flags;
int wake_flags = 1; /* XXX WF_SYNC */

if (unlikely(!wq_head))
@@ -142,9 +184,7 @@ void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode,
if (unlikely(nr_exclusive != 1))
wake_flags = 0;

- spin_lock_irqsave(&wq_head->lock, flags);
- __wake_up_common(wq_head, mode, nr_exclusive, wake_flags, key);
- spin_unlock_irqrestore(&wq_head->lock, flags);
+ __wake_up_common_lock(wq_head, mode, nr_exclusive, wake_flags, key);
}
EXPORT_SYMBOL_GPL(__wake_up_sync_key);

@@ -326,6 +366,14 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
}
EXPORT_SYMBOL(autoremove_wake_function);

+int bookmark_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+{
+ /* bookmark only, no real wake up */
+ BUG();
+ return 0;
+}
+EXPORT_SYMBOL(bookmark_wake_function);
+
static inline bool is_kthread_should_stop(void)
{
return (current->flags & PF_KTHREAD) && kthread_should_stop();
--
2.9.4


2017-08-15 01:10:13

by Tim Chen

[permalink] [raw]
Subject: [PATCH 2/2] sched/wait: Introduce lock breaker in wake_up_page_bit

Now that we have added breaks in the wait queue scan and allow a
bookmark to record the scan position, put this logic to use in the
wake_up_page_bit function.

We can have very long page wait lists on large systems where multiple
pages share the same wait list. We break up the wake up walk here to give
other CPUs a chance to access the list, and to avoid keeping interrupts
disabled while traversing the list for too long. This reduces the
interrupt and rescheduling latency, and excessive page wait queue lock
hold time.

We also have to add logic to detect new arrivals so that the wait bit on
the page is cleared only when there are no new waiters for the page.
The breaks in the wait list walk open windows for new waiters to arrive
on the wait list during the wake ups. They could be added at the head
or tail of the wait queue, depending on whether they are exclusive, in
prepare_to_wait_event. So we can't clear the PageWaiters flag if there
are new arrivals during the wake up process; otherwise we would skip
wake_up_page when there are still entries to be woken up.

Signed-off-by: Tim Chen <[email protected]>
---
include/linux/wait.h | 7 +++++++
kernel/sched/wait.c | 7 +++++++
mm/filemap.c | 36 ++++++++++++++++++++++++++++++++++--
3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 588a5d2..b4de5fa 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -19,6 +19,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
#define WQ_FLAG_EXCLUSIVE 0x01
#define WQ_FLAG_WOKEN 0x02
#define WQ_FLAG_BOOKMARK 0x04
+#define WQ_FLAG_ARRIVALS 0x08

/*
* Scan threshold to break wait queue walk.
@@ -39,6 +40,8 @@ struct wait_queue_entry {

struct wait_queue_head {
spinlock_t lock;
+ unsigned int waker;
+ unsigned int flags;
struct list_head head;
};
typedef struct wait_queue_head wait_queue_head_t;
@@ -59,6 +62,8 @@ struct task_struct;

#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
.lock = __SPIN_LOCK_UNLOCKED(name.lock), \
+ .waker = 0, \
+ .flags = 0, \
.head = { &(name).head, &(name).head } }

#define DECLARE_WAIT_QUEUE_HEAD(name) \
@@ -192,6 +197,8 @@ __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq

void __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
+void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
+ unsigned int mode, void *key, wait_queue_entry_t *bookmark);
void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr);
void __wake_up_sync(struct wait_queue_head *wq_head, unsigned int mode, int nr);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index d02e6c6..c665b70 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -156,6 +156,13 @@ void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, vo
}
EXPORT_SYMBOL_GPL(__wake_up_locked_key);

+void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
+ unsigned int mode, void *key, wait_queue_entry_t *bookmark)
+{
+ __wake_up_common(wq_head, mode, 1, 0, key, bookmark);
+}
+EXPORT_SYMBOL_GPL(__wake_up_locked_key_bookmark);
+
/**
* __wake_up_sync_key - wake up threads blocked on a waitqueue.
* @wq_head: the waitqueue
diff --git a/mm/filemap.c b/mm/filemap.c
index a497024..a600981 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -920,13 +920,41 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
wait_queue_head_t *q = page_waitqueue(page);
struct wait_page_key key;
unsigned long flags;
+ wait_queue_entry_t bookmark;

key.page = page;
key.bit_nr = bit_nr;
key.page_match = 0;

+ bookmark.flags = 0;
+ bookmark.private = NULL;
+ bookmark.func = bookmark_wake_function;
+ INIT_LIST_HEAD(&bookmark.entry);
+
+ spin_lock_irqsave(&q->lock, flags);
+ /* q->flags will be set to WQ_FLAG_ARRIVALS if items added to wait queue */
+ if (!q->waker)
+ q->flags &= ~WQ_FLAG_ARRIVALS;
+ ++ q->waker;
+ __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
+ if (!(bookmark.flags & WQ_FLAG_BOOKMARK))
+ goto finish;
+ /*
+ * Take a breather from holding the lock,
+ * allow pages that finish wake up asynchronously
+ * to acquire the lock and remove themselves
+ * from wait queue
+ */
+ spin_unlock_irqrestore(&q->lock, flags);
+
+again:
spin_lock_irqsave(&q->lock, flags);
- __wake_up_locked_key(q, TASK_NORMAL, &key);
+ __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
+ if (bookmark.flags & WQ_FLAG_BOOKMARK) {
+ spin_unlock_irqrestore(&q->lock, flags);
+ goto again;
+ }
+finish:
/*
* It is possible for other pages to have collided on the waitqueue
* hash, so in that case check for a page match. That prevents a long-
@@ -936,7 +964,8 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
* and removed them from the waitqueue, but there are still other
* page waiters.
*/
- if (!waitqueue_active(q) || !key.page_match) {
+ if (!waitqueue_active(q) ||
+ (!key.page_match && (q->waker == 1) && !(q->flags & WQ_FLAG_ARRIVALS))) {
ClearPageWaiters(page);
/*
* It's possible to miss clearing Waiters here, when we woke
@@ -946,6 +975,7 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
* That's okay, it's a rare case. The next waker will clear it.
*/
}
+ -- q->waker;
spin_unlock_irqrestore(&q->lock, flags);
}

@@ -976,6 +1006,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
__add_wait_queue_entry_tail_exclusive(q, wait);
else
__add_wait_queue(q, wait);
+ q->flags = WQ_FLAG_ARRIVALS;
SetPageWaiters(page);
}

@@ -1041,6 +1072,7 @@ void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter)
spin_lock_irqsave(&q->lock, flags);
__add_wait_queue(q, waiter);
SetPageWaiters(page);
+ q->flags = WQ_FLAG_ARRIVALS;
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL_GPL(add_page_wait_queue);
--
2.9.4

2017-08-15 01:48:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen <[email protected]> wrote:
> We encountered workloads that have very long wake up list on large
> systems. A waker takes a long time to traverse the entire wake list and
> execute all the wake functions.
>
> We saw page wait list that are up to 3700+ entries long in tests of large
> 4 and 8 socket systems. It took 0.8 sec to traverse such list during
> wake up. Any other CPU that contends for the list spin lock will spin
> for a long time. As page wait list is shared by many pages so it could
> get very long on systems with large memory.

I really dislike this patch.

The patch seems a band-aid for really horrible kernel behavior, rather
than fixing the underlying problem itself.

Now, it may well be that we do end up needing this band-aid in the
end, so this isn't a NAK of the patch per se. But I'd *really* like to
see if we can fix the underlying cause for what you see somehow..

In particular, if this is about the page wait table, maybe we can just
make the wait table bigger. IOW, are people actually waiting on the
*same* page, or are they mainly waiting on totally different pages,
just hashing to the same wait queue?

Because right now that page wait table is a small fixed size, and the
only reason it's a small fixed size is that nobody reported any issues
with it - particularly since we now avoid the wait table entirely for
the common cases by having that "contention" bit.

But it really is a *small* table. We literally have

#define PAGE_WAIT_TABLE_BITS 8

so it's just 256 entries. We could easily make it much bigger, if we are
actually seeing a lot of collisions.

We *used* to have a very complex per-zone thing for bit-waitqueues,
but that was because we got lots and lots of contention issues, and
everybody *always* touched the wait-queues whether they waited or not
(so being per-zone was a big deal).

We got rid of all that per-zone complexity when the normal case didn't
hit in the page wait queues at all, but we may have over-done the
simplification a bit since nobody showed any issue.

In particular, we used to size the per-zone thing by amount of memory.
We could easily re-introduce that for the new simpler page queues.

The page_waitqueue() is a simple helper function inside mm/filemap.c,
and thanks to the per-page "do we have actual waiters" bit that we
have now, we can actually afford to make it bigger and more complex
now if we want to.

What happens to your load if you just make that table bigger? You can
literally test by just changing the constant from 8 to 16 or
something, making us use twice as many bits for hashing. A "real"
patch would size it by amount of memory, but just for testing the
contention on your load, you can do the hacky one-liner.

Linus
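
For reference, the table being discussed here is a small fixed-size array
of wait queue heads in mm/filemap.c. The sketch below paraphrases the
4.13-era code (treat the details as approximate); the "hacky one-liner"
test is simply bumping PAGE_WAIT_TABLE_BITS:

#define PAGE_WAIT_TABLE_BITS 8		/* bump to 16 for the quick contention test */
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *page_waitqueue(struct page *page)
{
	/* every page that hashes to the same bucket shares one wait list */
	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
}

Going from 8 to 16 bits grows the table from 256 to 65536 wait queue
heads, which only helps if the waiters are actually spread across
different pages rather than piled up on a single hot page.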

2017-08-15 02:27:44

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Mon, Aug 14, 2017 at 06:48:06PM -0700, Linus Torvalds wrote:
> On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen <[email protected]> wrote:
> > We encountered workloads that have very long wake up list on large
> > systems. A waker takes a long time to traverse the entire wake list and
> > execute all the wake functions.
> >
> > We saw page wait list that are up to 3700+ entries long in tests of large
> > 4 and 8 socket systems. It took 0.8 sec to traverse such list during
> > wake up. Any other CPU that contends for the list spin lock will spin
> > for a long time. As page wait list is shared by many pages so it could
> > get very long on systems with large memory.
>
> I really dislike this patch.
>
> The patch seems a band-aid for really horrible kernel behavior, rather
> than fixing the underlying problem itself.
>
> Now, it may well be that we do end up needing this band-aid in the
> end, so this isn't a NAK of the patch per se. But I'd *really* like to
> see if we can fix the underlying cause for what you see somehow..

We could try it and it may even help in this case and it may
be a good idea in any case on such a system, but:

- Even with a large hash table it might be that by chance all CPUs
will be queued up on the same page
- There are a lot of other wait queues in the kernel and they all
could run into a similar problem
- I suspect it's even possible to construct it from user space
as a kind of DoS attack

Given all that I don't see any alternative to fixing wait queues somehow.
It's just that systems are now so big that they're starting to
stretch the tried old primitives.

Now in one case (on a smaller system) that we debugged we had:

- 4S system with 208 logical threads
- during the test the wait queue length was 3700 entries.
- the last CPUs queued had to wait roughly 0.8s

This gives a budget of roughly 1us per wake up: 0.8s spread over ~208
waking CPUs each walking ~3700 entries comes to about 1us per wake
function call.

It could be that we could find some way to do "bulk wakeups"
in the scheduler that are much cheaper, and switch to them if
there are a lot of entries in the wait queues.

With that it may be possible to do a wake up in less than 1us.

But even with that it will be difficult to beat the scaling
curve. If systems get bigger again (and they will be) it
could easily break again, as the budget gets smaller and smaller.

Also disabling interrupts for that long is just nasty.

Given all that I still think a lock breaker of some form is needed.

-Andi

2017-08-15 02:52:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Mon, Aug 14, 2017 at 7:27 PM, Andi Kleen <[email protected]> wrote:
>
> We could try it and it may even help in this case and it may
> be a good idea in any case on such a system, but:
>
> - Even with a large hash table it might be that by chance all CPUs
> will be queued up on the same page
> - There are a lot of other wait queues in the kernel and they all
> could run into a similar problem
> - I suspect it's even possible to construct it from user space
> as a kind of DoS attack

Maybe. Which is why I didn't NAK the patch outright.

But I don't think it's the solution for the scalability issue you guys
found. It's just a workaround, and it's likely a bad one at that.

> Now in one case (on a smaller system) we debugged we had
>
> - 4S system with 208 logical threads
> - during the test the wait queue length was 3700 entries.
> - the last CPUs queued had to wait roughly 0.8s
>
> This gives a budget of roughly 1us per wake up.

I'm not at all convinced that follows.

When bad scaling happens, you often end up hitting quadratic (or
worse) behavior. So if you are able to fix the scaling by some fixed
amount, it's possible that almost _all_ the problems just go away.

The real issue is that "3700 entries" part. What was it that actually
triggered them? In particular, if it's just a hashing issue, and we
can trivially just make the hash table be bigger (256 entries is
*tiny*) then the whole thing goes away.

Which is why I really want to hear what happens if you just change
PAGE_WAIT_TABLE_BITS to 16. The right fix would be to just make it
scale by memory, but before we even do that, let's just look at what
happens when you increase the size the stupid way.

Maybe those 3700 entries will just shrink down to 14 entries because
the hash just works fine and 256 entries was just much much too small
when you have hundreds of thousands of threads or whatever.

But it is *also* possible that it's actually all waiting on the exact
same page, and there's some way to do a thundering herd on the page
lock bit, for example. But then it would be really good to hear what
it is that triggers that.

The thing is, the reason we perform well on many loads in the kernel
is that I have *always* pushed back against bad workarounds.

We do *not* do lock back-off in our locks, for example, because I told
people that lock contention gets fixed by not contending, not by
trying to act better when things have already become bad.

This is the same issue. We don't "fix" things by papering over some
symptom. We try to fix the _actual_ underlying problem. Maybe there is
some caller that can simply be rewritten. Maybe we can do other tricks
than just make the wait tables bigger. But we should not say "3700
entries is ok, let's just make that sh*t be interruptible".

That is what the patch does now, and that is why I dislike the patch.

So I _am_ NAK'ing the patch if nobody is willing to even try alternatives.

Because a band-aid is ok for "some theoretical worst-case behavior".

But a band-aid is *not* ok for "we can't even be bothered to try to
figure out the right thing, so we're just adding this hack and leaving
it".

Linus

2017-08-15 03:15:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

> That is what the patch does now, and that is why I dislike the patch.
>
> So I _am_ NAK'ing the patch if nobody is willing to even try alternatives.

Ok, perhaps a larger hash table is the right solution for this one.

But what should we do when some other (non page) wait queue runs into the
same problem?

-Andi

2017-08-15 03:28:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Mon, Aug 14, 2017 at 8:15 PM, Andi Kleen <[email protected]> wrote:
> But what should we do when some other (non page) wait queue runs into the
> same problem?

Hopefully the same: root-cause it.

Once you have a test-case, it should generally be fairly simple to do
with profiles, just seeing who the caller is when ttwu() (or whatever
it is that ends up being the most noticeable part of the wakeup chain)
shows up very heavily.

And I think that ends up being true whether the "break up long chains"
patch goes in or not. Even if we end up allowing interrupts in the
middle, a long wait-queue is a problem.

I think the "break up long chains" thing may be the right thing
against actual malicious attacks, but not for any actual real
benchmark or load.

I don't think we normally have cases of long wait-queues, though. At
least not the kinds that cause problems. The real (and valid)
thundering herd cases should already be using exclusive waiters that
only wake up one process at a time.

The page bit-waiting is hopefully special. As mentioned, we used to
have some _really_ special code for it for other reasons, and I
suspect you see this problem with them because we over-simplified it
from being a per-zone dynamically sized one (where the per-zone thing
caused both performance problems and actual bugs) to being that
"static small array".

So I think/hope that just re-introducing some dynamic sizing will help
sufficiently, and that this really is an odd and unusual case.

Linus

2017-08-15 19:05:42

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On 08/14/2017 08:28 PM, Linus Torvalds wrote:
> On Mon, Aug 14, 2017 at 8:15 PM, Andi Kleen <[email protected]> wrote:
>> But what should we do when some other (non page) wait queue runs into the
>> same problem?
>
> Hopefully the same: root-cause it.
>
> Once you have a test-case, it should generally be fairly simple to do
> with profiles, just seeing who the caller is when ttwu() (or whatever
> it is that ends up being the most noticeable part of the wakeup chain)
> shows up very heavily.

We have a test case but it is a customer workload. We'll try to get
a bit more info.

>
> And I think that ends up being true whether the "break up long chains"
> patch goes in or not. Even if we end up allowing interrupts in the
> middle, a long wait-queue is a problem.
>
> I think the "break up long chains" thing may be the right thing
> against actual malicious attacks, but not for any actual real
> benchmark or load.

This is a concern from our customer as we could trigger the watchdog timer
by running user space workloads.

>
> I don't think we normally have cases of long wait-queues, though. At
> least not the kinds that cause problems. The real (and valid)
> thundering herd cases should already be using exclusive waiters that
> only wake up one process at a time.
>
> The page bit-waiting is hopefully special. As mentioned, we used to
> have some _really_ special code for it for other reasons, and I
> suspect you see this problem with them because we over-simplified it
> from being a per-zone dynamically sized one (where the per-zone thing
> caused both performance problems and actual bugs) to being that
> "static small array".
>
> So I think/hope that just re-introducing some dynamic sizing will help
> sufficiently, and that this really is an odd and unusual case.

I agree that dynamic sizing makes a lot of sense. We'll check to
see if a bigger hash table helps, assuming that the
waiters are distributed among different pages for our test case.

Thanks.

Tim

2017-08-15 19:41:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 15, 2017 at 12:05 PM, Tim Chen <[email protected]> wrote:
>
> We have a test case but it is a customer workload. We'll try to get
> a bit more info.

Ok. Being a customer workload is lovely in the sense that it is
actually a real load, not just a microbenchmark.

But yeah, it makes it harder to describe and show what's going on.

But you do have access to that workload internally at Intel, and can
at least test things out that way, I assume?

> I agree that dynamic sizing makes a lot of sense. We'll check to
> see if additional size to the hash table helps, assuming that the
> waiters are distributed among different pages for our test case.

One more thing: it turns out that there are two very different kinds
of users of the page waitqueue.

There's the "wait_on_page_bit*()" users - people waiting for a page to
unlock or stop being under writeback etc.

Those *should* generally be limited to just one wait-queue per waiting
thread, I think.

Then there is the "cachefiles" use, which ends up adding a lot of
waitqueues to a lot of paghes to monitor their state.

Honestly, I think that second use is a horrible hack. It basically adds a
waitqueue to each page in order to get a callback when it is ready,
and then copies it.

And it does this for things like cachefiles_read_backing_file(), so
you might have a huge list of pages for copying a large file, and it
adds a callback for every single one of those all at once.

The fix for the cachefiles behavior might be very different from the
fix to the "normal" operations. But making the wait queue hash tables
bigger _should_ help both cases.

We might also want to hash based on the actual bit we're waiting for.
Right now we just do a

wait_queue_head_t *q = page_waitqueue(page);

but I think the actual bit is always explicit (well, the cachefiles
interface doesn't have that, but looking at the callback for that, it
really only cares about PG_locked, so it *should* make the bit it is
waiting for explicit).

So if we have unnecessarily collisions because we have waiters looking
at different bits of the same page, we could just hash in the bit
number that we're waiting for too.

Linus

2017-08-15 19:47:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 15, 2017 at 12:41 PM, Linus Torvalds
<[email protected]> wrote:
>
> So if we have unnecessarily collisions because we have waiters looking
> at different bits of the same page, we could just hash in the bit
> number that we're waiting for too.

Oh, nope, we can't do that, because we only have one "PageWaters" bit
per page, and it is shared across all bits we're waiting for on that
page.

So collisions between different bits on the same page are inevitable,
and we just need to make sure the hash table is big enough that we
don't get unnecessary collisions between different pages.

Linus

2017-08-15 22:47:42

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Mon, 14 Aug 2017, Linus Torvalds wrote:

>On Mon, Aug 14, 2017 at 8:15 PM, Andi Kleen <[email protected]> wrote:
>> But what should we do when some other (non page) wait queue runs into the
>> same problem?
>
>Hopefully the same: root-cause it.

Or you can always use wake_qs, which exist _exactly_ for the issues you
are running into. Note that Linus does not want them in the generic
wait code:

https://lkml.org/lkml/2017/7/7/605

... but you can always use them on your own if you really need to (ie locks,
ipc, etc).

Thanks,
Davidlohr
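
For context, the wake_q pattern Davidlohr refers to collects the tasks to
wake while holding the lock and issues the actual wakeups only after the
lock is dropped, so the lock hold time does not grow with the cost of each
wakeup. A minimal sketch of that pattern follows; the waiter structure and
function names are made up for illustration, only DEFINE_WAKE_Q(),
wake_q_add() and wake_up_q() are the real API:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/sched/wake_q.h>

struct my_waiter {				/* hypothetical waiter bookkeeping */
	struct list_head list;
	struct task_struct *task;
};

static void wake_all_waiters(spinlock_t *lock, struct list_head *waiters)
{
	DEFINE_WAKE_Q(wake_q);
	struct my_waiter *w, *tmp;

	spin_lock(lock);
	list_for_each_entry_safe(w, tmp, waiters, list) {
		list_del_init(&w->list);
		wake_q_add(&wake_q, w->task);	/* cheap: just takes a task reference */
	}
	spin_unlock(lock);

	wake_up_q(&wake_q);			/* the real wakeups happen outside the lock */
}

As the following replies point out, this only fits when every queued entry
should be woken unconditionally, which is not the case for the hashed,
key-matched page wait queue.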

2017-08-15 22:56:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 15, 2017 at 3:47 PM, Davidlohr Bueso <[email protected]> wrote:
>
> Or you can always use wake_qs; which exists _exactly_ for the issues you
> are running into

Except they really don't actually work for this case, exactly because
they also simplify away "minor" details like exclusive vs
non-exclusive etc.

The page wait-queue very much has a mix of "wake all" and "wake one" semantics.

But I guess we could have two queues per page hash - one that is
wake-once, and one that is wake-all.

Which might solve the technical problem.

And if somebody then rewrote the swait code to not use the
unbelievably broken and misleading naming, it might even be
acceptable.

But as is, that swait code is broken shit, and absolutely does *not*
need new users. We got rid of one user, and the KVM people already
admitted that one of the remaining users is broken and doesn't
actually want swait at all and should use "wake_up_process()" instead
since there is no actual queuing going on.

In the meantime, stop peddling crap. That thing really is broken.

Linus

2017-08-15 22:57:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 15, 2017 at 3:56 PM, Linus Torvalds
<[email protected]> wrote:
>
> Except they really don't actually work for this case, exactly because
> they also simplify away "minor" details like exclusive vs
> non-exclusive etc.
>
> The page wait-queue very much has a mix of "wake all" and "wake one" semantics.

Oh, and the page wait-queue really needs that key argument too, which
is another thing that swait queue code got rid of in the name of
simplicity.

So no. The swait code is absolutely _entirely_ the wrong thing to use.

Linus
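
Concretely, the "key" Linus refers to carries the page and bit number
being woken, and the per-entry wake function filters on it. Roughly,
paraphrasing the 4.13-era mm/filemap.c from memory (details may differ
slightly):

struct wait_page_key {
	struct page *page;
	int bit_nr;
	int page_match;
};

struct wait_page_queue {
	struct page *page;
	int bit_nr;
	wait_queue_entry_t wait;
};

static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
{
	struct wait_page_key *key = arg;
	struct wait_page_queue *wait_page =
		container_of(wait, struct wait_page_queue, wait);

	if (wait_page->page != key->page)
		return 0;		/* a different page hashed to the same queue */
	key->page_match = 1;

	if (wait_page->bit_nr != key->bit_nr)
		return 0;		/* same page, but waiting on a different bit */

	return autoremove_wake_function(wait, mode, sync, key);
}

So a wakeup for one page/bit has to walk past every collided waiter on the
same hash bucket, which is why a "wake the first entry" model like swait's
cannot be used here.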

2017-08-15 23:50:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 15, 2017 at 3:57 PM, Linus Torvalds
<[email protected]> wrote:
>
> Oh, and the page wait-queue really needs that key argument too, which
> is another thing that swait queue code got rid of in the name of
> simplicity.

Actually, it gets worse.

Because the page wait queues are hashed, it's not an all-or-nothing
thing even for the non-exclusive cases, and it's not a "wake up first
entry" for the exclusive case. Both have to be conditional on the wait
entry actually matching the page and bit in question.

So no way to use swait, or any of the lockless queuing code in general
(so we can't do some clever private wait-list using llist.h either).

End result: it looks like you fairly fundamentally do need to use a
lock over the whole list traversal (like the standard wait-queues),
and then add a cursor entry like Tim's patch if dropping the lock in
the middle.

Anyway, looking at the old code, we *used* to limit the page wait hash
table to 4k entries, and we used to have one hash table per memory
zone.

The per-zone thing didn't work at all for the generic bit-waitqueues,
because of how people used them on virtual addresses on the stack.

But it *could* work for the page waitqueues, which are now a totally
separate entity, and is obviously always physically addressed (since
the indexing is by "struct page" pointer), and doesn't have that
issue.

So I guess we could re-introduce the notion of per-zone page waitqueue
hash tables. It was disgusting to allocate and free though (and hooked
into the memory hotplug code).

So I'd still hope that we can instead just have one larger hash table,
and that is sufficient for the problem.

Linus

2017-08-16 23:23:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

Linus Torvalds <[email protected]> writes:

> On Tue, Aug 15, 2017 at 3:57 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Oh, and the page wait-queue really needs that key argument too, which
>> is another thing that swait queue code got rid of in the name of
>> simplicity.
>
> Actually, it gets worse.
>
> Because the page wait queues are hashed, it's not an all-or-nothing
> thing even for the non-exclusive cases, and it's not a "wake up first
> entry" for the exclusive case. Both have to be conditional on the wait
> entry actually matching the page and bit in question.
>
> So no way to use swait, or any of the lockless queuing code in general
> (so we can't do some clever private wait-list using llist.h either).
>
> End result: it looks like you fairly fundamentally do need to use a
> lock over the whole list traversal (like the standard wait-queues),
> and then add a cursor entry like Tim's patch if dropping the lock in
> the middle.
>
> Anyway, looking at the old code, we *used* to limit the page wait hash
> table to 4k entries, and we used to have one hash table per memory
> zone.
>
> The per-zone thing didn't work at all for the generic bit-waitqueues,
> because of how people used them on virtual addresses on the stack.
>
> But it *could* work for the page waitqueues, which are now a totally
> separate entity, and is obviously always physically addressed (since
> the indexing is by "struct page" pointer), and doesn't have that
> issue.
>
> So I guess we could re-introduce the notion of per-zone page waitqueue
> hash tables. It was disgusting to allocate and free though (and hooked
> into the memory hotplug code).
>
> So I'd still hope that we can instead just have one larger hash table,
> and that is sufficient for the problem.

If increasing the hash table size fixes the problem, I am wondering if
rhashtables might be the proper solution to this problem. They start
out small and then grow as needed.

Eric
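
To make the suggestion concrete: a resizing rhashtable keyed by the page
pointer could stand in for the fixed array and grow under load. The sketch
below is purely illustrative of what the setup would look like; the
pwq_entry structure and all names are invented, only the rhashtable API
itself is real:

#include <linux/wait.h>
#include <linux/rhashtable.h>

struct pwq_entry {				/* hypothetical per-page wait bucket */
	struct rhash_head node;
	struct page *key;
	wait_queue_head_t wq;
};

static const struct rhashtable_params pwq_params = {
	.key_len		= sizeof(struct page *),
	.key_offset		= offsetof(struct pwq_entry, key),
	.head_offset		= offsetof(struct pwq_entry, node),
	.automatic_shrinking	= true,
};

static struct rhashtable pwq_table;

static int __init pwq_table_init(void)
{
	/* the table starts small and is rehashed/grown automatically as entries are added */
	return rhashtable_init(&pwq_table, &pwq_params);
}

Lookups would then go through rhashtable_lookup_fast(&pwq_table, &page,
pwq_params), with an entry inserted on first wait and removed when the
last waiter leaves; whether that bookkeeping beats simply enlarging the
fixed table is left open in this thread.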

2017-08-17 16:18:07

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk


> On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen <[email protected]>
> wrote:
> > We encountered workloads that have very long wake up list on large
> > systems. A waker takes a long time to traverse the entire wake list
> > and execute all the wake functions.
> >
> > We saw page wait list that are up to 3700+ entries long in tests of
> > large
> > 4 and 8 socket systems. It took 0.8 sec to traverse such list during
> > wake up. Any other CPU that contends for the list spin lock will spin
> > for a long time. As page wait list is shared by many pages so it
> > could get very long on systems with large memory.
>
> I really dislike this patch.
>
> The patch seems a band-aid for really horrible kernel behavior, rather than
> fixing the underlying problem itself.
>
> Now, it may well be that we do end up needing this band-aid in the end, so
> this isn't a NAK of the patch per se. But I'd *really* like to see if we can fix the
> underlying cause for what you see somehow..
>
> In particular, if this is about the page wait table, maybe we can just make the
> wait table bigger. IOW, are people actually waiting on the
> *same* page, or are they mainly waiting on totally different pages, just
> hashing to the same wait queue?
>
> Because right now that page wait table is a small fixed size, and the only
> reason it's a small fixed size is that nobody reported any issues with it -
> particularly since we now avoid the wait table entirely for the common cases
> by having that "contention" bit.
>
> But it really is a *small* table. We literally have
>
> #define PAGE_WAIT_TABLE_BITS 8
>
> so it's just 256 entries. We could easily it much bigger, if we are actually
> seeing a lot of collissions.
>
> We *used* to have a very complex per-zone thing for bit-waitiqueues, but
> that was because we got lots and lots of contention issues, and everybody
> *always* touched the wait-queues whether they waited or not (so being per-
> zone was a big deal)
>
> We got rid of all that per-zone complexity when the normal case didn't hit in
> the page wait queues at all, but we may have over-done the simplification a
> bit since nobody showed any issue.
>
> In particular, we used to size the per-zone thing by amount of memory.
> We could easily re-introduce that for the new simpler page queues.
>
> The page_waitiqueue() is a simple helper function inside mm/filemap.c, and
> thanks to the per-page "do we have actual waiters" bit that we have now, we
> can actually afford to make it bigger and more complex now if we want to.
>
> What happens to your load if you just make that table bigger? You can
> literally test by just changing the constant from 8 to 16 or something, making
> us use twice as many bits for hashing. A "real"
> patch would size it by amount of memory, but just for testing the contention
> on your load, you can do the hacky one-liner.

Hi Linus,

We tried both 12- and 16-bit tables and that didn't make a difference.
The long wake ups are mostly on the same page when we do instrumentation.

Here is the wake_up_page_bit call stack when the workaround is running, which
is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 374 of event 'probe:wake_up_page_bit'
# Event count (approx.): 374
#
# Overhead Trace output
# ........ ..................
#
100.00% (ffffffffae1ad000)
|
---wake_up_page_bit
|
|--49.73%--migrate_misplaced_transhuge_page
| do_huge_pmd_numa_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--28.07%--0x2b7b7
| | |
| | |--13.64%--0x127a2
| | | 0x7fb5247eddc5
| | |
| | |--13.37%--0x127d8
| | | 0x7fb5247eddc5
| | |
| | |--0.53%--0x1280e
| | | 0x7fb5247eddc5
| | |
| | --0.27%--0x12844
| | 0x7fb5247eddc5
| |
| |--18.18%--0x2b788
| | |
| | |--14.97%--0x127a2
| | | 0x7fb5247eddc5
| | |
| | |--1.34%--0x1287a
| | | 0x7fb5247eddc5
| | |
| | |--0.53%--0x128b0
| | | 0x7fb5247eddc5
| | |
| | |--0.53%--0x1280e
| | | 0x7fb5247eddc5
| | |
| | |--0.53%--0x127d8
| | | 0x7fb5247eddc5
| | |
| | --0.27%--0x12844
| | 0x7fb5247eddc5
| |
| |--1.07%--0x2b823
| | |
| | |--0.53%--0x127a2
| | | 0x7fb5247eddc5
| | |
| | |--0.27%--0x1287a
| | | 0x7fb5247eddc5
| | |
| | --0.27%--0x127d8
| | 0x7fb5247eddc5
| |
| |--0.80%--0x2b88f
| | |
| | --0.53%--0x127d8
| | 0x7fb5247eddc5
| |
| |--0.80%--0x2b7f4
| | |
| | |--0.53%--0x127d8
| | | 0x7fb5247eddc5
| | |
| | --0.27%--0x127a2
| | 0x7fb5247eddc5
| |
| |--0.53%--0x2b8fb
| | 0x127a2
| | 0x7fb5247eddc5
| |
| --0.27%--0x2b8e9
| 0x127a2
| 0x7fb5247eddc5
|
|--44.12%--__handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--30.75%--_dl_relocate_object
| | dl_main
| | _dl_sysdep_start
| | 0x40
| |
| --13.37%--memset
| _dl_map_object
| |
| |--2.94%--_etext
| |
| |--0.80%--0x7f34ea294b08
| | 0
| |
| |--0.80%--0x7f1d5fa64b08
| | 0
| |
| |--0.53%--0x7fd4c83dbb08
| | 0
| |
| |--0.53%--0x7efe3724cb08
| | 0
| |
| |--0.27%--0x7ff2cf0b69c0
| | 0
| |
| |--0.27%--0x7fc9bc22cb08
| | 0
| |
| |--0.27%--0x7fc432971058
| | 0
| |
| |--0.27%--0x7faf21ec2b08
| | 0
| |
| |--0.27%--0x7faf21ec2640
| | 0
| |
| |--0.27%--0x7f940f08e058
| | 0
| |
| |--0.27%--0x7f4b84122640
| | 0
| |
| |--0.27%--0x7f42c8fd7fd8
| | 0
| |
| |--0.27%--0x7f3f15778fd8
| | 0
| |
| |--0.27%--0x7f3f15776058
| | 0
| |
| |--0.27%--0x7f34ea27dfd8
| | 0
| |
| |--0.27%--0x7f34ea27b058
| | 0
| |
| |--0.27%--0x7f2a0409bb08
| | 0
| |
| |--0.27%--0x7f2a04084fd8
| | 0
| |
| |--0.27%--0x7f2a04082058
| | 0
| |
| |--0.27%--0x7f1949633b08
| | 0
| |
| |--0.27%--0x7f194961cfd8
| | 0
| |
| |--0.27%--0x7f1629f87b08
| | 0
| |
| |--0.27%--0x7f1629f70fd8
| | 0
| |
| |--0.27%--0x7f1629f6e058
| | 0
| |
| |--0.27%--0x7f060696eb08
| | 0
| |
| |--0.27%--0x7f04ac14c9c0
| | 0
| |
| |--0.27%--0x7efe8b4bbb08
| | 0
| |
| |--0.27%--0x7efe8b4a59c0
| | 0
| |
| |--0.27%--0x7efe8b4a4fd8
| | 0
| |
| |--0.27%--0x7efe8b4a2058
| | 0
| |
| |--0.27%--0x7efcd0c70b08
| | 0
| |
| |--0.27%--0x207ad8
| | 0
| |
| --0.27%--0x206b30
| 0
|
|--2.14%--filemap_map_pages
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--0.53%--_IO_vfscanf
| | |
| | |--0.27%--0x6563697665442055
| | |
| | --0.27%--_IO_vsscanf
| | 0x6563697665442055
| |
| |--0.53%--_dl_map_object_from_fd
| | _dl_map_object
| | |
| | |--0.27%--0x7faf21ec2640
| | | 0
| | |
| | --0.27%--_etext
| |
| |--0.27%--__libc_enable_asynccancel
| | __fopen_internal
| | 0x6d6f6f2f30373635
| |
| |--0.27%--vfprintf
| | _IO_vsprintf
| | 0x4
| |
| |--0.27%--0x1fb40
| | 0x41d589495541f689
| |
| --0.27%--memset@plt
|
|--1.87%--do_huge_pmd_numa_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--0.80%--0x2b7b7
| | 0x127d8
| | 0x7fb5247eddc5
| |
| |--0.80%--0x2b788
| | 0x127a2
| | 0x7fb5247eddc5
| |
| --0.27%--0x2b918
| 0x127d8
| 0x7fb5247eddc5
|
|--1.87%--migrate_pages
| migrate_misplaced_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--1.07%--0x2b7b7
| | |
| | |--0.80%--0x127d8
| | | 0x7fb5247eddc5
| | |
| | --0.27%--0x1287a
| | 0x7fb5247eddc5
| |
| --0.80%--0x2b788
| 0x127a2
| 0x7fb5247eddc5
|
--0.27%--do_wp_page
__handle_mm_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
_IO_link_in

Thanks,
Kan

2017-08-17 16:25:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Thu, Aug 17, 2017 at 9:17 AM, Liang, Kan <[email protected]> wrote:
>
> We tried both 12 and 16 bit table and that didn't make a difference.
> The long wake ups are mostly on the same page when we do instrumentation

Ok.

> Here is the wake_up_page_bit call stack when the workaround is running, which
> is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10

It's actually not really wake_up_page_bit() that is all that
interesting; it would be more interesting to see which path it is that
*adds* the entries.

So it's mainly wait_on_page_bit_common(), but also add_page_wait_queue().

Can you get that call stack instead (or in addition to)?

Linus

2017-08-17 20:18:48

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk



> > Here is the wake_up_page_bit call stack when the workaround is running,
> which
> > is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10
>
> It's actually not really wake_up_page_bit() that is all that
> interesting, it would be more interesting to see which path it is that
> *adds* the entries.
>
> So it's mainly wait_on_page_bit_common(), but also
> add_page_wait_queue().
>
> Can you get that call stack instead (or in addition to)?
>

Here is the call stack of wait_on_page_bit_common
when the queue is long (entries >1000).

# Overhead Trace output
# ........ ..................
#
100.00% (ffffffff931aefca)
|
---wait_on_page_bit
__migration_entry_wait
migration_entry_wait
do_swap_page
__handle_mm_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
|
|--21.89%--0x123a2
| start_thread
|
|--21.64%--0x12352
| start_thread
|
|--20.90%--_int_free
| |
| --20.44%--0
|
|--7.34%--0x127a9
| start_thread
|
|--6.84%--0x127df
| start_thread
|
|--6.65%--0x12205
| 0x1206d
| 0x11f85
| 0x11a05
| 0x10302
| |
| --6.62%--0xa8ee
| |
| --5.22%--0x3af5
| __libc_start_main
|
|--5.40%--0x1284b
| start_thread
|
|--3.14%--0x12881
| start_thread
|
|--3.02%--0x12773
| start_thread
|
--2.97%--0x12815
start_thread

Thanks,
Kan

2017-08-17 20:44:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <[email protected]> wrote:
>
> Here is the call stack of wait_on_page_bit_common
> when the queue is long (entries >1000).
>
> # Overhead Trace output
> # ........ ..................
> #
> 100.00% (ffffffff931aefca)
> |
> ---wait_on_page_bit
> __migration_entry_wait
> migration_entry_wait
> do_swap_page
> __handle_mm_fault
> handle_mm_fault
> __do_page_fault
> do_page_fault
> page_fault

Hmm. Ok, so it does seem to very much be related to migration. Your
wake_up_page_bit() profile made me suspect that, but this one seems to
pretty much confirm it.

So it looks like it's that wait_on_page_locked() thing in
__migration_entry_wait(), and what probably happens is that your load
ends up triggering a lot of migration (or just migration of a very hot
page), and then *every* thread ends up waiting for whatever page
ended up getting migrated.

And so the wait queue for that page grows hugely long.

Looking at the other profile, the thing that is locking the page (that
everybody then ends up waiting on) would seem to be
migrate_misplaced_transhuge_page(), so this is _presumably_ due to
NUMA balancing.

Does the problem go away if you disable the NUMA balancing code?

Adding Mel and Kirill to the participants, just to make them aware of
the issue, and just because their names show up when I look at blame.

Linus
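
For reference, the wait Linus is pointing at sits at the end of
__migration_entry_wait() in mm/migrate.c; paraphrased from the 4.13-era
code (details may differ slightly):

void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl)
{
	pte_t pte;
	swp_entry_t entry;
	struct page *page;

	spin_lock(ptl);
	pte = *ptep;
	if (!is_swap_pte(pte))
		goto out;
	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry))
		goto out;

	page = migration_entry_to_page(entry);
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	wait_on_page_locked(page);	/* every faulting thread sleeps on page_waitqueue(page) */
	put_page(page);
	return;
out:
	pte_unmap_unlock(ptep, ptl);
}

Since every thread that faults on the page being migrated funnels through
wait_on_page_locked() on the same page, the wait queue length is bounded
only by the number of threads hitting that page, which matches Mel's
observation in the next reply.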

2017-08-18 12:23:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <[email protected]> wrote:
> >
> > Here is the call stack of wait_on_page_bit_common
> > when the queue is long (entries >1000).
> >
> > # Overhead Trace output
> > # ........ ..................
> > #
> > 100.00% (ffffffff931aefca)
> > |
> > ---wait_on_page_bit
> > __migration_entry_wait
> > migration_entry_wait
> > do_swap_page
> > __handle_mm_fault
> > handle_mm_fault
> > __do_page_fault
> > do_page_fault
> > page_fault
>
> Hmm. Ok, so it does seem to very much be related to migration. Your
> wake_up_page_bit() profile made me suspect that, but this one seems to
> pretty much confirm it.
>
> So it looks like that wait_on_page_locked() thing in
> __migration_entry_wait(), and what probably happens is that your load
> ends up triggering a lot of migration (or just migration of a very hot
> page), and then *every* thread ends up waiting for whatever page that
> ended up getting migrated.
>

Agreed.

> And so the wait queue for that page grows hugely long.
>

It's basically only bounded by the maximum number of threads that can exist.

> Looking at the other profile, the thing that is locking the page (that
> everybody then ends up waiting on) would seem to be
> migrate_misplaced_transhuge_page(), so this is _presumably_ due to
> NUMA balancing.
>

Yes, migrate_misplaced_transhuge_page requires NUMA balancing to be part
of the picture.

> Does the problem go away if you disable the NUMA balancing code?
>
> Adding Mel and Kirill to the participants, just to make them aware of
> the issue, and just because their names show up when I look at blame.
>

I'm not imagining a way of dealing with this that would reliably detect
when there are a large number of waiters without adding a mess. We could
adjust the scanning rate to reduce the problem but it would be difficult
to target properly and wouldn't prevent the problem occurring with the
added hassle that it would now be intermittent.

Assuming the problem goes away by disabling NUMA balancing, it would be nice
if it could be determined whether the page lock holder is trying to allocate a page
when the queue is huge. That is part of the operation that potentially
takes a long time and may be why so many callers are stacking up. If
so, I would suggest clearing __GFP_DIRECT_RECLAIM from the GFP flags in
migrate_misplaced_transhuge_page and assume that a remote hit is always
going to be cheaper than compacting memory to successfully allocate a
THP. That may be worth doing unconditionally because we'd have to save a
*lot* of remote misses to offset compaction cost.

Nothing fancy other than needing a comment if it works.

diff --git a/mm/migrate.c b/mm/migrate.c
index 627671551873..87b0275ddcdb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1926,7 +1926,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
goto out_dropref;

new_page = alloc_pages_node(node,
- (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
+ (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM,
HPAGE_PMD_ORDER);
if (!new_page)
goto out_fail;

--
Mel Gorman
SUSE Labs

2017-08-18 13:06:24

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk



> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <[email protected]> wrote:
> >
> > Here is the call stack of wait_on_page_bit_common when the queue is
> > long (entries >1000).
> >
> > # Overhead Trace output
> > # ........ ..................
> > #
> > 100.00% (ffffffff931aefca)
> > |
> > ---wait_on_page_bit
> > __migration_entry_wait
> > migration_entry_wait
> > do_swap_page
> > __handle_mm_fault
> > handle_mm_fault
> > __do_page_fault
> > do_page_fault
> > page_fault
>
> Hmm. Ok, so it does seem to very much be related to migration. Your
> wake_up_page_bit() profile made me suspect that, but this one seems to
> pretty much confirm it.
>
> So it looks like that wait_on_page_locked() thing in __migration_entry_wait(),
> and what probably happens is that your load ends up triggering a lot of
> migration (or just migration of a very hot page), and then *every* thread
> ends up waiting for whatever page that ended up getting migrated.
>
> And so the wait queue for that page grows hugely long.
>
> Looking at the other profile, the thing that is locking the page (that everybody
> then ends up waiting on) would seem to be
> migrate_misplaced_transhuge_page(), so this is _presumably_ due to NUMA
> balancing.
>
> Does the problem go away if you disable the NUMA balancing code?
>

Yes, the problem goes away when NUMA balancing is disabled.


Thanks,
Kan

2017-08-18 14:20:45

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk



> On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <[email protected]> wrote:
> > >
> > > Here is the call stack of wait_on_page_bit_common when the queue is
> > > long (entries >1000).
> > >
> > > # Overhead Trace output
> > > # ........ ..................
> > > #
> > > 100.00% (ffffffff931aefca)
> > > |
> > > ---wait_on_page_bit
> > > __migration_entry_wait
> > > migration_entry_wait
> > > do_swap_page
> > > __handle_mm_fault
> > > handle_mm_fault
> > > __do_page_fault
> > > do_page_fault
> > > page_fault
> >
> > Hmm. Ok, so it does seem to very much be related to migration. Your
> > wake_up_page_bit() profile made me suspect that, but this one seems to
> > pretty much confirm it.
> >
> > So it looks like that wait_on_page_locked() thing in
> > __migration_entry_wait(), and what probably happens is that your load
> > ends up triggering a lot of migration (or just migration of a very hot
> > page), and then *every* thread ends up waiting for whatever page that
> > ended up getting migrated.
> >
>
> Agreed.
>
> > And so the wait queue for that page grows hugely long.
> >
>
> It's basically only bounded by the maximum number of threads that can exist.
>
> > Looking at the other profile, the thing that is locking the page (that
> > everybody then ends up waiting on) would seem to be
> > migrate_misplaced_transhuge_page(), so this is _presumably_ due to
> > NUMA balancing.
> >
>
> Yes, migrate_misplaced_transhuge_page requires NUMA balancing to be
> part of the picture.
>
> > Does the problem go away if you disable the NUMA balancing code?
> >
> > Adding Mel and Kirill to the participants, just to make them aware of
> > the issue, and just because their names show up when I look at blame.
> >
>
> I'm not imagining a way of dealing with this that would reliably detect when
> there are a large number of waiters without adding a mess. We could adjust
> the scanning rate to reduce the problem but it would be difficult to target
> properly and wouldn't prevent the problem occurring with the added hassle
> that it would now be intermittent.
>
> Assuming the problem goes away by disabling NUMA then it would be nice if
> it could be determined that the page lock holder is trying to allocate a page
> when the queue is huge. That is part of the operation that potentially takes a
> long time and may be why so many callers are stacking up. If so, I would
> suggest clearing __GFP_DIRECT_RECLAIM from the GFP flags in
> migrate_misplaced_transhuge_page and assume that a remote hit is always
> going to be cheaper than compacting memory to successfully allocate a THP.
> That may be worth doing unconditionally because we'd have to save a
> *lot* of remote misses to offset compaction cost.
>
> Nothing fancy other than needing a comment if it works.
>

No, the patch doesn't work.

Thanks,
Kan

> diff --git a/mm/migrate.c b/mm/migrate.c index
> 627671551873..87b0275ddcdb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1926,7 +1926,7 @@ int migrate_misplaced_transhuge_page(struct
> mm_struct *mm,
> goto out_dropref;
>
> new_page = alloc_pages_node(node,
> - (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> + (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) &
> ~__GFP_DIRECT_RECLAIM,
> HPAGE_PMD_ORDER);
> if (!new_page)
> goto out_fail;
>
> --
> Mel Gorman
> SUSE Labs

2017-08-18 14:46:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 02:20:38PM +0000, Liang, Kan wrote:
> > Nothing fancy other than needing a comment if it works.
> >
>
> No, the patch doesn't work.
>

That indicates that it may be a hot page and it's possible that the page is
locked for a short time but waiters accumulate. What happens if you leave
NUMA balancing enabled but disable THP? Waiting on migration entries also
uses wait_on_page_locked so it would be interesting to know if the problem
is specific to THP.

Can you tell me what this workload is doing? I want to see if it's something
like many threads pounding on a limited number of pages very quickly. If
it's many threads working on private data, it would also be important to
know how each thread's buffers are aligned, particularly if the buffers
are smaller than a THP or base page size. For example, if each thread is
operating on a base page sized buffer then disabling THP would side-step
the problem but THP would cause false sharing between multiple threads.


--
Mel Gorman
SUSE Labs

2017-08-18 16:36:14

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On 08/18/2017 07:46 AM, Mel Gorman wrote:
> On Fri, Aug 18, 2017 at 02:20:38PM +0000, Liang, Kan wrote:
>>> Nothing fancy other than needing a comment if it works.
>>>
>>
>> No, the patch doesn't work.
>>
>
> That indicates that it may be a hot page and it's possible that the page is
> locked for a short time but waiters accumulate. What happens if you leave
> NUMA balancing enabled but disable THP? Waiting on migration entries also
> uses wait_on_page_locked so it would be interesting to know if the problem
> is specific to THP.
>
> Can you tell me what this workload is doing? I want to see if it's something
> like many threads pounding on a limited number of pages very quickly. If

It is a customer workload so we have limited visibility. But we believe there are
some pages that are frequently accessed by all threads.

> it's many threads working on private data, it would also be important to
> know how each buffers threads are aligned, particularly if the buffers
> are smaller than a THP or base page size. For example, if each thread is
> operating on a base page sized buffer then disabling THP would side-step
> the problem but THP would be false sharing between multiple threads.
>

Still, I don't think this problem is THP specific. If there is a hot
regular page getting migrated, we'll also see many threads get
queued up quickly. THP may have made the problem worse as migrating
it takes a longer time, meaning more threads could get queued up.

Thanks.

Tim

2017-08-18 16:45:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

> Still, I don't think this problem is THP specific. If there is a hot
> regular page getting migrated, we'll also see many threads get
> queued up quickly. THP may have made the problem worse as migrating
> it takes a longer time, meaning more threads could get queued up.

Also THP probably makes more threads collide because the pages are larger.
But still it can all happen even without THP.

-Andi

2017-08-18 16:53:35

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk



> On Fri, Aug 18, 2017 at 02:20:38PM +0000, Liang, Kan wrote:
> > > Nothing fancy other than needing a comment if it works.
> > >
> >
> > No, the patch doesn't work.
> >
>
> That indicates that it may be a hot page and it's possible that the page is
> locked for a short time but waiters accumulate. What happens if you leave
> NUMA balancing enabled but disable THP?

No, disabling THP doesn't help the case.

Thanks,
Kan

> Waiting on migration entries also
> uses wait_on_page_locked so it would be interesting to know if the problem
> is specific to THP.
>
> Can you tell me what this workload is doing? I want to see if it's something
> like many threads pounding on a limited number of pages very quickly. If it's
> many threads working on private data, it would also be important to know
> how each buffers threads are aligned, particularly if the buffers are smaller
> than a THP or base page size. For example, if each thread is operating on a
> base page sized buffer then disabling THP would side-step the problem but
> THP would be false sharing between multiple threads.
>
>
> --
> Mel Gorman
> SUSE Labs

2017-08-18 16:55:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 5:23 AM, Mel Gorman <[email protected]> wrote:
>
> new_page = alloc_pages_node(node,
> - (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> + (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM,
> HPAGE_PMD_ORDER);

That can't make any difference. We already have:

#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
__GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)

and that "& ~__GFP_RECLAIM" is removing __GFP_DIRECT_RECLAIM.

So that patch is a no-op, afaik.

Is there something else expensive in there?

It *might* be simply that we have a shit-ton of threads, and the
thread that holds the page lock for migration is just preempted out,
even if it doesn't really do anything particularly expensive.

Linus

2017-08-18 17:48:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan <[email protected]> wrote:
>
>> On Fri, Aug 18, 2017 Mel Gorman wrote:
>>
>> That indicates that it may be a hot page and it's possible that the page is
>> locked for a short time but waiters accumulate. What happens if you leave
>> NUMA balancing enabled but disable THP?
>
> No, disabling THP doesn't help the case.

Interesting. That particular code sequence should only be active for
THP. What does the profile look like with THP disabled but with NUMA
balancing still enabled?

Just asking because maybe that different call chain could give us some
other ideas of what the commonality here is that triggers our
behavioral problem.

I was really hoping that we'd root-cause this and have a solution (and
then apply Tim's patch as a "belt and suspenders" kind of thing), but
it's starting to smell like we may have to apply Tim's patch as a
band-aid, and try to figure out what the trigger is longer-term.

Linus

2017-08-18 18:55:01

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 10:48:23AM -0700, Linus Torvalds wrote:
> On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan <[email protected]> wrote:
> >
> >> On Fri, Aug 18, 2017 Mel Gorman wrote:
> >>
> >> That indicates that it may be a hot page and it's possible that the page is
> >> locked for a short time but waiters accumulate. What happens if you leave
> >> NUMA balancing enabled but disable THP?
> >
> > No, disabling THP doesn't help the case.
>
> Interesting. That particular code sequence should only be active for
> THP. What does the profile look like with THP disabled but with NUMA
> balancing still enabled?
>

While that specific code sequence is active in the example, the problem is
fundamental to what NUMA balancing does. If many threads share a single page,
base page or THP, then any thread accessing the data during migration will
block on the page lock. The symptoms will be different but I am willing to
bet it'll be a wake on a page lock either way. NUMA balancing is somewhat
unique in that it's relatively easy to have lots of threads depend on a
single page's lock.

> Just asking because maybe that different call chain could give us some
> other ideas of what the commonality here is that triggers out
> behavioral problem.
>

I am reasonably confident that the commonality is multiple threads sharing
a page. Maybe it's a single hot structure that is shared between threads.
Maybe it's parallel buffers that are aligned on a sub-page boundary.
Multiple threads accessing buffers aligned by cache lines would do it,
which is reasonable behaviour for a parallelised compute load, for example.
If I'm right, writing a test case for it is straightforward and I'll get
to it on Monday when I'm back near my normal work machine.

The initial guess that it may be allocation latency was obviously way
off. I didn't follow through properly but the fact it's not THP specific
means the length of time it takes to migrate is possibly irrelevant. If
the page is hot enough, threads will block once migration starts even if
the migration completes quickly.

> I was really hoping that we'd root-cause this and have a solution (and
> then apply Tim's patch as a "belt and suspenders" kind of thing), but
> it's starting to smell like we may have to apply Tim's patch as a
> band-aid, and try to figure out what the trigger is longer-term.
>

I believe the trigger is simply because a hot page gets unmapped and
then threads lock on it.

One option to mitigate (but not eliminate) the problem is to record when
the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to
task_numa_fault(). For each time it's passed in, shift numa_scan_period
<< 1 which will slow the scanner and reduce the frequency contention
occurs at. If it's heavily contended, the period will quickly reach
numa_scan_period_max. That is simple with the caveat that a single hot
contended page will slow all balancing. The main problem is that this
mitigates and not eliminates the problem. No matter how slow the scanner
is, it'll still hit the case where many threads contend on a single page.
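
As a very rough sketch of that first option (TNF_PAGE_CONTENDED is the new
flag being proposed; both hook points below are guesses, not a tested patch):

	/* hypothetical caller side, e.g. in do_numa_page(): report that the
	 * page lock had waiters when the NUMA hinting fault was handled */
	if (PageWaiters(page))
		flags |= TNF_PAGE_CONTENDED;
	task_numa_fault(last_cpupid, page_nid, 1, flags);

	/* hypothetical handling in task_numa_fault(): back off the scanner,
	 * capping at the per-task maximum scan period */
	if (flags & TNF_PAGE_CONTENDED)
		p->numa_scan_period = min(p->numa_scan_period << 1,
					  p->numa_scan_period_max);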

An alternative is to set a VMA flag on VMAs if many contentions are
detected and stop scanning that VMA entirely. It would need a VMA flag
which right now might mean making vm_flags u64 and increasing the size of
vm_area_struct on 32-bit. The downside is that it is permanent. If heavy
contention happens then scanning that VMA is disabled for the lifetime
of the process because there would no longer be a way to detect that
re-enabling is appropriate.

A variation would be to record contentions in struct numa_group and return
false in should_numa_migrate_memory if contentions are high and scale it
down over time. It wouldn't be perfect as processes sharing hot pages are
not guaranteed to have a numa_group in common but it may be sufficient.

--
Mel Gorman
SUSE Labs

2017-08-18 19:14:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman
<[email protected]> wrote:
>
> One option to mitigate (but not eliminate) the problem is to record when
> the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to
> task_numa_fault().

Well, finding it contended is fairly easy - just look at the page wait
queue, and if it's not empty, assume it's due to contention.

I also wonder if we could be even *more* hacky, and in the whole
__migration_entry_wait() path, change the logic from:

- wait on page lock before retrying the fault

to

- yield()

which is hacky, but there's a rationale for it:

(a) avoid the crazy long wait queues ;)

(b) we know that migration is *supposed* to be CPU-bound (not IO
bound), so yielding the CPU and retrying may just be the right thing
to do.

It's possible that we could just do a hybrid approach, and introduce a
"wait_on_page_lock_or_yield()", that does a sleeping wait if the
wait-queue is short, and a yield otherwise, but it might be worth just
testing the truly stupid patch.
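
A minimal sketch of such a hybrid, using the PG_waiters bit as the "wait
queue is not empty" signal (purely illustrative, not a real proposal):

	static inline void wait_on_page_lock_or_yield(struct page *page)
	{
		/* somebody is already queued up: give the CPU away instead */
		if (PageWaiters(compound_head(page)))
			yield();
		else
			wait_on_page_locked(page);
	}

Callers in the migration-wait paths would then retry the fault as they
already do today.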

Because that code sequence doesn't actually depend on
"wait_on_page_lock()" for _correctness_ anyway, afaik. Anybody who
does "migration_entry_wait()" _has_ to retry anyway, since the page
table contents may have changed by waiting.

So I'm not proud of the attached patch, and I don't think it's really
acceptable as-is, but maybe it's worth testing? And maybe it's
arguably no worse than what we have now?

Comments?

(Yeah, if we take this approach, we might even just say "screw the
spinlock - just do ACCESS_ONCE() and do a yield() if it looks like a
migration entry")
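
Spelled out, that last variant would be something like this inside
__migration_entry_wait(), with READ_ONCE() standing in for ACCESS_ONCE()
(a sketch only):

	pte_t pte = READ_ONCE(*ptep);

	/* no ptl taken: if it still looks like a migration entry, just
	 * yield and let the fault be retried */
	if (is_swap_pte(pte) && is_migration_entry(pte_to_swp_entry(pte))) {
		yield();
		return;
	}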

Linus


Attachments:
patch.diff (1.00 kB)

2017-08-18 19:58:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

> which is hacky, but there's a rationale for it:
>
> (a) avoid the crazy long wait queues ;)
>
> (b) we know that migration is *supposed* to be CPU-bound (not IO
> bound), so yielding the CPU and retrying may just be the right thing
> to do.

So this would degenerate into a spin when the contention is with
other CPUs?

But then if we guarantee that migration has a flat latency curve
and no long tail, it may be reasonable.

If the contention is with the local CPU it could cause some
unfairness (and in theory priority inheritance issues with local
CPU contenders?), but hopefully not too bad.

-Andi

2017-08-18 20:05:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

> I was really hoping that we'd root-cause this and have a solution (and
> then apply Tim's patch as a "belt and suspenders" kind of thing), but

One thing I wanted to point out is that Tim's patch seems to make
several schedule intensive micro benchmarks faster.

I think what's happening is that it allows more parallelism during wakeup:

Normally it's like

CPU 1                         CPU 2                      CPU 3 .....

LOCK
wake up tasks on other CPUs   woken up                   woken up
UNLOCK                        SPIN on waitq lock         SPIN on waitq lock
                              LOCK
                              remove waitq
                              UNLOCK
                                                         LOCK
                                                         remove waitq
                                                         UNLOCK

So everything is serialized.

But with the bookmark patch the other CPUs can go through the "remove waitq" sequence
earlier because they have a chance to get a go at the lock and do it in parallel
with the main wakeup.

Tim used a 64-task threshold for the bookmark. That may actually be too large.
It may even be faster to use a shorter one.

So I think it's more than a bandaid, but likely a useful performance improvement
even for less extreme wait queues.

-Andi

2017-08-18 20:10:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 12:58 PM, Andi Kleen <[email protected]> wrote:
>> which is hacky, but there's a rationale for it:
>>
>> (a) avoid the crazy long wait queues ;)
>>
>> (b) we know that migration is *supposed* to be CPU-bound (not IO
>> bound), so yielding the CPU and retrying may just be the right thing
>> to do.
>
> So this would degenerate into a spin when the contention is with
> other CPUs?
>
> But then if we guarantee that migration has flat latency curve
> and no long tail it may be reasonable.

Honestly, right now I'd say it's more of a "patch meant purely for
testing with some weak-ass excuse for why it might not be broken".

Linus

2017-08-18 20:29:35

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk

> >>
> >> That indicates that it may be a hot page and it's possible that the
> >> page is locked for a short time but waiters accumulate. What happens
> >> if you leave NUMA balancing enabled but disable THP?
> >
> > No, disabling THP doesn't help the case.
>
> Interesting. That particular code sequence should only be active for THP.
> What does the profile look like with THP disabled but with NUMA balancing
> still enabled?

Here is the profiling with THP disabled for wait_on_page_bit_common and
wake_up_page_bit.


The call stack of wait_on_page_bit_common
# Overhead Trace output
# ........ ..................
#
100.00% (ffffffff821aefca)
|
---wait_on_page_bit
__migration_entry_wait
migration_entry_wait
do_swap_page
__handle_mm_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
|
|--24.28%--_int_free
| |
| --24.15%--0
|
|--15.48%--0x2b788
| |
| --15.47%--0x127a2
| start_thread
|
|--13.54%--0x2b7b7
| |
| |--8.68%--0x127a2
| | start_thread
| |
| --4.86%--0x127d8
| start_thread
|
|--11.69%--0x123a2
| start_thread
|
|--6.30%--0x12205
| 0x1206d
| 0x11f85
| 0x11a05
| 0x10302
| |
| --6.27%--0xa8ee
| |
| --5.48%--0x3af5
| |
| --5.43%--__libc_start_main
|
|--5.24%--0x12352
| start_thread
|
|--3.56%--0x127bc
| |
| --3.55%--start_thread
|
|--3.06%--0x127a9
| start_thread
|
|--3.05%--0x127f2
| |
| --3.05%--start_thread
|
|--2.62%--0x127df
| start_thread
|
|--2.35%--0x1285e
| start_thread
|
|--1.86%--0x1284b
| start_thread
|
|--1.23%--0x12894
| start_thread
|
|--1.23%--0x12828
| start_thread
|
|--1.12%--0x1233c
| start_thread
|
|--1.02%--0x12881
| start_thread
|
|--0.99%--0x12773
| start_thread
|
--0.97%--0x12815
start_thread


The profile of wake_up_page_bit is still a 10 sec sample.
# Samples: 5K of event 'probe:wake_up_page_bit'
# Event count (approx.): 5645
#
# Overhead Trace output
# ........ ..................
#
100.00% (ffffffff821ad000)
|
---wake_up_page_bit
|
|--50.89%--do_wp_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--38.97%--_dl_fixup
| | |
| | |--16.88%--0x7f933d9f2e40
| | | 0
| | |
| | |--13.73%--0x7fb87a828e40
| | | 0
| | |
| | |--4.84%--0x7fed49202e40
| | | 0
| | |
| | |--0.87%--0x7fed491ffa50
| | | 0
| | |
| | |--0.73%--0x7f933d9efa50
| | | 0
| | |
| | --0.71%--0x7fed492024b0
| | 0
| |
| |--3.14%--_dl_fini
| | __run_exit_handlers
| | |
| | |--1.81%--0x7fb87994f2a0
| | | 0
| | |
| | |--0.71%--0x7fed483292a0
| | | 0
| | |
| | --0.62%--0x7f933cb192a0
| | 0
| |
| |--1.91%--0x6ad0
| | __run_exit_handlers
| | |
| | |--1.03%--0x7fb87994f2a0
| | | 0
| | |
| | --0.87%--0x7f933cb192a0
| | 0
| |
| |--1.52%--ped_disk_type_unregister
| | __run_exit_handlers
| | |
| | --1.06%--0x7fed483292a0
| | 0
| |
| |--1.06%--0xcd89
| | __run_exit_handlers
| | |
| | --0.51%--0x7fb87994f2a0
| | 0
| |
| |--1.05%--__offtime
| | 0
| |
| |--0.83%--0x45f9
| | __run_exit_handlers
| | |
| | --0.73%--0x7fed483292a0
| | 0
| |
| |--0.66%--0x10de8
| | __run_exit_handlers
| |
| --0.57%--0x3455
| __run_exit_handlers
|
|--45.85%--migrate_pages
| migrate_misplaced_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--12.21%--0x42f2
| | 0x11f77
| | 0x11a05
| | 0x10302
| | 0xa8ee
| | |
| | --9.44%--0x3af5
| | __libc_start_main
| |
| |--3.79%--_int_free
| | 0
| |
| |--2.69%--_dl_fini
| | __run_exit_handlers
| | |
| | |--1.17%--0x7f933cb192a0
| | | 0
| | |
| | |--0.85%--0x7fb87994f2a0
| | | 0
| | |
| | --0.67%--0x7fed483292a0
| | 0
| |
| |--2.57%--0x12205
| | 0x1206d
| | 0x11f85
| | 0x11a05
| | 0x10302
| | 0xa8ee
| | |
| | --1.98%--0x3af5
| | __libc_start_main
| |
| |--1.20%--_dl_fixup
| | |
| | |--0.71%--0x3af5
| | | __libc_start_main
| | |
| | --0.50%--_dl_fini
| | __run_exit_handlers
| |
| |--1.15%--do_lookup_x
| |
| |--0.99%--0xcc26
| | __run_exit_handlers
| |
| |--0.90%--ped_device_free_all
| | __run_exit_handlers
| |
| |--0.89%--__do_global_dtors_aux
| | __run_exit_handlers
| |
| |--0.89%--0x3448
| | __run_exit_handlers
| | |
| | --0.53%--0x7fb87994f2a0
| | 0
| |
| |--0.83%--0x25bc4
| | __run_exit_handlers
| |
| |--0.83%--check_match.9440
| | 0xae470
| |
| |--0.80%--0x30f0
| | __run_exit_handlers
| |
| |--0.73%--0x17a0
| | __run_exit_handlers
| |
| |--0.71%--0xcd60
| | __run_exit_handlers
| |
| |--0.71%--0x4754
| | __run_exit_handlers
| |
| |--0.69%--dm_get_suspended_counter@plt
| | __run_exit_handlers
| |
| |--0.60%--free@plt
| | 0
| |
| |--0.60%--0x1580
| | __run_exit_handlers
| |
| |--0.55%--__tz_compute
| | 0
| |
| |--0.55%--0x6020
| | __run_exit_handlers
| |
| |--0.55%--__do_global_dtors_aux
| | __run_exit_handlers
| |
| |--0.53%--0x25ae4
| | __run_exit_handlers
| |
| |--0.53%--0x11a16
| | 0x10302
| | 0xa8ee
| |
| |--0.53%--dm_get_suspended_counter
| | __run_exit_handlers
| |
| |--0.53%--ped_device_free_all@plt
| | __run_exit_handlers
| |
| |--0.53%--__do_global_dtors_aux
| | __run_exit_handlers
| |
| |--0.53%--0x1620
| | __run_exit_handlers
| |
| |--0.50%--__cxa_finalize@plt
| | _dl_fini
| | __run_exit_handlers
| |
| --0.50%--0x1910
| __run_exit_handlers
|
|--1.72%--filemap_map_pages
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
|
--1.54%--__handle_mm_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
|
|--0.69%--memset
| _dl_map_object
|
--0.64%--_dl_relocate_object
dl_main
_dl_sysdep_start
0x40


Thanks,
Kan



2017-08-18 20:29:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 1:05 PM, Andi Kleen <[email protected]> wrote:
>
> I think what's happening is that it allows more parallelism during wakeup:
>
> Normally it's like
>
> CPU 1 CPU 2 CPU 3 .....
>
> LOCK
> wake up tasks on other CPUs woken up woken up
> UNLOCK SPIN on waitq lock SPIN on waitq lock

Hmm. The processes that are woken up shouldn't need to touch the waitq
lock after wakeup. The default "autoremove_wake_function()" does the
wait list removal, so if you just use the normal wait/wakeup, you're
all done and don't need to do anything more.

That's very much by design.

In fact, it's why "finish_wait()" uses that "list_empty_careful()"
thing on the entry - exactly so that it only needs to take the wait
queue lock if it is still on the wait list (ie it was woken up by
something else).
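
For reference, the two pieces being described look roughly like this in
kernel/sched/wait.c (paraphrased from memory, shown only for context, not
as a change):

	int autoremove_wake_function(struct wait_queue_entry *wq_entry,
				     unsigned mode, int sync, void *key)
	{
		int ret = default_wake_function(wq_entry, mode, sync, key);

		/* dequeue here so the woken task need not take the lock */
		if (ret)
			list_del_init(&wq_entry->entry);
		return ret;
	}

	void finish_wait(struct wait_queue_head *wq_head,
			 struct wait_queue_entry *wq_entry)
	{
		unsigned long flags;

		__set_current_state(TASK_RUNNING);
		/* only take the waitqueue lock if we are still queued */
		if (!list_empty_careful(&wq_entry->entry)) {
			spin_lock_irqsave(&wq_head->lock, flags);
			list_del_init(&wq_entry->entry);
			spin_unlock_irqrestore(&wq_head->lock, flags);
		}
	}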

Now, it *is* racy, in the sense that the autoremove_wake_function()
will remove the entry *after* having successfully woken up the
process, so with bad luck and a quick wakeup, the woken process may
not see the good list_empty_careful() case.

So we really *should* do the remove earlier inside the pi_lock region
in ttwu(). We don't have that kind of interface, though. If you
actually do see tasks getting stuck on the waitqueue lock after being
woken up, it might be worth looking at, though.

The other possibility is that you were looking at cases that didn't
use "autoremove_wake_function()" at all, of course. Maybe they are
worth fixing. The auto-removal really does make a difference, exactly
because of the issue you point to.

Linus

2017-08-18 20:34:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 1:29 PM, Liang, Kan <[email protected]> wrote:
> Here is the profiling with THP disabled for wait_on_page_bit_common and
> wake_up_page_bit.
>
>
> The call stack of wait_on_page_bit_common
> # Overhead Trace output
> # ........ ..................
> #
> 100.00% (ffffffff821aefca)
> |
> ---wait_on_page_bit
> __migration_entry_wait
> migration_entry_wait
> do_swap_page

Ok, so it really is exactly the same thing, just for a regular page,
and there is absolutely nothing huge-page specific to this.

Thanks.

If you can test that (hacky, ugly) yield() patch, just to see how it
behaves (maybe it degrades performance horribly even if it then avoids
the long wait queues), that would be lovely.

Does the load actually have some way of measuring performance? Because
with the yield(), I'd hope that all the wait_on_page_bit() stuff is
all gone, but it might just *perform* horribly badly.

Linus

2017-08-21 18:32:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Fri, Aug 18, 2017 at 12:14:12PM -0700, Linus Torvalds wrote:
> On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman
> <[email protected]> wrote:
> >
> > One option to mitigate (but not eliminate) the problem is to record when
> > the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to
> > task_numa_fault().
>
> Well, finding it contended is fairly easy - just look at the page wait
> queue, and if it's not empty, assume it's due to contention.
>

Yes.

> I also wonder if we could be even *more* hacky, and in the whole
> __migration_entry_wait() path, change the logic from:
>
> - wait on page lock before retrying the fault
>
> to
>
> - yield()
>
> which is hacky, but there's a rationale for it:
>
> (a) avoid the crazy long wait queues ;)
>
> (b) we know that migration is *supposed* to be CPU-bound (not IO
> bound), so yielding the CPU and retrying may just be the right thing
> to do.
>

Potentially. I spent a few hours trying to construct a test case that
would migrate constantly that could be used as a basis for evaluating a
patch or alternative. Unfortunately it was not as easy as I thought and
I still have to construct a case that causes migration storms that would
result in multiple threads waiting on a single page.

> Because that code sequence doesn't actually depend on
> "wait_on_page_lock()" for _correctness_ anyway, afaik. Anybody who
> does "migration_entry_wait()" _has_ to retry anyway, since the page
> table contents may have changed by waiting.
>
> So I'm not proud of the attached patch, and I don't think it's really
> acceptable as-is, but maybe it's worth testing? And maybe it's
> arguably no worse than what we have now?
>
> Comments?
>

The transhuge migration path for numa balancing doesn't go through the
migration_entry_wait path, despite similarly named functions that suggest
it does, so this may only have an effect when THP is disabled. It's
worth trying anyway.

Covering both paths would be something like the patch below which spins
until the page is unlocked or it should reschedule. It's not even boot
tested as I spent what time I had on the test case that I hoped would be
able to prove it really works.

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 79b36f57c3ba..31cda1288176 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -517,6 +517,13 @@ static inline void wait_on_page_locked(struct page *page)
wait_on_page_bit(compound_head(page), PG_locked);
}

+void __spinwait_on_page_locked(struct page *page);
+static inline void spinwait_on_page_locked(struct page *page)
+{
+ if (PageLocked(page))
+ __spinwait_on_page_locked(page);
+}
+
static inline int wait_on_page_locked_killable(struct page *page)
{
if (!PageLocked(page))
diff --git a/mm/filemap.c b/mm/filemap.c
index a49702445ce0..c9d6f49614bc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1210,6 +1210,15 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
}
}

+void __spinwait_on_page_locked(struct page *page)
+{
+ do {
+ cpu_relax();
+ } while (PageLocked(page) && !cond_resched());
+
+ wait_on_page_locked(page);
+}
+
/**
* page_cache_next_hole - find the next hole (not-present entry)
* @mapping: mapping
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 90731e3b7e58..c7025c806420 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1443,7 +1443,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
if (!get_page_unless_zero(page))
goto out_unlock;
spin_unlock(vmf->ptl);
- wait_on_page_locked(page);
+ spinwait_on_page_locked(page);
put_page(page);
goto out;
}
@@ -1480,7 +1480,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
if (!get_page_unless_zero(page))
goto out_unlock;
spin_unlock(vmf->ptl);
- wait_on_page_locked(page);
+ spinwait_on_page_locked(page);
put_page(page);
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index e84eeb4e4356..9b6c3fc5beac 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -308,7 +308,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
if (!get_page_unless_zero(page))
goto out;
pte_unmap_unlock(ptep, ptl);
- wait_on_page_locked(page);
+ spinwait_on_page_locked(page);
put_page(page);
return;
out:


2017-08-21 18:56:25

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk

> > Because that code sequence doesn't actually depend on
> > "wait_on_page_lock()" for _correctness_ anyway, afaik. Anybody who
> > does "migration_entry_wait()" _has_ to retry anyway, since the page
> > table contents may have changed by waiting.
> >
> > So I'm not proud of the attached patch, and I don't think it's really
> > acceptable as-is, but maybe it's worth testing? And maybe it's
> > arguably no worse than what we have now?
> >
> > Comments?
> >
>
> The transhuge migration path for numa balancing doesn't go through the
> migration_entry_wait patch despite similarly named functions that suggest
> it does so this may only has the most effect when THP is disabled. It's
> worth trying anyway.

I just finished the test of yield patch (only functionality not performance).
Yes, it works well with THP disabled.
With THP enabled, I observed one LOCKUP caused by long queue wait.

Here is the call stack with THP enabled.
#
100.00% (ffffffff9e1aefca)
|
---wait_on_page_bit
do_huge_pmd_numa_page
__handle_mm_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
|
|--60.39%--0x2b7b7
| |
| |--34.26%--0x127d8
| | start_thread
| |
| --25.95%--0x127a2
| start_thread
|
--39.25%--0x2b788
|
--38.81%--0x127a2
start_thread


>
> Covering both paths would be something like the patch below which spins
> until the page is unlocked or it should reschedule. It's not even boot
> tested as I spent what time I had on the test case that I hoped would be
> able to prove it really works.

I will give it a try.

Thanks,
Kan

>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 79b36f57c3ba..31cda1288176 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -517,6 +517,13 @@ static inline void wait_on_page_locked(struct page
> *page)
> wait_on_page_bit(compound_head(page), PG_locked);
> }
>
> +void __spinwait_on_page_locked(struct page *page);
> +static inline void spinwait_on_page_locked(struct page *page)
> +{
> + if (PageLocked(page))
> + __spinwait_on_page_locked(page);
> +}
> +
> static inline int wait_on_page_locked_killable(struct page *page)
> {
> if (!PageLocked(page))
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..c9d6f49614bc 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1210,6 +1210,15 @@ int __lock_page_or_retry(struct page *page,
> struct mm_struct *mm,
> }
> }
>
> +void __spinwait_on_page_locked(struct page *page)
> +{
> + do {
> + cpu_relax();
> + } while (PageLocked(page) && !cond_resched());
> +
> + wait_on_page_locked(page);
> +}
> +
> /**
> * page_cache_next_hole - find the next hole (not-present entry)
> * @mapping: mapping
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 90731e3b7e58..c7025c806420 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1443,7 +1443,7 @@ int do_huge_pmd_numa_page(struct vm_fault
> *vmf, pmd_t pmd)
> if (!get_page_unless_zero(page))
> goto out_unlock;
> spin_unlock(vmf->ptl);
> - wait_on_page_locked(page);
> + spinwait_on_page_locked(page);
> put_page(page);
> goto out;
> }
> @@ -1480,7 +1480,7 @@ int do_huge_pmd_numa_page(struct vm_fault
> *vmf, pmd_t pmd)
> if (!get_page_unless_zero(page))
> goto out_unlock;
> spin_unlock(vmf->ptl);
> - wait_on_page_locked(page);
> + spinwait_on_page_locked(page);
> put_page(page);
> goto out;
> }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e84eeb4e4356..9b6c3fc5beac 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -308,7 +308,7 @@ void __migration_entry_wait(struct mm_struct *mm,
> pte_t *ptep,
> if (!get_page_unless_zero(page))
> goto out;
> pte_unmap_unlock(ptep, ptl);
> - wait_on_page_locked(page);
> + spinwait_on_page_locked(page);
> put_page(page);
> return;
> out:
>

2017-08-22 17:23:55

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk


> > Covering both paths would be something like the patch below which
> > spins until the page is unlocked or it should reschedule. It's not
> > even boot tested as I spent what time I had on the test case that I
> > hoped would be able to prove it really works.
>
> I will give it a try.

Although the patch doesn't trigger the watchdog, the spin lock wait time
is not small (0.45s).
It may get worse again on larger systems.


Irqsoff ftrace result.
# tracer: irqsoff
#
# irqsoff latency trace v1.1.5 on 4.13.0-rc4+
# --------------------------------------------------------------------
# latency: 451753 us, #4/4, CPU#159 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:224)
# -----------------
# | task: fjsctest-233851 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: wake_up_page_bit
# => ended at: wake_up_page_bit
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
<...>-233851 159d... 0us@: _raw_spin_lock_irqsave <-wake_up_page_bit
<...>-233851 159dN.. 451726us+: _raw_spin_unlock_irqrestore <-wake_up_page_bit
<...>-233851 159dN.. 451754us!: trace_hardirqs_on <-wake_up_page_bit
<...>-233851 159dN.. 451873us : <stack trace>
=> unlock_page
=> migrate_pages
=> migrate_misplaced_page
=> __handle_mm_fault
=> handle_mm_fault
=> __do_page_fault
=> do_page_fault
=> page_fault


The call stack of wait_on_page_bit_common

100.00% (ffffffff971b252b)
|
---__spinwait_on_page_locked
|
|--96.81%--__migration_entry_wait
| migration_entry_wait
| do_swap_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| |
| |--22.49%--0x123a2
| | |
| | --22.34%--start_thread
| |
| |--15.69%--0x127bc
| | |
| | --13.20%--start_thread
| |
| |--13.48%--0x12352
| | |
| | --11.74%--start_thread
| |
| |--13.43%--0x127f2
| | |
| | --11.25%--start_thread
| |
| |--10.03%--0x1285e
| | |
| | --8.59%--start_thread
| |
| |--5.90%--0x12894
| | |
| | --5.03%--start_thread
| |
| |--5.66%--0x12828
| | |
| | --4.81%--start_thread
| |
| |--5.17%--0x1233c
| | |
| | --4.46%--start_thread
| |
| --4.72%--0x2b788
| |
| --4.72%--0x127a2
| start_thread
|
--3.19%--do_huge_pmd_numa_page
__handle_mm_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
0x2b788
0x127a2
start_thread


>
> >
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index
> > 79b36f57c3ba..31cda1288176 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -517,6 +517,13 @@ static inline void wait_on_page_locked(struct
> > page
> > *page)
> > wait_on_page_bit(compound_head(page), PG_locked); }
> >
> > +void __spinwait_on_page_locked(struct page *page); static inline void
> > +spinwait_on_page_locked(struct page *page) {
> > + if (PageLocked(page))
> > + __spinwait_on_page_locked(page);
> > +}
> > +
> > static inline int wait_on_page_locked_killable(struct page *page) {
> > if (!PageLocked(page))
> > diff --git a/mm/filemap.c b/mm/filemap.c index
> > a49702445ce0..c9d6f49614bc 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1210,6 +1210,15 @@ int __lock_page_or_retry(struct page *page,
> > struct mm_struct *mm,
> > }
> > }
> >
> > +void __spinwait_on_page_locked(struct page *page) {
> > + do {
> > + cpu_relax();
> > + } while (PageLocked(page) && !cond_resched());
> > +
> > + wait_on_page_locked(page);
> > +}
> > +
> > /**
> > * page_cache_next_hole - find the next hole (not-present entry)
> > * @mapping: mapping
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c index
> > 90731e3b7e58..c7025c806420 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1443,7 +1443,7 @@ int do_huge_pmd_numa_page(struct vm_fault
> *vmf,
> > pmd_t pmd)
> > if (!get_page_unless_zero(page))
> > goto out_unlock;
> > spin_unlock(vmf->ptl);
> > - wait_on_page_locked(page);
> > + spinwait_on_page_locked(page);
> > put_page(page);
> > goto out;
> > }
> > @@ -1480,7 +1480,7 @@ int do_huge_pmd_numa_page(struct vm_fault
> *vmf,
> > pmd_t pmd)
> > if (!get_page_unless_zero(page))
> > goto out_unlock;
> > spin_unlock(vmf->ptl);
> > - wait_on_page_locked(page);
> > + spinwait_on_page_locked(page);
> > put_page(page);
> > goto out;
> > }
> > diff --git a/mm/migrate.c b/mm/migrate.c index
> > e84eeb4e4356..9b6c3fc5beac 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -308,7 +308,7 @@ void __migration_entry_wait(struct mm_struct
> *mm,
> > pte_t *ptep,
> > if (!get_page_unless_zero(page))
> > goto out;
> > pte_unmap_unlock(ptep, ptl);
> > - wait_on_page_locked(page);
> > + spinwait_on_page_locked(page);
> > put_page(page);
> > return;
> > out:
> >

2017-08-22 18:19:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan <[email protected]> wrote:
>
> Although the patch doesn't trigger watchdog, the spin lock wait time
> is not small (0.45s).
> It may get worse again on larger systems.

Yeah, I don't think Mel's patch is great - because I think we could do
so much better.

What I like about Mel's patch is that it recognizes that
"wait_on_page_locked()" there is special, and replaces it with
something else. I think that "something else" is worse than my
"yield()" call, though.

In particular, it wastes CPU time even in the good case, and the
process that will unlock the page may actually be waiting for us to
reschedule. It may be CPU bound, but it might well have just been
preempted out.

So if we do busy loops, I really think we should also make sure that
the thing we're waiting for is not preempted.

HOWEVER, I'm actually starting to think that there is perhaps
something else going on.

Let me walk you through my thinking:

This is the migration logic:

(a) migration locks the page

(b) migration is supposedly CPU-limited

(c) migration then unlocks the page.

Ignore all the details, that's the 10.000 ft view. Right?

Now, if the above is right, then I have a question for people:

HOW IN THE HELL DO WE HAVE TIME FOR THOUSANDS OF THREADS TO HIT THAT ONE PAGE?

That just sounds really sketchy to me. Even if all those thousands of
threads are runnable, we need to schedule into them just to get them
to wait on that one page.

So that sounds really quite odd when migration is supposed to hold the
page lock for a relatively short time and get out. Don't you agree?

Which is why I started thinking of what the hell could go on for that
long wait-queue to happen.

One thing that strikes me is that the way wait_on_page_bit() works is
that it will NOT wait until the next bit clearing, it will wait until
it actively *sees* the page bit being clear.

Now, work with me on that. What's the difference?

What we could have is some bad NUMA balancing pattern that actually
has a page that everybody touches.

And hey, we pretty much know that everybody touches that page, since
people get stuck on that wait-queue, right?

And since everybody touches it, as a result everybody eventually
thinks that page should be migrated to their NUMA node.

But for all we know, the migration keeps on failing, because one of
the points of that "lock page - try to move - unlock page" is that
*TRY* in "try to move". There's a number of things that makes it not
actually migrate. Like not being movable, or failing to isolate the
page, or whatever.

So we could have some situation where we end up locking and unlocking
the page over and over again (which admittedly is already a sign of
something wrong in the NUMA balancing, but that's a separate issue).

And if we get into that situation, where everybody wants that one hot
page, what happens to the waiters?

One of the thousands of waiters is unlucky (remember, this argument
started with the whole "you shouldn't get that many waiters on one
single page that isn't even locked for that long"), and goes:

(a) Oh, the page is locked, I will wait for the lock bit to clear

(b) go to sleep

(c) the migration fails, the lock bit is cleared, the waiter is woken
up but doesn't get the CPU immediately, and one of the other
*thousands* of threads decides to also try to migrate (see above),

(d) the guy waiting for the lock bit to clear will see the page
"still" locked (really just "locked again") and continue to wait.

In the meantime, one of the other threads happens to be unlucky, also
hits the race, and now we have one more thread waiting for that page
lock. It keeps getting unlocked, but it also keeps on getting locked,
and so the queue can keep growing.

See where I'm going here? I think it's really odd how *thousands* of
threads can hit that locked window that is supposed to be pretty
small. But I think it's much more likely if we have some kind of
repeated event going on.

So I'm starting to think that part of the problem may be how stupid
that "wait_on_page_bit_common()" code is. It really shouldn't wait
until it sees that the bit is clear. It could have been cleared and
then re-taken.

And honestly, we actually have extra code for that "let's go round
again". That seems pointless. If the bit has been cleared, we've been
woken up, and nothing else would have done so anyway, so if we're not
interested in locking, we're simply *done* after we've done the
"io_schedule()".

So I propose testing the attached trivial patch. It may not do
anything at all. But the existing code is actually doing extra work
just to be fragile, in case the scenario above can happen.

Comments?

Linus


Attachments:
patch.diff (572.00 B)

2017-08-22 18:25:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 11:19 AM, Linus Torvalds
<[email protected]> wrote:
>
> So I propose testing the attached trivial patch. It may not do
> anything at all. But the existing code is actually doing extra work
> just to be fragile, in case the scenario above can happen.

Side note: the patch compiles for me. But that is literally ALL the
testing it has gotten. I spent more time writing that email trying to
explain what my thinking was about that patch, than I spent anywhere
else on that patch.

So it may be garbage. Caveat probator.

Linus

2017-08-22 18:56:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..75c29a1f90fb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -991,13 +991,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> }
> }
>
> - if (lock) {
> - if (!test_and_set_bit_lock(bit_nr, &page->flags))
> - break;
> - } else {
> - if (!test_bit(bit_nr, &page->flags))
> - break;
> - }
> + if (!lock)
> + break;
> +
> + if (!test_and_set_bit_lock(bit_nr, &page->flags))
> + break;
> }

Won't we now prematurely terminate the wait when we get a spurious
wakeup?

2017-08-22 19:08:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote:
> And since everybody touches it, as a result everybody eventually
> thinks that page should be migrated to their NUMA node.

So that migration stuff has a filter on, we need two consecutive numa
faults from the same page_cpupid 'hash', see
should_numa_migrate_memory().
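
The filter in question is essentially this check near the top of
should_numa_migrate_memory() (paraphrased here for context):

	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);

	/* only migrate on the second fault from the same cpupid 'hash' */
	if (!cpupid_pid_unset(last_cpupid) &&
	    cpupid_to_nid(last_cpupid) != dst_nid)
		return false;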

And since this appears to be anonymous memory (no THP) this is all a
single address space. However, we don't appear to invalidate TLBs when
we upgrade the PTE protection bits (not strictly required of course), so
we can have multiple CPUs trip over the same 'old' NUMA PTE.

Still, generating such a migration storm would be fairly tricky I think.

2017-08-22 19:15:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 11:56 AM, Peter Zijlstra <[email protected]> wrote:
>
> Won't we now prematurely terminate the wait when we get a spurious
> wakeup?

I think there's two answers to that:

(a) do we even care?

(b) what spurious wakeup?

The "do we even care" question is because wait_on_page_bit() by
definition isn't really serializing. And I'm not even talking about
memory ordering, although that is true too - I'm talking just
fundamentally, that by definition when we're not locking, by the time
wait_on_page_bit() returns to the caller, it could obviously have
changed again.

So I think wait_on_page_bit() is by definition not really guaranteeing
that the bit really is clear. And I don't think we have really have
cases that matter.

But if we do - say, 'fsync()' waiting for a page to wait for
writeback, where would you get spurious wakeups from? They normally
happen either when we have nested waiting (eg a page fault happens
while we have other wait queues active), and I'm not seeing that being
an issue here.

That said, I do think we might want to perhaps make a "careful" vs
"just wait a bit" version of this if the patch works out.

The patch is primarily for testing this particular case. I actually
think it's probably ok in general, but maybe there really is some
special case that could have multiple wakeup sources and it needs to
see *this* particular one.

(We could perhaps handle that case by checking "is the wait-queue
empty now" instead, and just get rid of the re-arming, not break out
of the loop immediately after the io_schedule()).

Linus

2017-08-22 19:30:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 12:08 PM, Peter Zijlstra <[email protected]> wrote:
>
> So that migration stuff has a filter on, we need two consecutive numa
> faults from the same page_cpupid 'hash', see
> should_numa_migrate_memory().

Hmm. That is only called for MPOL_F_MORON.

We don't actually know what policy the problem space uses, since this
is some specialized load.

I could easily see somebody having set MPOL_PREFERRED with
MPOL_F_LOCAL and then touch it from every single node. Isn't that even
the default?

> And since this appears to be anonymous memory (no THP) this is all a
> single address space. However, we don't appear to invalidate TLBs when
> we upgrade the PTE protection bits (not strictly required of course), so
> we can have multiple CPUs trip over the same 'old' NUMA PTE.
>
> Still, generating such a migration storm would be fairly tricky I think.

Well, Mel seems to have been unable to generate a load that reproduces
the long page waitqueues. And I don't think we've had any other
reports of this either.

So "quite tricky" may well be exactly what it needs.

Likely also with a user load that does something that the people
involved in the automatic numa migration would have considered
completely insane and never tested or even thought about.

Users sometimes do completely insane things. It may have started as a
workaround for some particular case where they did something wrong "on
purpose", and then they entirely forgot about it, and five years later
it's running their whole infrastructure and doing insane things
because the "particular case" it was tested with was on some broken
preproduction machine with totally broken firmware tables for memory
node layout.

Linus

2017-08-22 19:37:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

> > Still, generating such a migration storm would be fairly tricky I think.
>
> Well, Mel seems to have been unable to generate a load that reproduces
> the long page waitqueues. And I don't think we've had any other
> reports of this either.

It could be that it requires a fairly large system. On large systems
under load a lot of things take much longer, so what's a tiny window on Mel's
system may suddenly be very large, and with many more threads
they have a higher chance of bad interactions anyway.

We only see it on 4S+ today. But systems are always getting larger,
so what's a large system today, will be a normal medium scale system
tomorrow.

BTW we also collected PT traces for the long hang cases, but it was
hard to find a consistent pattern in them.

-Andi

2017-08-22 19:55:42

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk


> So I propose testing the attached trivial patch.

It doesn’t work.
The call stack is the same.

100.00% (ffffffff821af140)
|
---wait_on_page_bit
__migration_entry_wait
migration_entry_wait
do_swap_page
__handle_mm_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
|
|--40.62%--0x123a2
| start_thread
|


> It may not do anything at all.
> But the existing code is actually doing extra work just to be fragile, in case the
> scenario above can happen.
>
> Comments?
>
> Linus

2017-08-22 20:42:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan <[email protected]> wrote:
>
>> So I propose testing the attached trivial patch.
>
> It doesn’t work.
> The call stack is the same.

So I would have expected the stack trace to be the same, and I would
even expect the CPU usage to be fairly similar, because you'd see
the callers repeating (taking the fault again if the page is -
once again - being migrated).

But I was hoping that the wait queues would be shorter because the
loop for the retry would be bigger.

Oh well.

I'm slightly out of ideas. Apparently the yield() worked ok (apart
from not catching all cases), and maybe we could do a version that
waits on the page bit in the non-contended case, but yields under
contention?

IOW, maybe this is the best we can do for now? Introducing that
"wait_on_page_migration()" helper might allow us to tweak this a bit
as people come up with better ideas..

And then add Tim's patch for the general worst-case just in case?

Linus


Attachments:
patch.diff (2.36 kB)

2017-08-22 20:53:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 01:42:13PM -0700, Linus Torvalds wrote:
> +void wait_on_page_bit_or_yield(struct page *page, int bit_nr)
> +{
> + if (PageWaiters(page)) {
> + yield();
> + return;
> + }
> + wait_on_page_bit(page, bit_nr);
> +}

So _the_ problem with yield() is that when you hit this with an RT task it
will busy-spin and possibly not allow the task that actually has the
lock to make progress at all.

So ideally there'd be a timeout or other limit on the amount of yield().

This being bit-spinlocks leaves us very short on state to play with
though :/
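
A crude way to bound it would be something along these lines (a sketch only,
with an arbitrary retry limit and no attempt to solve the missing-state
problem):

	/* hypothetical bounded variant of the yield-based wait */
	void wait_on_page_bit_or_yield_bounded(struct page *page, int bit_nr)
	{
		int tries = 100;	/* arbitrary cap on busy yielding */

		while (PageWaiters(page) && tries--) {
			if (!test_bit(bit_nr, &page->flags))
				return;
			yield();
		}
		/* fall back to sleeping on the bit as today */
		wait_on_page_bit(page, bit_nr);
	}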

2017-08-22 20:59:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 1:53 PM, Peter Zijlstra <[email protected]> wrote:
>
> So _the_ problem with yield() is when you hit this with a RT task it
> will busy spin and possibly not allow the task that actually has the
> lock to make progress at all.

I thought we had explicitly defined yield() to not do that.

But I guess we could make this yielding behavior depend on a few more
heuristics. So do the yield only when there is contention, and when
it's a non-RT task.

Linus

by Christopher Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, 22 Aug 2017, Andi Kleen wrote:

> We only see it on 4S+ today. But systems are always getting larger,
> so what's a large system today, will be a normal medium scale system
> tomorrow.
>
> BTW we also collected PT traces for the long hang cases, but it was
> hard to find a consistent pattern in them.

Hmmm... Maybe it would be wise to limit the pages autonuma can migrate?

If a page has more than 50 refcounts or so then don't migrate it. I think
a high number of refcounts and a high frequency of calls are reached in
particular for pages of the C library. Attempting to migrate those does
not make much sense anyway because the load may shift and another
function may become popular. We may end up shifting very-difficult-to-
migrate pages back and forth.
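
Something along these lines, perhaps (the threshold and the placement are
guesses, not a patch):

	/* hypothetical check before NUMA migration, e.g. early in
	 * migrate_misplaced_page(): leave heavily referenced pages alone */
	if (page_count(page) > 50)
		goto out;	/* too popular, not worth bouncing around */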

2017-08-22 21:24:10

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 04:08:52PM -0500, Christopher Lameter wrote:
> On Tue, 22 Aug 2017, Andi Kleen wrote:
>
> > We only see it on 4S+ today. But systems are always getting larger,
> > so what's a large system today, will be a normal medium scale system
> > tomorrow.
> >
> > BTW we also collected PT traces for the long hang cases, but it was
> > hard to find a consistent pattern in them.
>
> Hmmm... Maybe it would be wise to limit the pages autonuma can migrate?
>
> If a page has more than 50 refcounts or so then dont migrate it. I think
> high number of refcounts and a high frequewncy of calls are reached in
> particular for pages of the c library. Attempting to migrate those does
> not make much sense anyways because the load may shift and another
> function may become popular. We may end up shifting very difficult to
> migrate pages back and forth.

I believe in this case it's used by threads, so a reference count limit
wouldn't help.

If migrating code was a problem I would probably rather just disable
migration of read-only pages.

-Andi

2017-08-22 22:52:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen <[email protected]> wrote:
>
> I believe in this case it's used by threads, so a reference count limit
> wouldn't help.

For the first migration try, yes. But if it's some kind of "try and
try again" pattern, the second time you try and there are people
waiting for the page, the page count (not the map count) would be
elevated.

So it's possible that depending on exactly what the deeper problem is,
the "this page is very busy, don't migrate" case might be
discoverable, and the page count might be part of it.

However, after PeterZ made that comment that page migration should
have that should_numa_migrate_memory() filter, I am looking at that
mpol_misplaced() code.

And honestly, that MPOL_PREFERRED / MPOL_F_LOCAL case really looks
like complete garbage to me.

It looks like garbage exactly because it says "always migrate to the
current node", but that's crazy - if it's a group of threads all
running together on the same VM, that obviously will just bounce the
page around for absolutely zero good reason.

The *other* memory policies look fairly sane. They basically have a
fairly well-defined preferred node for the policy (although the
"MPOL_INTERLEAVE" looks wrong for a hugepage). But
MPOL_PREFERRED/MPOL_F_LOCAL really looks completely broken.

Maybe people expected that anybody who uses MPOL_F_LOCAL will also
bind all threads to one single node?

Could we perhaps make that "MPOL_PREFERRED / MPOL_F_LOCAL" case just
do the MPOL_F_MORON policy, which *does* use that "should I migrate to
the local node" filter?
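
A sketch of that direction in mpol_misplaced(), paraphrasing the current
structure from memory (this is the shape of the idea, not the attached
patch):

	/* today only MPOL_F_MORON (automatic NUMA balancing) runs the
	 * two-fault filter; the idea is to send the "prefer local" case
	 * through the same filter instead of always picking this node */
	if (pol->flags & (MPOL_F_MORON | MPOL_F_LOCAL)) {
		polnid = thisnid;
		if (!should_numa_migrate_memory(current, page, curnid,
						thiscpu))
			goto out;
	}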

IOW, we've been looking at the waiters (because the problem shows up
due to the excessive wait queues), but maybe the source of the problem
comes from the numa balancing code just insanely bouncing pages
back-and-forth if you use that "always balance to local node" thing.

Untested (as always) patch attached.

Linus


Attachments:
patch.diff (840.00 B)

2017-08-22 23:19:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 3:52 PM, Linus Torvalds
<[email protected]> wrote:
>
> The *other* memory policies look fairly sane. They basically have a
> fairly well-defined preferred node for the policy (although the
> "MPOL_INTERLEAVE" looks wrong for a hugepage). But
> MPOL_PREFERRED/MPOL_F_LOCAL really looks completely broken.

Of course, I don't know if that customer test-case actually triggers
that MPOL_PREFERRED/MPOL_F_LOCAL case at all.

So again, that issue may not even be what is going on.

Linus

2017-08-23 14:49:56

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk

> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan <[email protected]> wrote:
> >
> >> So I propose testing the attached trivial patch.
> >
> > It doesn’t work.
> > The call stack is the same.
>
> So I would have expected the stack trace to be the same, and I would even
> expect the CPU usage to be fairly similar, because you'd see repeating from
> the callers (taking the fault again if the page is - once again - being migrated).
>
> But I was hoping that the wait queues would be shorter because the loop for
> the retry would be bigger.
>
> Oh well.
>
> I'm slightly out of ideas. Apparently the yield() worked ok (apart from not
> catching all cases), and maybe we could do a version that waits on the page
> bit in the non-contended case, but yields under contention?
>
> IOW, maybe this is the best we can do for now? Introducing that
> "wait_on_page_migration()" helper might allow us to tweak this a bit as
> people come up with better ideas..

The "wait_on_page_migration()" helper works well in the overnight testing.

Thanks,
Kan

>
> And then add Tim's patch for the general worst-case just in case?
>
> Linus




2017-08-23 14:51:18

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk


> Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
>
> On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen <[email protected]> wrote:
> >
> > I believe in this case it's used by threads, so a reference count
> > limit wouldn't help.
>
> For the first migration try, yes. But if it's some kind of "try and try again"
> pattern, the second time you try and there are people waiting for the page,
> the page count (not the map count) would be elevanted.
>
> So it's possible that depending on exactly what the deeper problem is, the
> "this page is very busy, don't migrate" case might be discoverable, and the
> page count might be part of it.
>
> However, after PeterZ made that comment that page migration should have
> that should_numa_migrate_memory() filter, I am looking at that
> mpol_misplaced() code.
>
> And honestly, that MPOL_PREFERRED / MPOL_F_LOCAL case really looks like
> complete garbage to me.
>
> It looks like garbage exactly because it says "always migrate to the current
> node", but that's crazy - if it's a group of threads all running together on the
> same VM, that obviously will just bounce the page around for absolute zero
> good ewason.
>
> The *other* memory policies look fairly sane. They basically have a fairly
> well-defined preferred node for the policy (although the
> "MPOL_INTERLEAVE" looks wrong for a hugepage). But
> MPOL_PREFERRED/MPOL_F_LOCAL really looks completely broken.
>
> Maybe people expected that anybody who uses MPOL_F_LOCAL will also
> bind all threads to one single node?
>
> Could we perhaps make that "MPOL_PREFERRED / MPOL_F_LOCAL" case just
> do the MPOL_F_MORON policy, which *does* use that "should I migrate to
> the local node" filter?
>
> IOW, we've been looking at the waiters (because the problem shows up due
> to the excessive wait queues), but maybe the source of the problem comes
> from the numa balancing code just insanely bouncing pages back-and-forth if
> you use that "always balance to local node" thing.
>
> Untested (as always) patch attached.

The patch doesn’t work.

Thanks,
Kan

2017-08-23 15:59:11

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On 08/23/2017 07:49 AM, Liang, Kan wrote:
>> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan <[email protected]> wrote:
>>>
>>>> So I propose testing the attached trivial patch.
>>>
>>> It doesn’t work.
>>> The call stack is the same.
>>
>> So I would have expected the stack trace to be the same, and I would even
>> expect the CPU usage to be fairly similar, because you'd see repeating from
>> the callers (taking the fault again if the page is - once again - being migrated).
>>
>> But I was hoping that the wait queues would be shorter because the loop for
>> the retry would be bigger.
>>
>> Oh well.
>>
>> I'm slightly out of ideas. Apparently the yield() worked ok (apart from not
>> catching all cases), and maybe we could do a version that waits on the page
>> bit in the non-contended case, but yields under contention?
>>
>> IOW, maybe this is the best we can do for now? Introducing that
>> "wait_on_page_migration()" helper might allow us to tweak this a bit as
>> people come up with better ideas..
>
> The "wait_on_page_migration()" helper works well in the overnight testing.
>


Linus,

Will you still consider the original patch as a fail safe mechanism?

Thanks.

Tim

2017-08-23 16:04:07

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote:
> On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan <[email protected]> wrote:
> >
> > Although the patch doesn't trigger watchdog, the spin lock wait time
> > is not small (0.45s).
> > It may get worse again on larger systems.
>
> Yeah, I don't think Mel's patch is great - because I think we could do
> so much better.
>
> What I like about Mel's patch is that it recognizes that
> "wait_on_page_locked()" there is special, and replaces it with
> something else. I think that "something else" is worse than my
> "yield()" call, though.
>

I only partially agree. yield() can be unbounded if there are an indefinite
number of lock holders or frequent reacquisitions. yield() also has some
warnings around it related to potentially never doing the actual yield, which
can cause lockup warnings. What I was aiming for was the easiest path to "try
for a bit but give up in a reasonable amount of time". I picked waiting
on the page lock because at least it'll recover. I could have returned
and allowed the fault to retry but thought this may consume excessive CPU.

I spent more time on the test case to try and get some sort of useful
data out of it and that took most of the time I had available again. The
current state of the test case still isn't hitting the worst patterns but
it can at least detect latency problems. It uses multiple threads bound to
different nodes to access thread-private data within a large buffer where
each thread's data is aligned. For example, an alignment of 64 would have
each thread access a private cache line while still sharing a page from
a NUMA balancing point of view. 4K would still share a page if THP was
used. I've a few patches running on a 4-socket machine with 144 CPUs that I
borrowed for a few hours, and hopefully something will fall out of that.
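
A minimal sketch of a test along the lines described above, assuming libnuma
for the node binding (this is not Mel's actual test case; the thread count,
buffer size, alignment and iteration count are placeholders): each thread is
bound to a node and hammers its own aligned slot in one shared buffer, so the
threads touch private cache lines while sharing a page from a NUMA-balancing
point of view.

#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_THREADS	16
#define ALIGN		64	/* 64: per-thread cache line; with THP even 4096 still shares a page */
#define ITERS		(100UL * 1000 * 1000)

static char *buf;

static void *worker(void *arg)
{
	long id = (long)arg;
	volatile char *slot = buf + id * ALIGN;	/* thread-private data, shared page */
	unsigned long i;

	numa_run_on_node(id % (numa_max_node() + 1));	/* bind this thread to a node */
	for (i = 0; i < ITERS; i++)
		slot[0]++;
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	long i;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	buf = malloc(NR_THREADS * (size_t)ALIGN);	/* small enough to sit in one page */

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	free(buf);
	return 0;
}

Build with something like: gcc -O2 -pthread test.c -lnuma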

> So if we do busy loops, I really think we should also make sure that
> the thing we're waiting for is not preempted.
>
> HOWEVER, I'm actually starting to think that there is perhaps
> something else going on.
>
> Let me walk you through my thinking:
>
> This is the migration logic:
>
> (a) migration locks the page
>
> (b) migration is supposedly CPU-limited
>
> (c) migration then unlocks the page.
>
> Ignore all the details, that's the 10.000 ft view. Right?
>

Right.

> Now, if the above is right, then I have a question for people:
>
> HOW IN THE HELL DO WE HAVE TIME FOR THOUSANDS OF THREADS TO HIT THAT ONE PAGE?
>

There are not many explanations. Given that it's thousands of threads, it
may be the case that some are waiting while there are many more contending
on CPU. The migration process can get scheduled out which might compound the
problem. While THP migration gives up quickly after a migration failure,
base page migration does not. If any part of the migration path returns
EAGAIN, it'll retry up to 10 times and, depending on where it is, that can
mean the migrating process is locking the page 10 times. If it's fast
enough reacquiring the lock, the waiting processes will wait for each of
those 10 attempts because they don't notice that base page migration has
already cleared the NUMA pte.

Given that NUMA balancing is best effort, the 10 attempts for numa balancing
are questionable. A patch to test should be straightforward, so I'll spit
it out after this mail and queue it up.
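
The retry behaviour being questioned can be sketched roughly as follows
(a userspace stand-in with stubbed helpers, not the real mm/migrate.c code):
every -EAGAIN pass takes and drops the page lock again, so waiters who saw the
page locked can end up waiting through all of the passes even though the NUMA
pte was cleared after the first attempt.

#include <errno.h>
#include <stdio.h>

/* Stubs standing in for the real page lock and one migration attempt. */
static void lock_page(void)	{ /* waiters queue on the page lock here */ }
static void unlock_page(void)	{ /* wakes the whole wait list */ }
static int try_migrate(void)	{ return -EAGAIN; /* pretend isolation keeps failing */ }

static int migrate_with_retries(int passes)
{
	int pass, ret = -EAGAIN;

	for (pass = 0; pass < passes && ret == -EAGAIN; pass++) {
		lock_page();
		ret = try_migrate();
		unlock_page();
		/* Base page migration goes around again: up to `passes`
		 * lock/unlock cycles for a single misplaced page. */
	}
	return ret;
}

int main(void)
{
	/* 10 passes mirrors the current base-page behaviour described above;
	 * a NUMA-balancing-only caller could arguably use 1. */
	printf("migrate result: %d\n", migrate_with_retries(10));
	return 0;
}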

> That just sounds really sketchy to me. Even if all those thousands of
> threads are runnable, we need to schedule into them just to get them
> to wait on that one page.
>

The same is true if they are just yielding.

> So that sounds really quite odd when migration is supposed to hold the
> page lock for a relatively short time and get out. Don't you agree?
>

Yes. As part of that getting out, it shouldn't retry 10 times.

> Which is why I started thinking of what the hell could go on for that
> long wait-queue to happen.
>
> One thing that strikes me is that the way wait_on_page_bit() works is
> that it will NOT wait until the next bit clearing, it will wait until
> it actively *sees* the page bit being clear.
>
> Now, work with me on that. What's the difference?
>
> What we could have is some bad NUMA balancing pattern that actually
> has a page that everybody touches.
>
> And hey, we pretty much know that everybody touches that page, since
> people get stuck on that wait-queue, right?
>
> And since everybody touches it, as a result everybody eventually
> thinks that page should be migrated to their NUMA node.
>

Potentially yes. There is a two-pass filter as mentioned elsewhere in the
thread and the scanner has to update the PTEs for that to happen but it's not
completely impossible. Once a migration starts, other threads shouldn't try
again until the next window. That window can be small but with thousands
of threads potentially scanning (even at a very slow rate), the window
could be tiny. If many threads are doing the scanning one after the other,
it would potentially allow the two-pass check to pass sooner than expected.

Coincidentally, Rik encountered this class of problem and there is a
patch in Andrew's tree "sched/numa: Scale scan period with tasks in group
and shared/private" that might have an impact on this problem.

> But for all we know, the migration keeps on failing, because one of
> the points of that "lock page - try to move - unlock page" is that
> *TRY* in "try to move". There's a number of things that makes it not
> actually migrate. Like not being movable, or failing to isolate the
> page, or whatever.
>
> So we could have some situation where we end up locking and unlocking
> the page over and over again (which admittedly is already a sign of
> something wrong in the NUMA balancing, but that's a separate issue).
>

The retries are part of the picture in the migration side. Multiple
protection updates from large numbers of threads are another potential
source.

> And if we get into that situation, where everybody wants that one hot
> page, what happens to the waiters?
>
> One of the thousands of waiters is unlucky (remember, this argument
> started with the whole "you shouldn't get that many waiters on one
> single page that isn't even locked for that long"), and goes:
>
> (a) Oh, the page is locked, I will wait for the lock bit to clear
>
> (b) go to sleep
>
> (c) the migration fails, the lock bit is cleared, the waiter is woken
> up but doesn't get the CPU immediately, and one of the other
> *thousands* of threads decides to also try to migrate (see above),
>
> (d) the guy waiting for the lock bit to clear will see the page
> "still" locked (really just "locked again") and continue to wait.
>

Part c may be slightly inaccurate but I think a similar situation can occur
with multiple threads deciding to do change_prot_numa in quick succession
so it's functionally similar.

> In the meantime, one of the other threads happens to be unlucky, also
> hits the race, and now we have one more thread waiting for that page
> lock. It keeps getting unlocked, but it also keeps on getting locked,
> and so the queue can keep growing.
>
> See where I'm going here? I think it's really odd how *thousands* of
> threads can hit that locked window that is supposed to be pretty
> small. But I think it's much more likely if we have some kind of
> repeated event going on.
>

Agreed.

> So I'm starting to think that part of the problem may be how stupid
> that "wait_for_page_bit_common()" code is. It really shouldn't wait
> until it sees that the bit is clear. It could have been cleared and
> then re-taken.
>
> And honestly, we actually have extra code for that "let's go round
> again". That seems pointless. If the bit has been cleared, we've been
> woken up, and nothing else would have done so anyway, so if we're not
> interested in locking, we're simply *done* after we've done the
> "io_scheduler()".
>
> So I propose testing the attached trivial patch. It may not do
> anything at all. But the existing code is actually doing extra work
> just to be fragile, in case the scenario above can happen.
>
> Comments?

Nothing useful to add on top of Peter's concerns but I haven't thought
about that aspect of the thread very much. I'm going to try see if a
patch that avoids multiple migration retries or Rik's patch has a
noticeable impact, in case they are enough on their own.

--
Mel Gorman
SUSE Labs
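
The distinction drawn in the quoted mail above -- sleeping until the bit is
actually observed clear versus returning after one wakeup -- can be modelled in
userspace (a rough sketch with a condition variable standing in for the page
wait queue; it is not the kernel code, and it deliberately ignores spurious
wakeups because the point is the loop shape):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static bool page_locked;	/* stands in for PG_locked */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

/* Current behaviour (modelled): keep sleeping until we actually observe the
 * bit clear.  If other threads keep re-taking the lock before we get to run,
 * we can sleep through many lock/unlock cycles. */
static void wait_until_observed_clear(void)
{
	pthread_mutex_lock(&q_lock);
	while (page_locked)
		pthread_cond_wait(&q_cond, &q_lock);
	pthread_mutex_unlock(&q_lock);
}

/* Proposed behaviour (modelled): if we only wanted to wait, the first wakeup
 * is the event; return even if someone has locked the page again. */
static void wait_for_one_unlock(void)
{
	pthread_mutex_lock(&q_lock);
	if (page_locked)
		pthread_cond_wait(&q_cond, &q_lock);
	pthread_mutex_unlock(&q_lock);
}

static void unlock_page(void)
{
	pthread_mutex_lock(&q_lock);
	page_locked = false;
	pthread_cond_broadcast(&q_cond);	/* wake every waiter */
	pthread_mutex_unlock(&q_lock);
}

int main(void)
{
	/* With the page initially unlocked both waiters return at once; the
	 * difference only matters under a re-locking storm like the one
	 * described in this thread. */
	wait_until_observed_clear();
	wait_for_one_unlock();
	unlock_page();
	return 0;
}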

2017-08-23 18:17:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen <[email protected]> wrote:
>
> Will you still consider the original patch as a fail safe mechanism?

I don't think we have much choice, although I would *really* want to
get this root-caused rather than just papering over the symptoms.

Maybe still worth testing that "sched/numa: Scale scan period with
tasks in group and shared/private" patch that Mel mentioned.

In fact, looking at that patch description, it does seem to match this
particular load a lot. Quoting from the commit message:

"Running 80 tasks in the same group, or as threads of the same process,
results in the memory getting scanned 80x as fast as it would be if a
single task was using the memory.

This really hurts some workloads"

So if 80 threads cause 80x as much scanning, a few thousand threads
might indeed be really really bad.

So once more unto the breach, dear friends, once more.

Please.

The patch got applied to -tip as commit b5dd77c8bdad, and can be
downloaded here:

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=b5dd77c8bdada7b6262d0cba02a6ed525bf4e6e1

(Hmm. It says it's cc'd to me, but I never noticed that patch simply
because it was in a big group of other -tip commits.. Oh well).

Linus
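
A back-of-envelope illustration of why the thread count matters here, assuming
the default minimum per-task scan period of 1000 ms
(numa_balancing_scan_period_min_ms) and a thread count of the order reported in
this thread -- both numbers are assumptions for illustration only:

#include <stdio.h>

int main(void)
{
	int threads = 4000;		/* assumed: order of magnitude of this workload */
	int per_task_period_ms = 1000;	/* assumed: default minimum scan period */

	/* Each task scans the shared address space once per period, so the
	 * group as a whole revisits (and can re-mark) the same memory roughly
	 * every period / threads milliseconds. */
	printf("aggregate scan interval: %.3f ms\n",
	       (double)per_task_period_ms / threads);
	return 0;
}

With 80 threads that is one pass every 12.5 ms; with 4000 threads it drops to a
quarter of a millisecond, consistent with the "80 tasks scan 80x as fast"
observation in the commit message quoted above.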

2017-08-23 20:55:30

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk

>
> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen <[email protected]>
> wrote:
> >
> > Will you still consider the original patch as a fail safe mechanism?
>
> I don't think we have much choice, although I would *really* want to get this
> root-caused rather than just papering over the symptoms.
>
> Maybe still worth testing that "sched/numa: Scale scan period with tasks in
> group and shared/private" patch that Mel mentioned.

The patch doesn’t help on our load.

Thanks,
Kan
>
> In fact, looking at that patch description, it does seem to match this particular
> load a lot. Quoting from the commit message:
>
> "Running 80 tasks in the same group, or as threads of the same process,
> results in the memory getting scanned 80x as fast as it would be if a
> single task was using the memory.
>
> This really hurts some workloads"
>
> So if 80 threads cause 80x as much scanning, a few thousand threads might
> indeed be really really bad.
>
> So once more unto the breach, dear friends, once more.
>
> Please.
>
> The patch got applied to -tip as commit b5dd77c8bdad, and can be
> downloaded here:
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=b5dd77c8bdada7b6262d0cba02a6ed525bf4e6e1
>
> (Hmm. It says it's cc'd to me, but I never noticed that patch simply because it
> was in a big group of other -tip commits.. Oh well).
>
> Linus

2017-08-23 23:30:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen <[email protected]> wrote:
>>
>> Will you still consider the original patch as a fail safe mechanism?
>
> I don't think we have much choice, although I would *really* want to
> get this root-caused rather than just papering over the symptoms.

Oh well. Apparently we're not making progress on that, so I looked at
the patch again.

Can we fix it up a bit? In particular, the "bookmark_wake_function()"
thing added no value, and definitely shouldn't have been exported.
Just use NULL instead.

And the WAITQUEUE_WALK_BREAK_CNT thing should be internal to
__wake_up_common(), not in some common header file. Again, there's no
value in exporting it to anybody else.

And doing

if (curr->flags & WQ_FLAG_BOOKMARK)

looks odd, when we just did

unsigned flags = curr->flags;

one line earlier, so that can be just simplified.

So can you test that simplified version of the patch? I'm attaching my
suggested edited patch, but you may just want to do those changes
directly to your tree instead.

Hmm?

Linus


Attachments:
patch.diff (4.90 kB)
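
For reference, the three edits described above amount to roughly the following
shape (a sketch of the change, not the attached patch.diff itself):

/* 1. Keep the walk-break threshold private to kernel/sched/wait.c
 *    instead of exporting it from include/linux/wait.h. */
#define WAITQUEUE_WALK_BREAK_CNT 64

/* 2. The bookmark entry carries no wake callback at all; it is recognised
 *    by its flag, so no exported bookmark_wake_function() is needed. */
bookmark.flags = WQ_FLAG_BOOKMARK;
bookmark.func = NULL;

/* 3. Inside the walk, reuse the flags value that was already read. */
unsigned flags = curr->flags;
if (flags & WQ_FLAG_BOOKMARK)	/* was: if (curr->flags & WQ_FLAG_BOOKMARK) */
	continue;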

2017-08-24 17:49:20

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On 08/23/2017 04:30 PM, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen <[email protected]> wrote:
>>>
>>> Will you still consider the original patch as a fail safe mechanism?
>>
>> I don't think we have much choice, although I would *really* want to
>> get this root-caused rather than just papering over the symptoms.
>
> Oh well. Apparently we're not making progress on that, so I looked at
> the patch again.
>
> Can we fix it up a bit? In particular, the "bookmark_wake_function()"
> thing added no value, and definitely shouldn't have been exported.
> Just use NULL instead.
>
> And the WAITQUEUE_WALK_BREAK_CNT thing should be internal to
> __wake_up_common(), not in some common header file. Again, there's no
> value in exporting it to anybody else.
>
> And doing
>
> if (curr->flags & WQ_FLAG_BOOKMARK)
>
> looks odd, when we just did
>
> unsigned flags = curr->flags;
>
> one line earlier, so that can be just simplified.
>
> So can you test that simplified version of the patch? I'm attaching my
> suggested edited patch, but you may just want to do those changes
> directly to your tree instead.

These changes look fine. We are testing them now.
Does the second patch in the series look okay to you?

Thanks.

Tim


2017-08-24 18:16:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen <[email protected]> wrote:
>
> These changes look fine. We are testing them now.
> Does the second patch in the series look okay to you?

I didn't really have any reaction to that one; as long as Mel&co are
ok with it, I'm fine with it.

Linus

2017-08-24 20:44:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote:
> On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen <[email protected]> wrote:
> >
> > These changes look fine. We are testing them now.
> > Does the second patch in the series look okay to you?
>
> I didn't really have any reaction to that one; as long as Mel&co are
> ok with it, I'm fine with it.
>

I've no strong objections or concerns. I'm disappointed that the
original root cause for this could not be found, but I hope that a
reproducible test case will eventually be available. Despite having access
to a 4-socket box, I was still unable to create a workload that caused
large delays on wakeup. I'm going to have to stop, as I don't think it's
possible to create one on that particular machine for whatever reason.

--
Mel Gorman
SUSE Labs

2017-08-25 16:44:19

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On 08/24/2017 01:44 PM, Mel Gorman wrote:
> On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote:
>> On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen <[email protected]> wrote:
>>>
>>> These changes look fine. We are testing them now.
>>> Does the second patch in the series look okay to you?
>>
>> I didn't really have any reaction to that one, as long as Mel&co are
>> ok with it, I'm fine with it.
>>
>
> I've no strong objections or concerns. I'm disappointed that the
> original root cause for this could not be found, but I hope that a
> reproducible test case will eventually be available. Despite having access
> to a 4-socket box, I was still unable to create a workload that caused
> large delays on wakeup. I'm going to have to stop, as I don't think it's
> possible to create one on that particular machine for whatever reason.
>

Kan helped test the updated patch 1 from Linus. It worked fine.
I've refreshed the patch set to include all the changes
and will send a version 2 of the patch set separately.

Thanks.

Tim