2017-04-05 21:00:17

by Hugh Dickins

Subject: Is it safe for kthreadd to drain_all_pages?

Hi Mel,

I suspect that it's not safe for kthreadd to drain_all_pages();
but I haven't studied flush_work() etc, so don't really know what
I'm talking about: hoping that you will jump to a realization.

4.11-rc has been giving me hangs after hours of swapping load. At
first they looked like memory leaks ("fork: Cannot allocate memory");
but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
before looking at /proc/meminfo one time, and the stat_refresh stuck
in D state, waiting for completion of flush_work like many kworkers.
kthreadd waiting for completion of flush_work in drain_all_pages().

But I only noticed that pattern later: originally tried to bisect
rc1 before rc2 came out, but underestimated how long to wait before
deciding a stage good - I thought 12 hours, but would now say 2 days.
Too late for bisection, I suspect your drain_all_pages() changes.

(I've also found order:0 page allocation stalls in /var/log/messages,
148804ms being a nice example: these suggest that the hangs are perhaps a
condition it can sometimes get out of by itself. None with the patch.)

Patch below has been running well for 36 hours now:
a bit too early to be sure, but I think it's time to turn to you.


[PATCH] mm: don't let kthreadd drain_all_pages

4.11-rc has been giving me hangs after many hours of swapping load: most
kworkers waiting for completion of a flush_work, kthreadd waiting for
completion of flush_work in drain_all_pages (while doing copy_process).
I suspect that kthreadd should not be allowed to drain_all_pages().

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/page_alloc.c | 2 ++
1 file changed, 2 insertions(+)

--- 4.11-rc5/mm/page_alloc.c 2017-03-13 09:08:37.743209168 -0700
+++ linux/mm/page_alloc.c 2017-04-04 00:33:44.086867413 -0700
@@ -2376,6 +2376,8 @@ void drain_all_pages(struct zone *zone)
 	/* Workqueues cannot recurse */
 	if (current->flags & PF_WQ_WORKER)
 		return;
+	if (current == kthreadd_task)
+		return;
 
 	/*
 	 * Do not drain if one is already in progress unless it's specific to


2017-04-06 08:55:45

by Michal Hocko

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Wed 05-04-17 13:59:49, Hugh Dickins wrote:
> Hi Mel,
>
> I suspect that it's not safe for kthreadd to drain_all_pages();
> but I haven't studied flush_work() etc, so don't really know what
> I'm talking about: hoping that you will jump to a realization.
>
> 4.11-rc has been giving me hangs after hours of swapping load. At
> first they looked like memory leaks ("fork: Cannot allocate memory");
> but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> before looking at /proc/meminfo one time, and the stat_refresh stuck
> in D state, waiting for completion of flush_work like many kworkers.
> kthreadd waiting for completion of flush_work in drain_all_pages().
>
> But I only noticed that pattern later: originally tried to bisect
> rc1 before rc2 came out, but underestimated how long to wait before
> deciding a stage good - I thought 12 hours, but would now say 2 days.
> Too late for bisection, I suspect your drain_all_pages() changes.

Yes, this is fallout from Mel's changes. I was about to say that my
follow-up fixes, which moved this flushing to a single WQ with a rescuer,
fixed that, but it seems that
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
didn't make it to the Linus tree. Could you re-test with this one?
While your change is obviously correct, I think the above should address
it as well, and it is more generic. If it works then I will ask Andrew to
send the above to Linus (along with its follow-up
mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch)
--
Michal Hocko
SUSE Labs

2017-04-06 13:06:45

by Mel Gorman

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Wed, Apr 05, 2017 at 01:59:49PM -0700, Hugh Dickins wrote:
> Hi Mel,
>
> I suspect that it's not safe for kthreadd to drain_all_pages();
> but I haven't studied flush_work() etc, so don't really know what
> I'm talking about: hoping that you will jump to a realization.
>

You're right, it's not safe: kthreadd is the one creating the workqueue
thread to do the drain, so it'll recurse into itself.

> 4.11-rc has been giving me hangs after hours of swapping load. At
> first they looked like memory leaks ("fork: Cannot allocate memory");
> but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> before looking at /proc/meminfo one time, and the stat_refresh stuck
> in D state, waiting for completion of flush_work like many kworkers.
> kthreadd waiting for completion of flush_work in drain_all_pages().
>

It's asking itself to do work in all likelihood.

> Patch below has been running well for 36 hours now:
> a bit too early to be sure, but I think it's time to turn to you.
>

I think the patch is valid but like Michal, would appreciate if you
could run the patch he linked to see if it also side-steps the same
problem.

Good spot!

--
Mel Gorman
SUSE Labs

2017-04-06 18:52:55

by Hugh Dickins

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Thu, 6 Apr 2017, Mel Gorman wrote:
> On Wed, Apr 05, 2017 at 01:59:49PM -0700, Hugh Dickins wrote:
> > Hi Mel,
> >
> > I suspect that it's not safe for kthreadd to drain_all_pages();
> > but I haven't studied flush_work() etc, so don't really know what
> > I'm talking about: hoping that you will jump to a realization.
> >
>
> You're right, it's not safe: kthreadd is the one creating the workqueue
> thread to do the drain, so it'll recurse into itself.
>
> > 4.11-rc has been giving me hangs after hours of swapping load. At
> > first they looked like memory leaks ("fork: Cannot allocate memory");
> > but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> > before looking at /proc/meminfo one time, and the stat_refresh stuck
> > in D state, waiting for completion of flush_work like many kworkers.
> > kthreadd waiting for completion of flush_work in drain_all_pages().
> >
>
> It's asking itself to do work in all likelihood.
>
> > Patch below has been running well for 36 hours now:
> > a bit too early to be sure, but I think it's time to turn to you.
> >
>
> I think the patch is valid but like Michal, would appreciate if you
> could run the patch he linked to see if it also side-steps the same
> problem.
>
> Good spot!

Thank you both for explanations, and direction to the two "drainging"
patches. I've put those on to 4.11-rc5 (and double-checked that I've
taken mine off), and set it going. Fine so far but much too soon to
tell - mine did 56 hours with clean /var/log/messages before I switched,
so I demand no less of Michal's :). I'll report back tomorrow and the
day after (unless badness appears sooner once I'm home).

Hugh

2017-04-07 16:25:50

by Hugh Dickins

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Thu, 6 Apr 2017, Hugh Dickins wrote:
> On Thu, 6 Apr 2017, Mel Gorman wrote:
> > On Wed, Apr 05, 2017 at 01:59:49PM -0700, Hugh Dickins wrote:
> > > Hi Mel,
> > >
> > > I suspect that it's not safe for kthreadd to drain_all_pages();
> > > but I haven't studied flush_work() etc, so don't really know what
> > > I'm talking about: hoping that you will jump to a realization.
> > >
> >
> > You're right, it's not safe: kthreadd is the one creating the workqueue
> > thread to do the drain, so it'll recurse into itself.
> >
> > > 4.11-rc has been giving me hangs after hours of swapping load. At
> > > first they looked like memory leaks ("fork: Cannot allocate memory");
> > > but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> > > before looking at /proc/meminfo one time, and the stat_refresh stuck
> > > in D state, waiting for completion of flush_work like many kworkers.
> > > kthreadd waiting for completion of flush_work in drain_all_pages().
> > >
> >
> > It's asking itself to do work in all likelihood.
> >
> > > Patch below has been running well for 36 hours now:
> > > a bit too early to be sure, but I think it's time to turn to you.
> > >
> >
> > I think the patch is valid but like Michal, would appreciate if you
> > could run the patch he linked to see if it also side-steps the same
> > problem.
> >
> > Good spot!
>
> Thank you both for explanations, and direction to the two "drainging"
> patches. I've put those on to 4.11-rc5 (and double-checked that I've
> taken mine off), and set it going. Fine so far but much too soon to
> tell - mine did 56 hours with clean /var/log/messages before I switched,
> so I demand no less of Michal's :). I'll report back tomorrow and the
> day after (unless badness appears sooner once I'm home).

24 hours so far, and with a clean /var/log/messages. Not conclusive
yet, and of course I'll leave it running another couple of days, but
I'm increasingly sure that it works as you intended: I agree that

mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch

should go to Linus as soon as convenient. Though I think the commit
message needs something a bit stronger than "Quite annoying though".
Maybe add a line:

Fixes serious hang under load, observed repeatedly on 4.11-rc.

Thanks!
Hugh

2017-04-07 16:39:42

by Michal Hocko

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
[...]
> 24 hours so far, and with a clean /var/log/messages. Not conclusive
> yet, and of course I'll leave it running another couple of days, but
> I'm increasingly sure that it works as you intended: I agree that
>
> mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
>
> should go to Linus as soon as convenient. Though I think the commit
> message needs something a bit stronger than "Quite annoying though".
> Maybe add a line:
>
> Fixes serious hang under load, observed repeatedly on 4.11-rc.

Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
update the changelog and send it to Linus once I've got your final go.

Thanks!

--
Michal Hocko
SUSE Labs

2017-04-07 16:58:42

by Hugh Dickins

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Fri, 7 Apr 2017, Michal Hocko wrote:
> On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
> [...]
> > 24 hours so far, and with a clean /var/log/messages. Not conclusive
> > yet, and of course I'll leave it running another couple of days, but
> > I'm increasingly sure that it works as you intended: I agree that
> >
> > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
> >
> > should go to Linus as soon as convenient. Though I think the commit
> > message needs something a bit stronger than "Quite annoying though".
> > Maybe add a line:
> >
> > Fixes serious hang under load, observed repeatedly on 4.11-rc.
>
> Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
> update the changelog and send it to Linus once I've got your final go.

I don't know akpm's timetable, but your fix being more than a two-liner,
I think it would be better if it could get into rc6, than wait another
week for rc7, just in case others then find problems with it. So I
think it's safer *not* to wait for my final go, but proceed on the
assumption that it will follow a day later.

Hugh

2017-04-07 17:29:30

by Michal Hocko

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Fri 07-04-17 09:58:17, Hugh Dickins wrote:
> On Fri, 7 Apr 2017, Michal Hocko wrote:
> > On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
> > [...]
> > > 24 hours so far, and with a clean /var/log/messages. Not conclusive
> > > yet, and of course I'll leave it running another couple of days, but
> > > I'm increasingly sure that it works as you intended: I agree that
> > >
> > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
> > >
> > > should go to Linus as soon as convenient. Though I think the commit
> > > message needs something a bit stronger than "Quite annoying though".
> > > Maybe add a line:
> > >
> > > Fixes serious hang under load, observed repeatedly on 4.11-rc.
> >
> > Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
> > update the changelog and send it to Linus once I've got your final go.
>
> I don't know akpm's timetable, but your fix being more than a two-liner,
> I think it would be better if it could get into rc6, than wait another
> week for rc7, just in case others then find problems with it. So I
> think it's safer *not* to wait for my final go, but proceed on the
> assumption that it will follow a day later.

Fair enough. Andrew, could you update the changelog of
mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
and send it to Linus along with
mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch before rc6?

I would add your Tested-by, Hugh, but I guess you want to give your testing
more time before feeling comfortable giving it.
---
mm: move pcp and lru-pcp draining into single wq

We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
per cpu lru caches. This seems more than necessary because both can run
on a single WQ. Neither blocks on locks requiring a memory allocation
nor performs any allocations itself. We will save one rescuer thread
this way.

On the other hand drain_all_pages() queues work on the system wq, which
doesn't have a rescuer, and so this depends on memory allocation
succeeding (when all workers are stuck allocating and new ones cannot be
created). Initially
we thought this would be more of a theoretical problem but Hugh Dickins
has reported:
: 4.11-rc has been giving me hangs after hours of swapping load. At
: first they looked like memory leaks ("fork: Cannot allocate memory");
: but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
: before looking at /proc/meminfo one time, and the stat_refresh stuck
: in D state, waiting for completion of flush_work like many kworkers.
: kthreadd waiting for completion of flush_work in drain_all_pages().

This worker should be using WQ_RECLAIM as well in order to guarantee
a forward progress. We can reuse the same one as for lru draining and
vmstat.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 0ccce3b92421 ("mm, page_alloc: drain per-cpu pages from workqueue context")
Signed-off-by: Michal Hocko <[email protected]>
Suggested-by: Tetsuo Handa <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
--
Michal Hocko
SUSE Labs

2017-04-07 18:46:34

by Hugh Dickins

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Fri, 7 Apr 2017, Michal Hocko wrote:
> On Fri 07-04-17 09:58:17, Hugh Dickins wrote:
> > On Fri, 7 Apr 2017, Michal Hocko wrote:
> > > On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
> > > [...]
> > > > 24 hours so far, and with a clean /var/log/messages. Not conclusive
> > > > yet, and of course I'll leave it running another couple of days, but
> > > > I'm increasingly sure that it works as you intended: I agree that
> > > >
> > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
> > > >
> > > > should go to Linus as soon as convenient. Though I think the commit
> > > > message needs something a bit stronger than "Quite annoying though".
> > > > Maybe add a line:
> > > >
> > > > Fixes serious hang under load, observed repeatedly on 4.11-rc.
> > >
> > > Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
> > > update the changelog and send it to Linus once I've got your final go.
> >
> > I don't know akpm's timetable, but your fix being more than a two-liner,
> > I think it would be better if it could get into rc6, than wait another
> > week for rc7, just in case others then find problems with it. So I
> > think it's safer *not* to wait for my final go, but proceed on the
> > assumption that it will follow a day later.
>
> Fair enough. Andrew, could you update the changelog of
> mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> and send it to Linus along with
> mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch before rc6?
>
> I would add your Tested-by, Hugh, but I guess you want to give your testing
> more time before feeling comfortable giving it.

Yes, fair enough: at the moment it's just
Half-Tested-by: Hugh Dickins <[email protected]>
and I hope to take the Half- off in about 21 hours.
But I certainly wouldn't mind if it found its way to Linus without my
final seal of approval.

> ---
> mm: move pcp and lru-pcp draining into single wq
>
> We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
> vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
> per cpu lru caches. This seems more than necessary because both can run
> on a single WQ. Neither blocks on locks requiring a memory allocation
> nor performs any allocations itself. We will save one rescuer thread
> this way.
>
> On the other hand drain_all_pages() queues work on the system wq, which
> doesn't have a rescuer, and so this depends on memory allocation
> succeeding (when all workers are stuck allocating and new ones cannot be
> created). Initially
> we thought this would be more of a theoretical problem but Hugh Dickins
> has reported:
> : 4.11-rc has been giving me hangs after hours of swapping load. At
> : first they looked like memory leaks ("fork: Cannot allocate memory");
> : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> : before looking at /proc/meminfo one time, and the stat_refresh stuck
> : in D state, waiting for completion of flush_work like many kworkers.
> : kthreadd waiting for completion of flush_work in drain_all_pages().
>
> This worker should be using WQ_RECLAIM as well in order to guarantee
> a forward progress. We can reuse the same one as for lru draining and
> vmstat.
>
> Link: http://lkml.kernel.org/r/[email protected]
> Fixes: 0ccce3b92421 ("mm, page_alloc: drain per-cpu pages from workqueue context")
> Signed-off-by: Michal Hocko <[email protected]>
> Suggested-by: Tetsuo Handa <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> --
> Michal Hocko
> SUSE Labs

2017-04-08 17:04:40

by Hugh Dickins

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Fri, 7 Apr 2017, Hugh Dickins wrote:
> On Fri, 7 Apr 2017, Michal Hocko wrote:
> > On Fri 07-04-17 09:58:17, Hugh Dickins wrote:
> > > On Fri, 7 Apr 2017, Michal Hocko wrote:
> > > > On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
> > > > [...]
> > > > > 24 hours so far, and with a clean /var/log/messages. Not conclusive
> > > > > yet, and of course I'll leave it running another couple of days, but
> > > > > I'm increasingly sure that it works as you intended: I agree that
> > > > >
> > > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
> > > > >
> > > > > should go to Linus as soon as convenient. Though I think the commit
> > > > > message needs something a bit stronger than "Quite annoying though".
> > > > > Maybe add a line:
> > > > >
> > > > > Fixes serious hang under load, observed repeatedly on 4.11-rc.
> > > >
> > > > Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
> > > > update the changelog and send it to Linus once I've got your final go.
> > >
> > > I don't know akpm's timetable, but your fix being more than a two-liner,
> > > I think it would be better if it could get into rc6, than wait another
> > > week for rc7, just in case others then find problems with it. So I
> > > think it's safer *not* to wait for my final go, but proceed on the
> > > assumption that it will follow a day later.
> >
> > Fair enough. Andrew, could you update the changelog of
> > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > and send it to Linus along with
> > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch before rc6?
> >
> > I would add your Tested-by, Hugh, but I guess you want to give your testing
> > more time before feeling comfortable giving it.
>
> Yes, fair enough: at the moment it's just
> Half-Tested-by: Hugh Dickins <[email protected]>
> and I hope to take the Half- off in about 21 hours.
> But I certainly wouldn't mind if it found its way to Linus without my
> final seal of approval.

48 hours and still going well: I declare it good, and thanks to Andrew,
Linus has ce612879ddc7 "mm: move pcp and lru-pcp draining into single wq"
already in for rc6.

Hugh

2017-04-08 18:09:19

by Mel Gorman

Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Sat, Apr 08, 2017 at 10:04:20AM -0700, Hugh Dickins wrote:
> On Fri, 7 Apr 2017, Hugh Dickins wrote:
> > On Fri, 7 Apr 2017, Michal Hocko wrote:
> > > On Fri 07-04-17 09:58:17, Hugh Dickins wrote:
> > > > On Fri, 7 Apr 2017, Michal Hocko wrote:
> > > > > On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
> > > > > [...]
> > > > > > 24 hours so far, and with a clean /var/log/messages. Not conclusive
> > > > > > yet, and of course I'll leave it running another couple of days, but
> > > > > > I'm increasingly sure that it works as you intended: I agree that
> > > > > >
> > > > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
> > > > > >
> > > > > > should go to Linus as soon as convenient. Though I think the commit
> > > > > > message needs something a bit stronger than "Quite annoying though".
> > > > > > Maybe add a line:
> > > > > >
> > > > > > Fixes serious hang under load, observed repeatedly on 4.11-rc.
> > > > >
> > > > > Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
> > > > > update the changelog and send it to Linus once I've got your final go.
> > > >
> > > > I don't know akpm's timetable, but your fix being more than a two-liner,
> > > > I think it would be better if it could get into rc6, than wait another
> > > > week for rc7, just in case others then find problems with it. So I
> > > > think it's safer *not* to wait for my final go, but proceed on the
> > > > assumption that it will follow a day later.
> > >
> > > Fair enough. Andrew, could you update the changelog of
> > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > and send it to Linus along with
> > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch before rc6?
> > >
> > > I would add your Tested-by, Hugh, but I guess you want to give your testing
> > > more time before feeling comfortable giving it.
> >
> > Yes, fair enough: at the moment it's just
> > Half-Tested-by: Hugh Dickins <[email protected]>
> > and I hope to take the Half- off in about 21 hours.
> > But I certainly wouldn't mind if it found its way to Linus without my
> > final seal of approval.
>
> 48 hours and still going well: I declare it good, and thanks to Andrew,
> Linus has ce612879ddc7 "mm: move pcp and lru-pcp draining into single wq"
> already in for rc6.
>

Excellent, thanks for that testing.

--
Mel Gorman
SUSE Labs

2017-04-13 17:56:30

by Andrea Arcangeli

Subject: Re: Is it safe for kthreadd to drain_all_pages?

Hello,

On Sat, Apr 08, 2017 at 07:09:10PM +0100, Mel Gorman wrote:
> On Sat, Apr 08, 2017 at 10:04:20AM -0700, Hugh Dickins wrote:
> > On Fri, 7 Apr 2017, Hugh Dickins wrote:
> > > On Fri, 7 Apr 2017, Michal Hocko wrote:
> > > > On Fri 07-04-17 09:58:17, Hugh Dickins wrote:
> > > > > On Fri, 7 Apr 2017, Michal Hocko wrote:
> > > > > > On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
> > > > > > [...]
> > > > > > > 24 hours so far, and with a clean /var/log/messages. Not conclusive
> > > > > > > yet, and of course I'll leave it running another couple of days, but
> > > > > > > I'm increasingly sure that it works as you intended: I agree that
> > > > > > >
> > > > > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > > > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
> > > > > > >
> > > > > > > should go to Linus as soon as convenient. Though I think the commit
> > > > > > > message needs something a bit stronger than "Quite annoying though".
> > > > > > > Maybe add a line:
> > > > > > >
> > > > > > > Fixes serious hang under load, observed repeatedly on 4.11-rc.
> > > > > >
> > > > > > Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
> > > > > > update the changelog and send it to Linus once I've got your final go.
> > > > >
> > > > > I don't know akpm's timetable, but your fix being more than a two-liner,
> > > > > I think it would be better if it could get into rc6, than wait another
> > > > > week for rc7, just in case others then find problems with it. So I
> > > > > think it's safer *not* to wait for my final go, but proceed on the
> > > > > assumption that it will follow a day later.
> > > >
> > > > Fair enough. Andrew, could you update the changelog of
> > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > > and send it to Linus along with
> > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch before rc6?
> > > >
> > > > I would add your Tested-by, Hugh, but I guess you want to give your testing
> > > > more time before feeling comfortable giving it.
> > >
> > > Yes, fair enough: at the moment it's just
> > > Half-Tested-by: Hugh Dickins <[email protected]>
> > > and I hope to take the Half- off in about 21 hours.
> > > But I certainly wouldn't mind if it found its way to Linus without my
> > > final seal of approval.
> >
> > 48 hours and still going well: I declare it good, and thanks to Andrew,
> > Linus has ce612879ddc7 "mm: move pcp and lru-pcp draining into single wq"
> > already in for rc6.
> >
>
> Excellent, thanks for that testing.

Not questioning the fixes for this, but just offering another possible
explanation for any hangs experienced in v4.11-rc where
drain_local_pages_wq never ran.

If you use i915, drain_local_pages_wq would also have hung under load,
because i915 deadlocks the system_wq; so if drain_local_pages_wq is
queued on it, it'll get stuck as well.

Not queuing drain_local_pages on the system_wq will prevent it from
hanging, but i915 would still hang the system_wq, leading to a full hang
eventually.

BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 45s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=4/256
in-flight: 5217:__i915_gem_free_work
pending: __i915_gem_free_work, drain_local_pages_wq BAR(2), wait_rcu_exp_gp

The deadlock materializes between __i915_gem_free_work and
wait_rcu_exp_gp (the shrinker waits for wait_rcu_exp_gp with shared_mutex
held 100% of the time, and sometimes __i915_gem_free_work will be
pending and stuck in mutex_lock(&shared_mutex)).

drain_local_pages_wq is not involved, but it just happens to be
queued on a deadlocked system_wq, so it will appear to hang as well.

This got fixed in upstream commit
c053b5a506d3afc038c485a86e3d461f6c7fb207, with a patch that I consider
inefficient, as it calls synchronize_rcu_expedited in places where even
the older code never attempted to, because it's unnecessary
(i915_gem_shrinker_count is an example). It'd be nice to optimize that
away, at least from i915_gem_shrinker_count, before v4.11 final.

Despite the reduced performance compared to my fix, it will solve the
reproducible hang, as it is equivalent to my fix in that respect.

In any case synchronize_rcu_expedited remains unsafe in the shrinker
as we discussed recently (lockdep isn't capable of warning about it).

I suggested making two separate WQ_MEM_RECLAIM workqueues, one for the
RCU wait_rcu_exp_gp work, one for __i915_gem_free_work. The former would
allow calling synchronize_* in PF_MEMALLOC context (i.e. the i915
shrinker). The latter would allow calling
flush_work(&__i915_gem_free_work_wq) in PF_MEMALLOC context (i.e. the
i915 shrinker) so that when the i915 shrinker returns, all memory has
been freed.

Chris (CC'ed) suggested dropping all RCU/flush_work synchronization and
letting the system free the memory when it can. That is likely to work,
but the larger the system, the less frequent the quiescent points will
be. Not waiting for the i915 memory to be freed before returning from
the shrinker is clearly simpler, and a few-line change; it just makes
things run a bit more by luck. It would be faster overall though.

The assumption that the VM reclaim code will take care of serializing on
quiescent points or system_wq runs by itself is probably not correct, as
it currently can't. If the i915 shrinker can't serialize and throttle on
that, the reclaim code can't do so either; it's still the same
problematic PF_MEMALLOC context after all. Only kswapd actually could,
because unlike direct reclaim, it sets PF_MEMALLOC but cannot be holding
any random lock when it calls reclaim.

Comments welcome.

Thanks,
Andrea