Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S934248AbdDGQZu (ORCPT ); Fri, 7 Apr 2017 12:25:50 -0400
Received: from mail-pg0-f51.google.com ([74.125.83.51]:36465 "EHLO
        mail-pg0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S932598AbdDGQZo (ORCPT );
        Fri, 7 Apr 2017 12:25:44 -0400
Date: Fri, 7 Apr 2017 09:25:33 -0700 (PDT)
From: Hugh Dickins
X-X-Sender: hugh@eggly.anvils
To: Andrew Morton, Mel Gorman, Michal Hocko
cc: Hugh Dickins, Tejun Heo, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Is it safe for kthreadd to drain_all_pages?
In-Reply-To:
Message-ID:
References: <20170406130614.a6ygueggpwseqysd@techsingularity.net>
User-Agent: Alpine 2.11 (LSU 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2345
Lines: 55

On Thu, 6 Apr 2017, Hugh Dickins wrote:
> On Thu, 6 Apr 2017, Mel Gorman wrote:
> > On Wed, Apr 05, 2017 at 01:59:49PM -0700, Hugh Dickins wrote:
> > > Hi Mel,
> > >
> > > I suspect that it's not safe for kthreadd to drain_all_pages();
> > > but I haven't studied flush_work() etc, so don't really know what
> > > I'm talking about: hoping that you will jump to a realization.
> > >
> >
> > You're right, it's not safe. If kthreadd is creating the workqueue
> > thread to do the drain and it'll recurse into itself.
> >
> > > 4.11-rc has been giving me hangs after hours of swapping load. At
> > > first they looked like memory leaks ("fork: Cannot allocate memory");
> > > but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> > > before looking at /proc/meminfo one time, and the stat_refresh stuck
> > > in D state, waiting for completion of flush_work like many kworkers.
> > > kthreadd waiting for completion of flush_work in drain_all_pages().
> > >
> >
> > It's asking itself to do work in all likelihood.
> >
> > > Patch below has been running well for 36 hours now:
> > > a bit too early to be sure, but I think it's time to turn to you.
> > >
> >
> > I think the patch is valid but like Michal, would appreciate if you
> > could run the patch he linked to see if it also side-steps the same
> > problem.
> >
> > Good spot!
>
> Thank you both for explanations, and direction to the two "drainging"
> patches. I've put those on to 4.11-rc5 (and double-checked that I've
> taken mine off), and set it going. Fine so far but much too soon to
> tell - mine did 56 hours with clean /var/log/messages before I switched,
> so I demand no less of Michal's :). I'll report back tomorrow and the
> day after (unless badness appears sooner once I'm home).

24 hours so far, and with a clean /var/log/messages. Not conclusive
yet, and of course I'll leave it running another couple of days, but
I'm increasingly sure that it works as you intended: I agree that

mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch

should go to Linus as soon as convenient. Though I think the commit
message needs something a bit stronger than "Quite annoying though".
Maybe add a line:

Fixes serious hang under load, observed repeatedly on 4.11-rc.

Thanks!
Hugh