Date: Thu, 6 Apr 2017 10:55:30 +0200
From: Michal Hocko
To: Hugh Dickins
Cc: Mel Gorman, Andrew Morton, Tejun Heo, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Is it safe for kthreadd to drain_all_pages?
Message-ID: <20170406085529.GF5497@dhcp22.suse.cz>

On Wed 05-04-17 13:59:49, Hugh Dickins wrote:
> Hi Mel,
>
> I suspect that it's not safe for kthreadd to drain_all_pages();
> but I haven't studied flush_work() etc, so don't really know what
> I'm talking about: hoping that you will jump to a realization.
>
> 4.11-rc has been giving me hangs after hours of swapping load. At
> first they looked like memory leaks ("fork: Cannot allocate memory");
> but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> before looking at /proc/meminfo one time, and the stat_refresh stuck
> in D state, waiting for completion of flush_work like many kworkers,
> and kthreadd waiting for completion of flush_work in drain_all_pages().
>
> But I only noticed that pattern later: originally I tried to bisect
> rc1 before rc2 came out, but underestimated how long to wait before
> declaring a stage good - I thought 12 hours, but would now say 2 days.
> Too late for bisection, but I suspect your drain_all_pages() changes.

Yes, this is a fallout from Mel's changes. I was about to say that this
was fixed by my follow-up patches, which moved the flushing to a single
WQ with a rescuer, but it seems that
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
didn't make it into Linus' tree. Could you re-test with this one?

While your change is obviously correct, I think the above should
address it as well, and it is more generic. If it works then I will ask
Andrew to send the above to Linus (along with its follow-up,
mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch).
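For completeness, the gist of that mmotm patch is roughly the
following (a simplified sketch, not the exact diff; identifiers follow
the patch but details may differ). The point is that WQ_MEM_RECLAIM
allocates a rescuer thread up front, so the per-cpu drain work can
still make forward progress when the system is too short on memory to
fork new kworkers - which is exactly the state kthreadd got stuck in:

	/*
	 * One dedicated workqueue for the pcp/lru drain work items.
	 * WQ_MEM_RECLAIM guarantees a rescuer thread, created at
	 * allocation time, so queued drains cannot deadlock waiting
	 * for a new kworker under memory pressure.
	 */
	static struct workqueue_struct *mm_percpu_wq;

	void __init init_mm_internals(void)
	{
		mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
					       WQ_MEM_RECLAIM, 0);
	}

	/* and in drain_all_pages(), roughly: */
	for_each_cpu(cpu, &cpus_with_pcps) {
		struct work_struct *work = per_cpu_ptr(&pcpu_drain, cpu);

		INIT_WORK(work, drain_local_pages_wq);
		/* queue on the rescuer-backed WQ, not the system one */
		queue_work_on(cpu, mm_percpu_wq, work);
	}
	for_each_cpu(cpu, &cpus_with_pcps)
		flush_work(per_cpu_ptr(&pcpu_drain, cpu));

-- 
Michal Hocko
SUSE Labs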