Date: Tue, 1 Dec 2015 14:25:45 +0300
From: Vladimir Davydov
To: Michal Hocko
CC: Vlastimil Babka, Andrew Morton, Johannes Weiner, Mel Gorman
Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated
Message-ID: <20151201112544.GB11488@esperanza>
In-Reply-To: <20151127150133.GI2493@dhcp22.suse.cz>

On Fri, Nov 27, 2015 at 04:01:33PM +0100, Michal Hocko wrote:
> On Fri 27-11-15 16:40:03, Vladimir Davydov wrote:
> > On Fri, Nov 27, 2015 at 01:50:05PM +0100, Michal Hocko wrote:
> > > On Thu 26-11-15 11:16:24, Vladimir Davydov wrote:
> [...]
> > > > Anyway, kthreads that use GFP_NOIO and/or mempool aren't safe
> > > > either, because it isn't an allocation context problem: the
> > > > reclaimer locks up not because it tries to take an fs/io lock
> > > > the caller holds, but because it waits for isolated pages to be
> > > > put back, which will never happen, since the processes that
> > > > isolated them depend on the kthread making progress. This is
> > > > purely a reclaimer heuristic, which kmalloc users are not
> > > > aware of.
> > > >
> > > > My point is that, in contrast to userspace processes, it is
> > > > dangerous to throttle kthreads in the reclaimer, because they
> > > > might be responsible for reclaimer progress (e.g. performing
> > > > writeback).
> > >
> > > Wouldn't it be better if your writeback kthread used
> > > PF_MEMALLOC/__GFP_MEMALLOC instead? It is in fact a reclaimer,
> > > so it wouldn't even get into the reclaim.
> >
> > The driver we use is similar to loop. It works as a proxy to the
> > fs it sits on top of. Allowing it to access emergency reserves
> > would deplete them quickly, just like in the case of plain loop.
>
> OK, I see. I thought it would be using only a very limited amount of
> memory for the writeback.

Even if it does, there's still a chance of a deadlock: even if the
kthread uses mempool + GFP_NOIO, it can go into direct reclaim and
start looping in too_many_isolated AFAIU. The probability of this
happening is much smaller, though.
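For reference, the loop in question looks roughly like this
(paraphrasing shrink_inactive_list() in mm/vmscan.c from memory; the
exact code differs between kernel versions). Note that only kswapd is
exempted, via the current_is_kswapd() check inside
too_many_isolated(); any other kthread doing direct reclaim spins
here:

	while (unlikely(too_many_isolated(zone, file, sc))) {
		/*
		 * Sleep and retry. Nothing puts the isolated pages
		 * back if their owners are themselves waiting on this
		 * task, which is how a writeback kthread gets stuck.
		 */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}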
> > The problem is not about our driver, in fact. I'm pretty sure one
> > can hit it when using memcg along with loop or dm-crypt, for
> > instance.
>
> I am not very familiar with either, but from a quick look the loop
> driver doesn't use mempools either. It simply relays the data to the
> underlying file and relies on the underlying fs to write all the
> pages, only preventing recursion by clearing GFP_FS and GFP_IO. So I
> am not really sure how we can guarantee forward progress. The
> GFP_NOFS allocation might loop inside the allocator endlessly, and
> so the writeback wouldn't make any progress. This doesn't seem to be
> only memcg specific. The global case would just replace the deadlock
> with a livelock. I certainly must be missing something here.

Yeah, I think you're right. If the loop kthread gets stuck in the
reclaimer, other processes may keep isolating, reclaiming, and then
dirtying the reclaimed pages, preventing the kthread from running and
cleaning memory, so that we might end up with all memory under
writeback and no reclaimable memory left for the kthread to run and
clean it. Due to the dirty limit this is unlikely to happen, but I'm
not sure.

OTOH, with legacy memcg there is no dirty limit, and we can isolate a
lot of pages (SWAP_CLUSTER_MAX = 512 now) per process and wait on
page writeback to complete before releasing them, which sounds bad.
And we can't just remove this wait_on_page_writeback() from
shrink_page_list(), otherwise OOM might be triggered prematurely.
Maybe we should put back rotated pages and release all reclaimed
pages before initiating the wait?

> > > There are way too many allocations done from kernel thread
> > > context to leave them all unthrottled (just look at worker
> > > threads).
> >
> > What about throttling them only once then?
>
> This still sounds way too broad to me, and I am not even sure it
> solves the problem. If anything, I think we really should make it
> specific only to those callers that are really required to make
> forward progress. What about PF_LESS_THROTTLE? NFS is already using
> this flag for a similar purpose, and we indeed do not throttle at a
> few places during the reclaim. So I would expect a
> current_may_throttle() check there, although I must confess I have
> no idea about the whole condition right now.

Yeah, thanks for the tip. I'll take a look.

Thanks,
Vladimir
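P.S. In case it helps, here is a rough sketch of what a writeback
kthread opting into PF_LESS_THROTTLE might look like. This is loosely
modeled on what nfsd does in fs/nfsd/vfs.c; the function name is made
up and it is not our driver's actual code. And for it to help with
too_many_isolated, the flag would of course also have to be honored
there, as you suggest:

	/* Hypothetical writeback kthread, not real driver code. */
	static int my_wb_thread(void *data)
	{
		/* Opt out of some reclaim throttling, like nfsd does. */
		current->flags |= PF_LESS_THROTTLE;

		while (!kthread_should_stop()) {
			/* ... relay dirty pages to the underlying fs ... */
		}

		current->flags &= ~PF_LESS_THROTTLE;
		return 0;
	}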