Date: Mon, 13 Mar 2017 16:46:28 +0100
From: Michal Hocko
To: Shakeel Butt
Cc: Johannes Weiner, Mel Gorman, Vlastimil Babka, Andrew Morton, Jia He,
	Hillf Danton, Linux MM, LKML
Subject: Re: [PATCH] mm: fix condition for throttle_direct_reclaim
Message-ID: <20170313154627.GU31518@dhcp22.suse.cz>
References: <20170310194620.5021-1-shakeelb@google.com>
	<20170313090206.GC31518@dhcp22.suse.cz>

On Mon 13-03-17 08:07:15, Shakeel Butt wrote:
> On Mon, Mar 13, 2017 at 2:02 AM, Michal Hocko wrote:
> > On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
> >> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
> >> number of unsuccessful iterations. Before going to sleep, the kswapd
> >> thread will unconditionally wake up all threads sleeping on
> >> pfmemalloc_wait. However the awoken threads will recheck the
> >> watermarks, wake the kswapd thread and sleep again on pfmemalloc_wait.
> >> There is a chance of continuous back and forth between kswapd and
> >> direct reclaiming threads if kswapd keeps failing, which defeats the
> >> purpose of adding a backoff mechanism to kswapd. So, add a
> >> kswapd_failures check to the throttle_direct_reclaim condition.
> >
> > I have to say I really do not like this. kswapd_failures shouldn't
> > really be checked outside of the kswapd context. The
> > pfmemalloc_watermark_ok/throttle_direct_reclaim logic is quite complex
> > even without putting another variable into it. I wish we would rather
> > replace this throttling with something else. Johannes had an idea to
> > throttle by the number of reclaimers.
>
> Do you suspect a race in accessing kswapd_failures in a non-kswapd
> context?

No, this is not about race conditions. It is more about the logic of the
code. kswapd_failures is private to the kswapd daemon. Direct reclaimers
shouldn't have any business in it - well, except resetting it.

> Please do let me know more about replacing this throttling.

The idea behind a different throttling would be to not allow too many
direct reclaimers on the same set of nodes/zones. Johannes would tell
you more.

> > Anyway, I am wondering whether we can hit this issue in
> > practice? Have you seen it happening or is this a result of code
> > review? I would assume that the !zone_reclaimable_pages check in
> > pfmemalloc_watermark_ok should help to some degree.
>
> Yes, I have seen this issue going on for more than one hour in my
> test. It was a simple test where a number of processes, in the
> presence of swap, try to allocate more memory than RAM.

This is anonymous memory, right?

> The number of processes is equal to the number of cores and they are
> pinned to individual cores. I am suspecting that the
> !zone_reclaimable_pages() check did not help.

Hmm, interesting! I would expect the OOM killer to trigger, but I guess
I see what is going on.
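To spell out the back and forth (a heavily simplified paraphrase of the
relevant vmscan.c paths - the function and wait queue names are real, the
surrounding structure is mine):

	/* direct reclaimer side: throttle_direct_reclaim(), paraphrased */
	if (!pfmemalloc_watermark_ok(pgdat)) {
		/*
		 * the watermark check itself wakes kswapd if it is asleep,
		 * and then we sleep until kswapd wakes us up again
		 */
		wait_event_killable(pgdat->pfmemalloc_wait,
				    pfmemalloc_watermark_ok(pgdat));
	}

	/*
	 * kswapd side: prepare_kswapd_sleep(), paraphrased - this runs
	 * even after kswapd_failures has hit MAX_RECLAIM_RETRIES
	 */
	if (waitqueue_active(&pgdat->pfmemalloc_wait))
		wake_up_all(&pgdat->pfmemalloc_wait);
	/* ... and kswapd goes back to sleep without any progress made */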
kswapd couldn't reclaim a single page and ran out of its kswapd_failures
attempts, while no direct reclaimer could reclaim a single page either,
until we reached the throttling point where we are basically livelocked,
because neither kswapd nor any of the direct reclaimers can make forward
progress. Although this sounds quite unlikely, I think it is quite
possible to happen. So we cannot really throttle _all_ direct reclaimers
when kswapd is out of the game, which I hadn't fully realized when
reviewing "mm: fix 100% CPU kswapd busyloop on unreclaimable nodes".

The simplest thing to do would be something like you have proposed and
not throttle if kswapd is out of the game:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bae698484e8e..d34b1afc781a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2791,6 +2791,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	int i;
 	bool wmark_ok;
 
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!managed_zone(zone))

I do not like this, as I've already said, but it would allow us to merge
"mm: fix 100% CPU kswapd busyloop on unreclaimable nodes" without too
many additional changes.

Another option would be to cap the waiting time the same way we do for
GFP_NOFS. Not ideal either, because I suspect we would just get herds of
direct reclaimers that way.

The best option would be to rethink the throttling and move it out of
the direct reclaim path somehow.

Thanks, and sorry for not spotting the potential lockup previously.
-- 
Michal Hocko
SUSE Labs
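PS: to make the "cap the waiting time" option more concrete, what I have
in mind (an untested sketch only, not a proper patch) is to give the
__GFP_FS path in throttle_direct_reclaim() the same bounded sleep that
the !__GFP_FS path (which is what GFP_NOFS callers hit) already uses:

	/*
	 * sleep for at most a second instead of waiting for a kswapd
	 * wakeup which might never come once kswapd has given up
	 */
	wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
			pfmemalloc_watermark_ok(pgdat), HZ);

The obvious downside is the one mentioned above - all the throttled
tasks would retry roughly at the same time.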