Date: Fri, 30 Nov 2012 19:45:20 -0500
From: Johannes Weiner
To: Thorsten Leemhuis
Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel, George Spelvin,
    Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
    Valdis Kletnieks, Jiri Slaby, Zdenek Kabelac, Bruno Wolff III, linux-mm,
    Linux Kernel Mailing List, John Ellson
Subject: Re: kswapd craziness in 3.7
Message-ID: <20121201004520.GK2301@cmpxchg.org>
In-Reply-To: <50B8A8E7.4030108@leemhuis.info>

Hi Thorsten,

On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
> /me wonders how to elegantly get out of his man-in-the-middle position

You control the mighty koji :-)

But seriously, this is very helpful, thank you!

John now also Cc'd directly.

> John was able to reproduce the problem quickly with a kernel that
> contained the patch from your mail. For details see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later
>
> He provided the informations there. Parts of it:
> /proc/vmstat while kswad0 at 100%cpu
> /proc/zoneinfo with kswapd0 at 100% cpu
> perf profile

Thanks.
I'm quoting the interesting bits in order of the cars on my possibly
derailing train of thought:

> pageoutrun 117729182
> allocstall 5

Okay, so kswapd is stupidly looping but it's still managing to do its
actual job; nobody is dropping into direct reclaim.

> pgsteal_kswapd_dma 1
> pgsteal_kswapd_normal 202106
> pgsteal_kswapd_high 36515
> pgsteal_kswapd_movable 0
> pgscan_kswapd_dma 1
> pgscan_kswapd_normal 203044
> pgscan_kswapd_high 40407
> pgscan_kswapd_movable 0

Does not seem excessive, so apparently it also does not overreclaim.

> Node 0, zone      DMA
>   pages free     1655
>         min      196
>         low      245
>         high     294
> Node 0, zone   Normal
>   pages free     186234
>         min      10953
>         low      13691
>         high     16429
> Node 0, zone  HighMem
>   pages free     8983
>         min      34
>         low      475
>         high     917

These are all well above their watermarks, yet kswapd is definitely
finding something wrong with one of these as it actually does drop
into the reclaim loop, so zone_balanced() must be returning false:

>     16.52%  kswapd0  [kernel.kallsyms]  [k] idr_get_next
>             |
>             --- idr_get_next
>                |
>                |--99.76%-- css_get_next
>                |          mem_cgroup_iter
>                |          |
>                |          |--50.49%-- shrink_zone
>                |          |          kswapd
>                |          |          kthread
>                |          |          ret_from_kernel_thread
>                |          |
>                |           --49.51%-- kswapd
>                |                     kthread
>                |                     ret_from_kernel_thread
>                 --0.24%-- [...]
>
>     11.23%  kswapd0  [kernel.kallsyms]  [k] prune_super
>             |
>             --- prune_super
>                |
>                |--86.74%-- shrink_slab
>                |          kswapd
>                |          kthread
>                |          ret_from_kernel_thread
>                |
>                 --13.26%-- kswapd
>                           kthread
>                           ret_from_kernel_thread

Spending so much time in shrink_zone and shrink_slab without
overreclaiming a zone, I would say that a) this always stays on the
DEF_PRIORITY and b) only loops on the DMA zone.  At DEF_PRIORITY, the
scan goal for file pages in the other zones would be > 0, for example.

As the DMA zone watermarks are fine, it must be the fragmentation
index that indicates a lack of memory.
Filling the 1655 free pages into the fragmentation index formula
indicates a lack of free memory when these 1655 pages are lumped
together in fewer than 9 page blocks.  Not unrealistic, I think: on my
desktop machine, the DMA zone's 3975 free pages are lumped together in
only 12 blocks.

But on my system, the DMA zone is either never used, in which case
there is always at least one page block available that could satisfy a
huge page allocation (fragmentation index == -1000), or the system
gets really close to OOM, at which point the DMA zone is highly
fragmented.  And keep in mind that if the priority level goes below
DEF_PRIORITY, as it does close to OOM, the unreclaimable DMA zone is
ignored anyway.  But the DMA zone here is just barely used:

> Node 0, zone      DMA
[...]
>     nr_slab_reclaimable 3
>     nr_slab_unreclaimable 1
[...]
>     nr_dirtied   315
>     nr_written   315

which could explain a fragmentation index that asks for more free
memory while the watermarks are fine.

Why this all loops: there is one more inconsistency where the
conditions for reclaim and the conditions for compaction contradict
each other: reclaim also does not consider the DMA zone balanced, but
it needs only 25% of the whole node to be balanced, while compaction
requires every single zone to be balanced individually.

So these strict per-zone checks for compaction at the end of
balance_pgdat() are likely to be the culprits that keep kswapd looping
forever on this machine, trying to balance DMA for compaction while
reclaim decides it has enough balanced memory in the node overall.

I think we can just remove them: whenever the compaction code is
reached, the reclaim code has already balanced at least 25% of the
memory available for the classzone so that it is suitable for
compaction.

Mel?  Rik?
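
The "fewer than 9 page blocks" figure above can be reproduced with a
small userspace sketch.  This is not kernel code: the arithmetic follows
my recollection of __fragmentation_index() in mm/vmstat.c and the
threshold test in compaction_suitable() from that era, and both the
allocation order (9, a 512-page huge page) and the threshold (500, the
default extfrag_threshold) are assumptions chosen because they match the
numbers in this mail.  The names frag_index() and compaction_skipped()
are made up for illustration.

```c
/* Userspace sketch of the fragmentation index arithmetic (assumed, see above). */

#define ASSUMED_EXTFRAG_THRESHOLD 500	/* assumption: default sysctl value */

/*
 * Returns -1000 if a suitable free block exists, otherwise a value in
 * 0..1000: towards 0 means "failure due to lack of memory", towards
 * 1000 means "failure due to fragmentation".
 */
static int frag_index(unsigned int order, unsigned long long free_pages,
		      unsigned long long free_blocks_total,
		      unsigned long long free_blocks_suitable)
{
	unsigned long long requested = 1ULL << order;

	if (!free_blocks_total)
		return 0;

	/* The index is only meaningful when the request would fail. */
	if (free_blocks_suitable)
		return -1000;

	return 1000 - (int)((1000 + free_pages * 1000 / requested) /
			    free_blocks_total);
}

/* Non-zero when the index says "not enough free memory to compact". */
static int compaction_skipped(int index)
{
	return index >= 0 && index <= ASSUMED_EXTFRAG_THRESHOLD;
}
```

With 1655 free pages and order 9, frag_index(9, 1655, 8, 0) gives 471,
which is at or below the threshold, so compaction would be skipped for
lack of memory; with 9 free blocks the result is 530 and compaction
would be considered suitable.  The watermark test that the real
compaction_suitable() performs first is left out here.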
---
From: Johannes Weiner
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due to
 individual uncompactable zones

When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the
node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about
when kswapd was desperately trying to balance tiny zones when all
bigger zones in the node had plenty of free memory.  Arguably, the
same should apply to compaction: if a significant part of the node is
balanced enough to run compaction, do not get hung up on that tiny
zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced).  Remove the individual zone checks
that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis
Signed-off-by: Johannes Weiner
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (COMPACTION_BUILD &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
-- 
1.7.11.7
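
To illustrate the contradiction the changelog describes, here is a
simplified userspace model of the two balance conditions.  It is only a
sketch: struct zone_model, the function names, and the zone sizes in
example_zones are hypothetical, and the real pgdat_balanced() only
counts zones up to the classzone index, which this model ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
	const char *name;
	unsigned long present_pages;
	bool balanced;	/* watermark met and compactable for the order */
};

/* Hypothetical sizes: a tiny DMA zone that never gets "in shape". */
static const struct zone_model example_zones[] = {
	{ "DMA",       4096, false },
	{ "Normal",  225280, true  },
	{ "HighMem", 294912, true  },
};

/* Reclaim's view: at least 25% of the node's pages sit in balanced zones. */
static bool node_balanced(const struct zone_model *zones, size_t nr)
{
	unsigned long present = 0, balanced = 0;

	for (size_t i = 0; i < nr; i++) {
		present += zones[i].present_pages;
		if (zones[i].balanced)
			balanced += zones[i].present_pages;
	}
	return balanced >= (present >> 2);
}

/* The removed checks' view: every single zone must be balanced. */
static bool every_zone_balanced(const struct zone_model *zones, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (!zones[i].balanced)
			return false;
	return true;
}

/* Reclaim has nothing to do, yet the per-zone check restarts the cycle. */
static bool would_loop(const struct zone_model *zones, size_t nr)
{
	return node_balanced(zones, nr) && !every_zone_balanced(zones, nr);
}
```

For example_zones, node_balanced() is true while every_zone_balanced()
is false, so would_loop() reports the situation the patch is meant to
get rid of.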