Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone
From: Larry Woodman
To: Rik van Riel
Cc: akpm@linux-foundation.org, KOSAKI Motohiro, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, minchan.kim@gmail.com
In-Reply-To: <20091211164651.036f5340@annuminas.surriel.com>
Date: Mon, 14 Dec 2009 12:08:01 -0500
Message-Id: <1260810481.6666.13.camel@dhcp-100-19-198.bos.redhat.com>

On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:

Rik, the latest patch appears to have a problem, although I don't know
what the problem is yet.  When the system runs out of memory we see
thousands of runnable processes and 100% system time:

9420  2  29824  79856  62676  19564  0  0  0  0  8054  379  0  100  0  0  0
9420  2  29824  79368  62292  19564  0  0  0  0  8691  413  0  100  0  0  0
9421  1  29824  79780  61780  19820  0  0  0  0  8928  408  0  100  0  0  0

The system would not respond, so I don't know what's going on yet.  I'll
add debug code to figure out why it's in that state as soon as I get
access to the hardware.

Larry

> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways.  The trouble starts when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves.  We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
>
> Reported-by: Larry Woodman
> Signed-off-by: Rik van Riel
>
> ---
> v2:
> - fix typos in sysctl.c and vm.txt
> - move the code in sysctl.c out from under the ifdef
> - only __GFP_FS|__GFP_IO tasks can wait
>
>  Documentation/sysctl/vm.txt |   18 ++++++++++++++
>  include/linux/mmzone.h      |    4 +++
>  include/linux/swap.h        |    1 +
>  kernel/sysctl.c             |    7 +++++
>  mm/page_alloc.c             |    3 ++
>  mm/vmscan.c                 |   40 +++++++++++++++++++++++++++++++++
>  6 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..8bd1a96 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
>  - legacy_va_layout
>  - lowmem_reserve_ratio
>  - max_map_count
> +- max_zone_concurrent_reclaimers
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> @@ -278,6 +279,23 @@ The default value is 65536.
>
>  =============================================================
>
> +max_zone_concurrent_reclaimers:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously.  This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
> +
> +=============================================================
> +
>  memory_failure_early_kill:
>
>  Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
>  	/* Zone statistics */
>  	atomic_long_t	vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> +	/* Number of processes running page reclaim code on this zone. */
> +	atomic_t	concurrent_reclaimers;
> +	wait_queue_head_t	reclaim_wait;
> +
>  	/*
>  	 * prev_priority holds the scanning priority for this zone.  It is
>  	 * defined as the scanning priority at which we achieved our reclaim
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..661eec7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
> +extern int max_zone_concurrent_reclaimers;
>
>  #ifdef CONFIG_NUMA
>  extern int zone_reclaim_mode;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6ff0ae6..4ec17ed 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
>  		.extra2		= &one,
>  	},
>  #endif
> +	{
> +		.procname	= "max_zone_concurrent_reclaimers",
> +		.data		= &max_zone_concurrent_reclaimers,
> +		.maxlen		= sizeof(max_zone_concurrent_reclaimers),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec,
> +	},
>
>  /*
>   * NOTE: do not add new entries to this table unless you have read
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11ae66e..ca9cae1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>
>  		zone->prev_priority = DEF_PRIORITY;
>
> +		atomic_set(&zone->concurrent_reclaimers, 0);
> +		init_waitqueue_head(&zone->reclaim_wait);
> +
>  		zone_pcp_init(zone);
>  		for_each_lru(l) {
>  			INIT_LIST_HEAD(&zone->lru[l].list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..ecfe28c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -129,6 +130,17 @@ struct scan_control {
>  int vm_swappiness = 60;
>  long vm_total_pages;	/* The total number of pages which the VM controls */
>
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone.  Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time.  We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;
> +
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>
> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int noswap = 0;
>
> +	if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> +				max_zone_concurrent_reclaimers &&
> +				(sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> +				(__GFP_IO|__GFP_FS)) {
> +		/*
> +		 * Do not add to the lock contention if this zone has
> +		 * enough processes doing page reclaim already, since
> +		 * we would just make things slower.
> +		 */
> +		sleep_on(&zone->reclaim_wait);
> +
> +		/*
> +		 * If other processes freed enough memory while we waited,
> +		 * break out of the loop and go back to the allocator.
> +		 */
> +		if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> +					0, 0)) {
> +			wake_up(&zone->reclaim_wait);
> +			sc->nr_reclaimed += nr_to_reclaim;
> +			return;
> +		}
> +	}
> +
> +	atomic_inc(&zone->concurrent_reclaimers);
> +
>  	/* If we have no swap space, do not bother scanning anon pages. */
>  	if (!sc->may_swap || (nr_swap_pages <= 0)) {
>  		noswap = 1;
> @@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,
>  		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>
>  	throttle_vm_writeout(sc->gfp_mask);
> +
> +	atomic_dec(&zone->concurrent_reclaimers);
> +	wake_up(&zone->reclaim_wait);
>  }
>
>  /*
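For anyone following the thread, the throttle condition in the quoted shrink_zone() hunk can be modelled in userspace as below. This is only a sketch under stated assumptions, not the kernel code: struct zone_model, the MODEL_GFP_* bits, should_throttle() and can_skip_reclaim() are hypothetical names made up for illustration, and the watermark test is a plain comparison that ignores the allocation order and lowmem reserves that zone_watermark_ok() takes into account. It models the two decisions only (sleep or not, skip reclaim after wakeup or not) and does no sleeping or waking itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for __GFP_IO and __GFP_FS; the values are arbitrary. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

/* Simplified, hypothetical model of the per-zone state the patch touches. */
struct zone_model {
	int concurrent_reclaimers;	/* tasks currently reclaiming in this zone */
	long free_pages;		/* free pages in this zone */
	long low_wmark_pages;		/* low watermark for this zone */
};

/* Same default as the sysctl in the patch. */
int max_zone_concurrent_reclaimers = 8;

/*
 * Mirrors the condition guarding sleep_on() in the patch: only direct
 * reclaimers (not kswapd) that may do both IO and FS operations are
 * throttled, and only once the zone is over the limit.
 */
bool should_throttle(const struct zone_model *zone, bool is_kswapd,
		     unsigned int gfp_mask)
{
	const unsigned int need = MODEL_GFP_IO | MODEL_GFP_FS;

	assert(zone);
	return !is_kswapd &&
	       zone->concurrent_reclaimers > max_zone_concurrent_reclaimers &&
	       (gfp_mask & need) == need;
}

/*
 * Mirrors the check made after wakeup: if others freed enough memory
 * while we slept, the caller can return without reclaiming.
 * Assumption: reduced to "free pages above the low watermark".
 */
bool can_skip_reclaim(const struct zone_model *zone)
{
	assert(zone);
	return zone->free_pages > zone->low_wmark_pages;
}
```

Because the comparison against the limit is strict, a task arriving when the counter is exactly 8 is not throttled in this model, so up to nine tasks can be inside reclaim with the default setting. That follows from the condition as written in the quoted hunk.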