Message-ID: <51307583.2020006@gmail.com>
Date: Fri, 01 Mar 2013 17:31:47 +0800
From: Simon Jeons
To: Johannes Weiner
CC: dormando, Andrew Morton, Rik van Riel, Seiji Aguchi, Satoru Moriya,
    Randy Dunlap, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org",
    "lwoodman@redhat.com", "hughd@google.com", Mel Gorman
Subject: Re: [PATCH] add extra free kbytes tunable
References: <511EB5CB.2060602@redhat.com>
    <20130219152936.f079c971.akpm@linux-foundation.org>
    <20130222175634.GA4824@cmpxchg.org> <51307354.5000401@gmail.com>
In-Reply-To: <51307354.5000401@gmail.com>

On 03/01/2013 05:22 PM, Simon Jeons wrote:
> Hi Johannes,
>
> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
>> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>>> The problem is that adding this tunable will constrain future VM
>>>> implementations. We will forever need to at least retain the
>>>> pseudo-file. We will also need to make some effort to retain its
>>>> behaviour.
>>>>
>>>> It would of course be better to fix things so you don't need to tweak
>>>> VM internals to get acceptable behaviour.
>>> I sympathize with this. It's presently all that keeps us afloat though.
>>> I'll whine about it again later if nothing else pans out.
>>>
>>>> You said:
>>>>
>>>> : We have a server workload wherein machines with 100G+ of "free" memory
>>>> : (used by page cache), scattered but frequent random io reads from 12+
>>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
>>>> : in a few different ways.
>>>> :
>>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>>> : thousand).
>>>> :
>>>> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
>>>> : occasionally responds to by freeing up 40g+ of the pagecache all at once
>>>> : (!) while pausing the system (Argh).
>>>> :
>>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>>> : kernel to allocate massive amounts of memory for retransmission
>>>> : queues/etc, potentially along with buffered IO reads and (some, but not
>>>> : often a ton) of new allocations from an application. This paired with 2)
>>>> : can cause the box to stall for 15+ seconds.
>>>>
>>>> Can we prioritise these? 2) looks just awful - kswapd shouldn't just
>>>> go off and free 40G of pagecache. Do you know what's actually in that
>>>> pagecache? Large number of small files or small number of (very) large
>>>> files?
>>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>>> accessed via address. occasionally madvise (WILLNEED) applied to the
>>> address ranges before attempting to use them. There're a mix of other
>>> files but nothing significant. The mmap's are READONLY and writes are done
>>> via pwrite-ish functions.
>>>
>>> I could use some guidance on inspecting/tracing the problem. I've been
>>> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
>>>
>>> - The amount of memory freed back up is either a percentage of total
>>> memory or a percentage of free memory.
>>> (a machine with 48G of ram will "only" free up an extra 4-7g)
>>>
>>> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
>>> is applied with the application down. As it fills it seems to get itself
>>> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
>>> still apply to a stable instance.
>>>
>>> - Protecting the DMA32 zone with something like "1 1 32" into
>>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>>
>>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
>>> hundred thousand pages before finding anything it actually wants to
>>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>>> start. It can take up to 3 seconds before kswapd starts actually
>>> reclaiming pages.
>>>
>>> - So far as I can tell we're almost exclusively using 0 order allocations.
>>> THP is disabled.
>>>
>>> There's not much dirty memory involved. It's not flushing out writes while
>>> reclaiming, it just kills off massive amount of cached memory.
>> Mapped file pages have to get scanned twice before they are reclaimed
>> because we don't have enough usage information after the first scan.
>
> It seems that just VM_EXEC mapped file pages are protected.
> Issue in page reclaim subsystem:
> static inline int page_is_file_cache(struct page *page)
> {
>         return !PageSwapBacked(page);
> }
> AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and
> be cleaned if removed from swap cache. So anonymous pages which are
> reclaimed and add to swap cache won't have this flag, then they will
> be treated as
s/are/aren't
> file backed pages? Is it buggy? In function __add_to_swap_cache if
> add to radix tree successfully will result in increase NR_FILE_PAGES,
> why?
>>
>> In your case, when you start this workload after a fresh boot or
>> dropping the caches, there will be 48G of mapped file pages that have
>> never been scanned before and that need to be looked at twice.
>>
>> Unfortunately, if kswapd does not make progress (and it won't for some
>> time at first), it will scan more and more aggressively with
>
> Why kswapd does not make progress for some time at first?
>
>> increasing scan priority. And when the 48G of pages are finally
>> cycled, kswapd's scan window is a large percentage of your machine's
>> memory, and it will free every single page in it.
>>
>> I think we should think about capping kswapd zone reclaim cycles just
>> as we do for direct reclaim. It's a little ridiculous that it can run
>> unbounded and reclaim every page in a zone without ever checking back
>> against the watermark. We still increase the scan window evenly when
>> we don't make forward progress, but we are more carefully inching zone
>> levels back toward the watermarks.
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c4883eb..8a4c446 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>                  .may_unmap = 1,
>>                  .may_swap = 1,
>>                  /*
>> -                 * kswapd doesn't want to be bailed out while reclaim. because
>> -                 * we want to put equal scanning pressure on each zone.
>> +                 * Even kswapd zone scans want to be bailed out after
>> +                 * reclaiming a good chunk of pages. It will just
>> +                 * come back if the watermarks are still not met.
>>                   */
>> -                .nr_to_reclaim = ULONG_MAX,
>> +                .nr_to_reclaim = SWAP_CLUSTER_MAX,
>>                  .order = order,
>>                  .target_mem_cgroup = NULL,
>>          };
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .