Date: Thu, 28 Apr 2011 12:19:47 +0800
From: Wu Fengguang
To: Andrew Morton
Cc: Minchan Kim, Dave Young, linux-mm, Linux Kernel Mailing List, Mel Gorman
Subject: Re: readahead and oom
Message-ID: <20110428041947.GA8761@localhost>
References: <20110426055521.GA18473@localhost> <20110426062535.GB19717@localhost> <20110426063421.GC19717@localhost> <20110426092029.GA27053@localhost> <20110426124743.e58d9746.akpm@linux-foundation.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="9amGYk9869ThD9tj"
Content-Disposition: inline
In-Reply-To: <20110426124743.e58d9746.akpm@linux-foundation.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

--9amGYk9869ThD9tj
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Apr 27, 2011 at 03:47:43AM +0800, Andrew Morton wrote:
> On Tue, 26 Apr 2011 17:20:29 +0800
> Wu Fengguang wrote:
>
> > Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
> >
> > readahead page allocations are completely optional. They are OK to
> > fail and in particular shall not trigger OOM on themselves.
>
> I have distinct recollections of trying this many years ago, finding
> that it caused problems then deciding not to do it. But I can't find
> an email trail and I don't remember the reasons :(

The most likely reason is page allocation failures even when there are
plenty of _global_ reclaimable pages.
> If the system is so stressed for memory that the oom-killer might get
> involved then the readahead pages may well be getting reclaimed before
> the application actually gets to use them. But that's just an aside.

Yes, when direct reclaim is working as expected, readahead thrashing
should happen long before NORETRY page allocation failures and OOM.

With that assumption I think it's OK to do this patch. As for
readahead, sporadic allocation failures are acceptable. But there is a
problem, see below.

> Ho hum. The patch *seems* good (as it did 5-10 years ago ;)) but there
> may be surprising side-effects which could be exposed under heavy
> testing. Testing which I'm sure hasn't been performed...

The NORETRY direct reclaim does tend to fail a lot more on concurrent
reclaims, where one task's reclaimed pages can be stolen by others
before it is able to get them.

        __alloc_pages_direct_reclaim()
        {
                did_some_progress = try_to_free_pages();

                // pages stolen by others

                page = get_page_from_freelist();
        }

Here are the tests to demonstrate this problem.
Out of 1000GB reads and page allocations,

test-ra-thrash.sh: read 1000 1G files interleaved in 1 single task:

        nr_alloc_fail 733

test-dd-sparse.sh: read 1000 1G files concurrently in 1000 tasks:

        nr_alloc_fail 11799

Thanks,
Fengguang
---
--- linux-next.orig/include/linux/mmzone.h	2011-04-27 21:58:27.000000000 +0800
+++ linux-next/include/linux/mmzone.h	2011-04-27 21:58:39.000000000 +0800
@@ -106,6 +106,7 @@ enum zone_stat_item {
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_ALLOC_FAIL,
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
--- linux-next.orig/mm/page_alloc.c	2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/page_alloc.c	2011-04-27 21:58:39.000000000 +0800
@@ -2176,6 +2176,8 @@ rebalance:
 	}
 
 nopage:
+	inc_zone_state(preferred_zone, NR_ALLOC_FAIL);
+	/* count_zone_vm_events(PGALLOCFAIL, preferred_zone, 1 << order); */
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		unsigned int filter = SHOW_MEM_FILTER_NODES;
 
--- linux-next.orig/mm/vmstat.c	2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/vmstat.c	2011-04-27 21:58:53.000000000 +0800
@@ -879,6 +879,7 @@ static const char * const vmstat_text[]
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
+	"nr_alloc_fail",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",

--9amGYk9869ThD9tj
Content-Type: application/x-sh
Content-Disposition: attachment; filename="test-dd-sparse.sh"

#!/bin/sh

mount /dev/sda7 /fs

for i in `seq 1000`
do
	truncate -s 1G /fs/sparse-$i
	dd if=/fs/sparse-$i of=/dev/null &
done

--9amGYk9869ThD9tj
Content-Type: application/x-sh
Content-Disposition: attachment; filename="test-ra-thrash.sh"

#!/bin/sh

mount /dev/sda7 /fs

for i in `seq 1000`
do
	truncate -s 1G /fs/sparse-$i
done

ra-thrash /fs/sparse-*
--9amGYk9869ThD9tj--
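
The nr_alloc_fail figures quoted in the message presumably come from /proc/vmstat, since the debug patch adds the counter name to vmstat_text[]. A minimal userspace sketch for sampling that counter might look like the following. The helper names vmstat_lookup() and vmstat_read() and the 16 KB buffer size are assumptions made for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text laid out like /proc/vmstat, that is,
 * one "name value" pair per line. Returns 0 and stores the value on
 * success, -1 if the counter is absent (for example on a kernel
 * without the nr_alloc_fail debug patch).
 */
int vmstat_lookup(const char *text, const char *name, unsigned long *value)
{
	size_t len;
	const char *line = text;

	assert(text && name && value);
	len = strlen(name);

	while (line && *line) {
		/* Require a space after the name so prefixes do not match. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%lu", value) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read a live counter; path would normally be "/proc/vmstat". */
int vmstat_read(const char *path, const char *name, unsigned long *value)
{
	char buf[16384];	/* assumed large enough for the whole file */
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return vmstat_lookup(buf, name, value);
}
```

The counter accumulates since boot, so one way to attribute failures to a single run is to call vmstat_read("/proc/vmstat", "nr_alloc_fail", &v) before and after a test script and subtract the two samples.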