Date: Wed, 23 Jun 2004 17:03:08 -0700
From: William Lee Irwin III <wli@holomorphy.com>
To: Andrew Morton <akpm@osdl.org>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST
Message-ID: <20040624000308.GK1552@holomorphy.com>
Mail-Followup-To: William Lee Irwin III <wli@holomorphy.com>,
	Andrew Morton <akpm@osdl.org>, linux-kernel@vger.kernel.org
References: <0406231407.HbLbJbXaHbKbWa5aJb1a4aKb0a3aKb1a0a2aMbMbYa3aLbMb3aJbWaJbXaMbLb1a342@holomorphy.com> <20040623151659.70333c6d.akpm@osdl.org> <20040623223146.GG1552@holomorphy.com> <20040623153758.40e3a865.akpm@osdl.org> <20040623230730.GJ1552@holomorphy.com> <20040623163819.013c8585.akpm@osdl.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20040623163819.013c8585.akpm@osdl.org>
User-Agent: Mutt/1.5.5.1+cvs20040105i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4256
Lines: 78

William Lee Irwin III <wli@holomorphy.com> wrote:
>> It's unclear which zones must be checked for this to be of use. In
>> general, suppose one determines all possible zones from which a given
>> allocation may be satisfied (this still assumes out_of_memory() is
>> passed a gfp_mask, and then discovers ->all_unreclaimable. It is still
>> not possible to determine whether nr_swap_pages is relevant.

On Wed, Jun 23, 2004 at 04:38:19PM -0700, Andrew Morton wrote:
> I don't think it _is_ relevant.  Wev'e scanned the crap out of all the
> eligible zones, found nothing swappable outable or otherwise reclaimable. 
> That's as good a definition of oom as you're likely to get.  It takes care
> of mlocked user memory too.

The actual net effect of all this is blowing away if (nr_swap_pages > 0)
for __GFP_WIRED allocations. Removing those 2 lines (and the one line of
whitespace next to it) will pass the test I observed failure in. It's a
judgment call as to whether it's beneficial in general, as it does
insulate userspace somewhat from needing to wait for slow IO being the
ostensible cause of the allocation failure. RedHat vendor kernels have
removed the check entirely, which may be considered to be evidence of
the approach's viability.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> There are larger semantic questions also. For instance, the runs that
>> deadlocked were with non-overcommit enabled (that is, I left the sysctl
>> /proc/sys/vm/overcommit_memory == 0 after boot). The intention was to
>> provide best effort unfailability to userspace allocations by detecting
>> insufficient resources up-front and returning MAP_FAILED from mmap() and
>> so on. There is a current hole in this, which is that pinned kernel
>> memory allocations aren't subtracted from the pool of memory considered
>> to be available to userspace.

On Wed, Jun 23, 2004 at 04:38:19PM -0700, Andrew Morton wrote:
> The overcommit logic doesn't need to care about pinned kernel allocations. 
> All it cares about is reclaimable and free pages.  Its calculation of what
> is reclaimable is a bit wonky because of ramfs/sysfs/ramdisk pages and
> metadata, and because of mlock.

Hmm. That sort of thing is what I had in mind by saying "pinned kernel
allocations". What did it sound like I had in mind?


On Wed, Jun 23, 2004 at 04:38:19PM -0700, Andrew Morton wrote:
> But for this problem, I do suspect that going oom if all the eligible zones
> are pegged out should suffice.  It is certainly accurate.
> It could be that the ->all_unreclaimable logic has rotted over the ages and
> may ned some attention - I haven't explicitly checked it in a year or more.
> Also there's some problem at present wherein the oomkiller takes 30 or
> more seconds to kick in and do its thing, which I need to look at.  This
> happens when there's no swap online.  It might also happen when swapspace
> is full - I didn't check.

The timeout semantics are crusty. 10 failures 1s or more apart must be
seen, but with no delay of more than 5s between failures. If the average
delay between attempts more than 1s from the last successful attempt
that incremented the counter was 3s, that would make for a 30s delay
before OOM killing. In the presence of e.g. delays waiting for IO and
so on preventing retries from happening rapidly, 30s is plausible. Also,
it's rather likely that the tasks targeted by the OOM killer are in
state TASK_UNINTERRUPTIBLE and so will never respond to the OOM kill
until whatever they were waiting for happens, which may require (you
guessed it) memory to occur.

I would prefer to migrate away from timeout-based semantics and eliminate
indefinite retry logic and out-of-context reclamation (which is the source
of reclamation failures having no context to which to return errors).
However, this also requires some method of recovering from or insulating
administrators from indefinite retry loops originating from userspace, and
better contexts to which to return failure than syslog daemons.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/