Date: Thu, 18 May 2006 13:05:10 -0700
From: Andrew Morton <akpm@osdl.org>
To: Dave Peterson
Cc: linux-kernel@vger.kernel.org, nickpiggin@yahoo.com.au, pj@sgi.com,
	ak@suse.de, linux-mm@kvack.org, garlick@llnl.gov, mgrondona@llnl.gov
Subject: Re: [PATCH] mm: avoid unnecessary OOM kills
Message-Id: <20060518130510.29444628.akpm@osdl.org>
In-Reply-To: <200605181729.k4IHToRU025597@calaveras.llnl.gov>
References: <200605181729.k4IHToRU025597@calaveras.llnl.gov>

Dave Peterson wrote:
>
> Below is a 2.6.17-rc4-mm1 patch that fixes a problem where the OOM
> killer was unnecessarily killing system daemons in addition to
> memory-hogging user processes.  The patch fixes things so that the
> following assertion is satisfied:
>
>     If a failed attempt to allocate memory triggers the OOM killer,
>     then the failed attempt must have occurred _after_ any process
>     previously shot by the OOM killer has cleaned out its mm_struct.
>
> Thus we avoid situations where concurrent invocations of the OOM
> killer cause more processes to be shot than necessary to resolve
> the OOM condition.
>
> ...
>
> --- linux-2.6.17-rc4-mm1.orig/include/linux/swap.h	2006-05-17 22:31:38.000000000 -0700
> +++ linux-2.6.17-rc4-mm1/include/linux/swap.h	2006-05-17 22:33:54.000000000 -0700
> @@ -155,6 +156,29 @@
>  #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
>  
>  /* linux/mm/oom_kill.c */
> +extern volatile unsigned long oom_kill_in_progress;

This shouldn't be volatile.

> +/*
> + * Attempt to start an OOM kill operation.  Return 0 on success, or 1
> + * if an OOM kill is already in progress.
> + */
> +static inline int oom_kill_start(void)
> +{
> +	return test_and_set_bit(0, &oom_kill_in_progress);
> +}

Suggest this be called oom_kill_trystart().

> ===================================================================
> --- linux-2.6.17-rc4-mm1.orig/mm/oom_kill.c	2006-05-17 22:31:38.000000000 -0700
> +++ linux-2.6.17-rc4-mm1/mm/oom_kill.c	2006-05-17 22:33:54.000000000 -0700
> @@ -25,6 +25,8 @@
>  int sysctl_panic_on_oom;
>  /* #define DEBUG */
>  
> +volatile unsigned long oom_kill_in_progress = 0;

This shouldn't be initialised to zero.  The kernel zeroes bss at startup.
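IOW, something like this (an untested sketch which just rolls the three
nits above into one, with a matching oom_kill_finish() added for
completeness):

	/* mm/oom_kill.c */
	unsigned long oom_kill_in_progress;	/* bss, so starts out zero */

	/*
	 * Try to become the single active OOM killer.  Returns 0 on
	 * success, nonzero if an OOM kill is already in progress.
	 * test_and_set_bit() is atomic and implies the needed memory
	 * barriers, so no volatile is required.
	 */
	static inline int oom_kill_trystart(void)
	{
		return test_and_set_bit(0, &oom_kill_in_progress);
	}

	static inline void oom_kill_finish(void)
	{
		smp_mb__before_clear_bit();
		clear_bit(0, &oom_kill_in_progress);
	}

The trystart naming follows spin_trylock() and friends: the "try" makes
it obvious at the call site that the return value must be checked.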
> /**
>  * badness - calculate a numeric value for how bad this task has been
>  * @p: task struct of which task we should calculate
> @@ -260,27 +262,31 @@
>  	struct mm_struct *mm;
>  	task_t * g, * q;
>  
> +	task_lock(p);
>  	mm = p->mm;
>  
> -	/* WARNING: mm may not be dereferenced since we did not obtain its
> -	 * value from get_task_mm(p).  This is OK since all we need to do is
> -	 * compare mm to q->mm below.
> +	if (mm == NULL || mm == &init_mm) {
> +		task_unlock(p);
> +		return 1;
> +	}
> +
> +	set_bit(MM_FLAG_OOM_NOTIFY, &mm->flags);
> +	task_unlock(p);

Putting task_lock() in here would be a fairly obvious way to address the
fragility which that comment describes.  But I have a feeling that we
discussed that a couple of weeks ago and found a problem with it, didn't
we?

> ===================================================================
> --- linux-2.6.17-rc4-mm1.orig/mm/page_alloc.c	2006-05-17 22:31:38.000000000 -0700
> +++ linux-2.6.17-rc4-mm1/mm/page_alloc.c	2006-05-17 22:33:54.000000000 -0700
> @@ -910,6 +910,36 @@
>  	return 1;
>  }
>  
> +/* Try to allocate one more time before invoking the OOM killer. */
> +static struct page * oom_alloc(gfp_t gfp_mask, unsigned int order,
> +				struct zonelist *zonelist)
> +{
> +	struct page *page;
> +
> +	/* The use of oom_kill_start() below prevents parallel OOM kill
> +	 * operations.  This fixes a problem where the OOM killer was
> +	 * observed shooting system daemons in addition to memory-hogging
> +	 * user processes.
> +	 */
> +	if (oom_kill_start())
> +		return NULL;	/* previous OOM kill still in progress */
> +
> +	/* If we get this far, we _know_ that any previous OOM killer
> +	 * victim has cleaned out its mm_struct.  Therefore we should
> +	 * pick a victim to shoot if the allocation below fails.
> +	 */
> +	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
> +				zonelist, ALLOC_WMARK_HIGH | ALLOC_CPUSET);
> +
> +	if (page) {
> +		oom_kill_finish();	/* cancel OOM kill operation */
> +		return page;
> +	}
> +
> +	out_of_memory(zonelist, gfp_mask, order);
> +	return NULL;
> +}
> +
>  /*
>   * get_page_from_freeliest goes through the zonelist trying to allocate
>   * a page.
> @@ -1116,18 +1146,8 @@
>  		if (page)
>  			goto got_pg;
>  	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> -		/*
> -		 * Go through the zonelist yet one more time, keep
> -		 * very high watermark here, this is only to catch
> -		 * a parallel oom killing, we must fail if we're still
> -		 * under heavy pressure.
> -		 */
> -		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
> -			zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
> -		if (page)
> +		if ((page = oom_alloc(gfp_mask, order, zonelist)) != NULL)
>  			goto got_pg;
> -
> -		out_of_memory(zonelist, gfp_mask, order);
>  		goto restart;
>  	}

So if process A did a successful oom_kill_start(), and processes B, C, D
and E all get into difficulty as well, they'll go into busywait loops.
The cond_resched() will eventually save us from locking up, but it's a
bit unpleasant.

And I'm trying to work out where process A will run oom_kill_finish() if
`cancel' ends up being 0 in out_of_memory().  afaict, it doesn't, and the
oom-killer will be accidentally disabled?
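One way to avoid the busywait would be to put the losers to sleep until
the winner drops the bit.  Wholly untested sketch; the oom_wait
waitqueue is my invention here, not something in Dave's patch:

	static DECLARE_WAIT_QUEUE_HEAD(oom_wait);

	static inline void oom_kill_finish(void)
	{
		smp_mb__before_clear_bit();
		clear_bit(0, &oom_kill_in_progress);
		wake_up_all(&oom_wait);	/* let waiting allocators retry */
	}

	/* in oom_alloc(): */
	if (oom_kill_trystart()) {
		/*
		 * A previous OOM kill is still in progress: sleep rather
		 * than spin.  The timeout guards against a lost wakeup.
		 */
		wait_event_timeout(oom_wait,
				!test_bit(0, &oom_kill_in_progress), HZ);
		return NULL;	/* caller does `goto restart' as before */
	}

That keeps B, C, D and E off the CPU while A's victim exits, instead of
relying on cond_resched() to keep the box alive.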
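And whatever the answer to the `cancel' question is, the invariant needs
to be that every successful trystart is paired with exactly one
oom_kill_finish(), on every path.  Hypothetical sketch only: it assumes
out_of_memory() is changed to return whether it actually shot something,
which it doesn't do today:

	if (oom_kill_trystart())
		return NULL;	/* someone else is already shooting */

	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
				zonelist, ALLOC_WMARK_HIGH | ALLOC_CPUSET);
	if (page) {
		oom_kill_finish();	/* no kill needed after all */
		return page;
	}

	if (!out_of_memory(zonelist, gfp_mask, order))
		oom_kill_finish();	/* no victim selected: drop the bit */
	return NULL;

If out_of_memory() did select a victim then the bit is deliberately left
set and gets cleared later, when the victim's mm_struct is torn down.
But if it selected nothing, somebody has to clear the bit right here, or
the OOM killer stays disabled forever.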