Date: Sat, 12 Dec 2015 12:00:32 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: mhocko@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
        torvalds@linux-foundation.org, rientjes@google.com, oleg@redhat.com,
        kwalker@redhat.com, cl@linux.com, akpm@linux-foundation.org,
        vdavydov@parallels.com, skozina@redhat.com, mgorman@suse.de,
        riel@redhat.com, arekm@maven.pl
Subject: Re: [PATCH v4] mm,oom: Add memory allocation watchdog kernel thread.
Message-ID: <20151212170032.GB7107@cmpxchg.org>
References: <201512130033.ABH90650.FtFOMOFLVOJHQS@I-love.SAKURA.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201512130033.ABH90650.FtFOMOFLVOJHQS@I-love.SAKURA.ne.jp>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org

On Sun, Dec 13, 2015 at 12:33:04AM +0900, Tetsuo Handa wrote:
> +Currently, when something went wrong inside memory allocation request,
> +the system will stall with either 100% CPU usage (if memory allocating
> +tasks are doing busy loop) or 0% CPU usage (if memory allocating tasks
> +are waiting for file data to be flushed to storage).
> +But /proc/sys/kernel/hung_task_warnings is not helpful because memory
> +allocating tasks unlikely sleep in uninterruptible state for
> +/proc/sys/kernel/hung_task_timeout_secs seconds.

Yes, this is very annoying. Other tasks in the system get dumped out
as they are blocked for too long, but not the allocating task itself
as it's busy looping.

That being said, I'm not entirely sure why we need a daemon to do this,
which then requires us to duplicate allocation state to task_struct.
There is no scenario where the allocating task is not moving at all
anymore, right? So can't we dump the allocation state from within the
allocator and leave the rest to the hung task detector?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 05ef7fb..fbfc581 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3004,6 +3004,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned int nr_retries = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3033,6 +3034,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 retry:
+	if (++nr_retries % 1000 == 0)
+		warn_alloc_failed(gfp_mask, order, "Potential GFP deadlock\n");
+
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
Basing it on nr_retries alone might be too crude and take too long
when each cycle spends time waiting for IO. However, if that is a
problem we can make it time-based instead, like your memalloc_timer,
to catch tasks that spend too much time in a single alloc attempt.
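
To make the time-based idea concrete, here is a rough userspace sketch,
not kernel code. The struct and function names (stall_tracker,
stall_tracker_init, stall_tracker_retry) are hypothetical and exist only
for illustration, and timestamps are passed in as plain milliseconds. In
the allocator itself this would presumably be driven by jiffies and
time_after() instead, but that is an assumption, not something I tested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-allocation stall tracker; not a kernel structure. */
struct stall_tracker {
	uint64_t start_ms;       /* when the allocation attempt began */
	uint64_t next_warn_ms;   /* earliest time the next warning is due */
	uint64_t interval_ms;    /* minimum spacing between warnings */
	unsigned int nr_retries; /* retries seen so far, for the report */
};

/* Arm the tracker when the slowpath is entered. */
static void stall_tracker_init(struct stall_tracker *t, uint64_t now_ms,
			       uint64_t interval_ms)
{
	assert(interval_ms > 0);
	t->start_ms = now_ms;
	t->next_warn_ms = now_ms + interval_ms;
	t->interval_ms = interval_ms;
	t->nr_retries = 0;
}

/*
 * Call once per pass through the retry label. Returns true when a
 * warning is due, i.e. at least interval_ms has elapsed since the
 * attempt started or since the previous warning.
 */
static bool stall_tracker_retry(struct stall_tracker *t, uint64_t now_ms)
{
	t->nr_retries++;
	if (now_ms < t->next_warn_ms)
		return false;
	t->next_warn_ms = now_ms + t->interval_ms;
	return true;
}
```

Unlike the plain retry counter, this fires after a fixed amount of wall
time regardless of whether each cycle spins quickly or sleeps on IO, and
it still needs no state outside the allocating task's own stack.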

> +		start_memalloc_timer(alloc_mask, order);
>  		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
> +		stop_memalloc_timer(alloc_mask);