Date: Sun, 2 Dec 2007 22:19:25 +0100
From: Andi Kleen <andi@firstfloor.org>
To: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>, Arjan van de Ven <arjan@infradead.org>,
       linux-kernel@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>,
       Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [feature] automatically detect hung TASK_UNINTERRUPTIBLE tasks
Message-ID: <20071202211925.GA26414@one.firstfloor.org>
References: <20071201092037.GA32544@elte.hu> <p737ijwylet.fsf@bingen.suse.de> <20071202185945.GA25990@elte.hu> <20071202114152.3bf4332d@laptopd505.fenrus.org> <20071202200953.GA23994@one.firstfloor.org> <20071202202602.GA16480@elte.hu> <20071202204725.GA25891@one.firstfloor.org> <20071202211027.GA32282@elte.hu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20071202211027.GA32282@elte.hu>
User-Agent: Mutt/1.4.2.1i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2337
Lines: 51

On Sun, Dec 02, 2007 at 10:10:27PM +0100, Ingo Molnar wrote:
> what if you considered - just for a minute - the possibility of this 
> debug tool being the thing that actually animates developers to fix such 
> long delay bugs that have bothered users for almost a decade meanwhile?

Throwing frequent debugging messages for non-buggy cases will
just lead to people generally ignoring softlockups.

I don't think runtime instrumentation is the way to introduce
TASK_KILLABLE in general. The only way there is for people to go through
the source and identify places where it makes sense.

> 
> Until now users had little direct recourse to get such problems fixed. 
> (we had sysrq-t, but that included no real metric of how long a task was 

Actually task delay accounting can measure this now.  iirc someone
had a latencytop based on it already.

> blocked, so there was no direct link in the typical case and users had 
> no real reliable tool to express their frustration about unreasonable 
> delays.)
> 
> Now this changes: they get a "smoking gun" backtrace reported by the 
> kernel, and blamed on exactly the place that caused that unreasonable 
> delay. And it's not like the kernel breaks - at most 10 such messages 
> are reported per bootup.
> 
> We increase the delay timeout to say 300 seconds, and if the system is 
> under extremely high IO load then 120+ might be a reasonable delay, so 
> it's all tunable and runtime disable-able anyway. So if you _know_ that 
> you will see and tolerate such long delays, you can tweak it - but i can 

This means the user has to see their kernel log fill with such
messages at least once - do a round trip to some mailing list to
have it explained that it is expected and not a kernel bug - then tweak
some obscure parameters. Doesn't seem like a particularly fruitful
procedure to me.

> tell you with 100% certainty that 99.9% of the typical Linux users do 
> not characterize such long delays as "correct behavior".

It's about robustness, not the typical case.
A system that throws backtraces when something slightly unusual happens
is not robust.

-Andi