Message-ID: <1330422535.11248.78.camel@twins>
Subject: Re: [RFC PATCH] kick ksoftirqd more often to please soft lockup detector
From: Peter Zijlstra
To: Dan Williams
Cc: linux-kernel@vger.kernel.org, Jens Axboe, linux-scsi@vger.kernel.org,
    Lukasz Dorau, James Bottomley, Andrzej Jakowski, Thomas Gleixner
Date: Tue, 28 Feb 2012 10:48:55 +0100
In-Reply-To: <20120227203847.22153.62468.stgit@dwillia2-linux.jf.intel.com>
References: <20120227203847.22153.62468.stgit@dwillia2-linux.jf.intel.com>

On Mon, 2012-02-27 at 12:38 -0800, Dan Williams wrote:
> An experimental hack to tease out whether we are continuing to
> run the softirq handler past the point of needing scheduling.
>
> It allows only one trip through __do_softirq() as long as need_resched()
> is set which hopefully creates the back pressure needed to get ksoftirqd
> scheduled.
>
> Targeted to address reports like the following that are produced
> with i/o tests to a sas domain with a large number of disks (48+), and
> lots of debugging enabled (slub_debug, lockdep) that makes the
> block+scsi softirq path more cpu-expensive than normal.
>
> With this patch applied the softlockup detector seems appeased, but it
> seems odd to need changes to kernel/softirq.c so maybe I have overlooked
> something that needs changing at the block/scsi level?
>
> BUG: soft lockup - CPU#3 stuck for 22s!
> [kworker/3:1:78]

So you're stuck in softirq for 22s+ and max_restart is 10, which means
that on average you spend 2.2s+ per softirq invocation. That is
completely, absolutely bonkers. Softirq handlers should never consume a
significant amount of cpu-time.

Thomas, I think it's about time we put something like the below in?

---
 kernel/softirq.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index ff066a4..6137ee1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,7 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	u64 start, callback, now;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -223,6 +224,8 @@ asmlinkage void __do_softirq(void)
 	/* Reset the pending bitmask before enabling irqs */
 	set_softirq_pending(0);
 
+	start = callback = cpu_clock(cpu);
+
 	local_irq_enable();
 
 	h = softirq_vec;
@@ -246,6 +249,15 @@ asmlinkage void __do_softirq(void)
 				preempt_count() = prev_count;
 			}
 
+			now = cpu_clock(cpu);
+			if (now - callback > TICK_NSEC / 4) {
+				printk(KERN_ERR "softirq took longer than 1/4 tick: "
+						"%u %s %p\n", vec_nr,
+						softirq_to_name[vec_nr],
+						h->action);
+			}
+			callback = now;
+
 			rcu_bh_qs(cpu);
 		}
 		h++;
@@ -254,6 +266,10 @@ asmlinkage void __do_softirq(void)
 
 	local_irq_disable();
 
+	now = cpu_clock(cpu);
+	if (now - start > TICK_NSEC / 2)
+		printk(KERN_ERR "softirq loop took longer than 1/2 tick\n");
+
 	pending = local_softirq_pending();
 	if (pending && --max_restart)
 		goto restart;
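
For anyone who wants to poke at the idea outside the kernel, here is a
minimal userspace sketch of the same check: time every handler against a
quarter-tick budget and the whole pass against a half-tick budget. It is
not kernel code. SKETCH_TICK_NSEC (a 4 ms tick, i.e. HZ=250, is assumed),
run_pass(), struct pass_report and the fake clock are all made-up names
for illustration, and the clock is passed in as a function pointer so
that the behaviour is deterministic.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: a 4 ms tick (HZ=250). The kernel derives TICK_NSEC from HZ. */
#define SKETCH_TICK_NSEC 4000000ULL

typedef uint64_t (*clock_fn)(void);	/* hypothetical: returns nanoseconds */
typedef void (*handler_fn)(void);	/* hypothetical stand-in for h->action */

struct pass_report {
	unsigned int slow_handlers;	/* handlers that exceeded 1/4 tick */
	int loop_overrun;		/* whole pass exceeded 1/2 tick */
};

/* Run every handler once and flag the ones that blow their time budget. */
static struct pass_report run_pass(const handler_fn *handlers, size_t n,
				   clock_fn clock)
{
	struct pass_report r = { 0, 0 };
	uint64_t start, callback, now;
	size_t i;

	start = callback = clock();

	for (i = 0; i < n; i++) {
		handlers[i]();

		now = clock();
		if (now - callback > SKETCH_TICK_NSEC / 4) {
			fprintf(stderr,
				"handler %zu took longer than 1/4 tick\n", i);
			r.slow_handlers++;
		}
		callback = now;	/* next handler is timed from here */
	}

	now = clock();
	if (now - start > SKETCH_TICK_NSEC / 2) {
		fprintf(stderr, "pass took longer than 1/2 tick\n");
		r.loop_overrun = 1;
	}

	return r;
}

/* Fake clock: handlers "consume" time by advancing it themselves. */
static uint64_t fake_now_ns;

static uint64_t fake_clock(void)
{
	return fake_now_ns;
}

static void fast_handler(void)
{
	fake_now_ns += 100000;		/* 0.1 ms, well under budget */
}

static void slow_handler(void)
{
	fake_now_ns += 2200000;		/* 2.2 ms, over the 1 ms budget */
}
```

Feeding run_pass() the array { fast_handler, slow_handler, fast_handler }
with fake_clock should flag exactly one slow handler and an overrun of
the pass, since the three handlers add up to 2.4 ms against a 2 ms
budget. A real clock such as clock_gettime(CLOCK_MONOTONIC) could be
wrapped to fit clock_fn if you want to time actual work.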