Date: Sun, 24 Aug 2003 00:13:15 +0900
From: TeJun Huh <tejun@aratech.co.kr>
To: Stephan von Krawczynski <skraw@ithnet.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Race condition in 2.4 tasklet handling (cli() broken?)
Message-ID: <20030823151315.GA6781@atj.dyndns.org>
Mail-Followup-To: Stephan von Krawczynski <skraw@ithnet.com>,
	linux-kernel@vger.kernel.org
References: <20030823025448.GA32547@atj.dyndns.org> <20030823040931.GA3872@atj.dyndns.org> <20030823052633.GA4307@atj.dyndns.org> <20030823122813.0c90e241.skraw@ithnet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030823122813.0c90e241.skraw@ithnet.com>
User-Agent: Mutt/1.5.4i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2837
Lines: 60

On Sat, Aug 23, 2003 at 12:28:13PM +0200, Stephan von Krawczynski wrote:
> 
> If we follow your analysis and say it is broken, do you have a suggestion/patch
> how to fix both? I am willing to try your proposals, as it seems I am one of
> very few who really experience stability issues on SMP with the current
> implementation.
> 

 Hello, Stephan.

 The race conditions I'm mentioning in this thread are not likely to
cause real troubles.  The first one does not make any difference on
x86, and AFAIK bh isn't used extensively anymore so the second one
isn't very relevant either.  Only the race condition mentioned in the
other thread is of relvance if there is any :-(.

 We've been also suffering from random lockups (2.4.21 with various
patches including in-kernel irqbalancing) which show symptoms somewhat
different from usual kernel deadlock or panics.  We've seen lock ups
on several different machines.  All the machines were SMP and quite
busy with high volume network traffic and a lot of disk I/Os.  A
lockup takes from a week to a month(!) to take place.  Even though
they take very long, they occur sort of reliably.

 I had a chance to examine a locked up machine (Dual 3g xeon w/HT).  I
could turn on and off keyboard LEDs (so, keyboard irq is working) but
console didn't come back from blanked state.  The kernel was compiled
with sysrq and I've tried many sysrqs but teh console remained blank.
After a while, I pressed sysrq reboot key and it rebooted.
Fortunately, kernel log file did contain all outputs from sysrqs and I
could do a little bit of post-mortem analysis.

 The first weird thing was the timestamps.  Time seemed to be stopped
for a few hours between the lock up and the first sysrq request.
Then, time start to go again after the first sysrq request.  (NMI
watchdog was on)

 Process list showed that a server process is stuck inside kernel, but
the stuck position was very weird.  It was freeing a socket after
receving FIN.  The eip was stuck at the same place over several
sysrqs, and the instruction at the eip was plain ADD right after a
CALL instruction to kfree.  I think there is one more frame above
what's shown but I don't know how sysrq prints stack trace for other
cpus so I'm not sure.

 To gather more information, we hooked up a machine with kdb and are
waiting for the lockup to occur again.  My personal feeling is that
the race conditions I've mentioned are not the causes of the lockups
we're suffering from.

 It would be helpful if you can tell us more about your lockups.  Have
you tried sysrq, NMI watchdog, kdb or kgdb?

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/