Date: Mon, 7 Jan 2008 09:20:48 -0800
From: Randy Dunlap <randy.dunlap@oracle.com>
To: Brice Figureau <brice+lklm@daysofwonder.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Strange freeze on 2.6.22 (deadlock?)
Message-Id: <20080107092048.4bf12349.randy.dunlap@oracle.com>
In-Reply-To: <1199720934.11173.49.camel@localhost.localdomain>
References: <1199720934.11173.49.camel@localhost.localdomain>
Organization: Oracle Linux Eng.
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2863
Lines: 73

On Mon, 07 Jan 2008 16:48:54 +0100 Brice Figureau wrote:

> Hi,
> 
> I'm seeing a strange complete server freeze/lock-up on an bi-Xeon HT
> amd64 server running standard debian 2.6.22 (and before that vanilla
> 2.6.19.x and 2.6.20.x which exhibited the same issue).
> 
> I'm only reporting it now, since I could get a full sysrq-t only this
> morning.
> 
> The symptoms are that every 5 to 7 days, the server (which acts as a MX
> along with a few low traffic websites) locks-up. The ipmi watchdog is
> unable to reboot the server (and doesn't even trigger, since there is no
> evidence in the esmlog), the machine is still pingable. I can't ssh to
> it, but I can enter my login & password on a serial console, but no
> shell is started.
> 
> Pressing sysrq-t produced the trace hosted here:
> http://www.daysofwonder.com/bug/crash-server1.txt.gz
> 
> It happened one time when I was connected to the server through ssh and
> I could see that the load started to increase well above 100. It was
> then impossible to launch new process from the command-line (and I had
> to reboot manually).
> It happened also last week, and the server was stuck for about 6 hours.
> When I started investigating what was wrong, it slowly came back to life
> (with an avg 1-min load of more than 1500, and tons of cron processes
> running in parallel).
> 
> I'm not really familiar with kernel development so I can't really find
> the issue in the aforementioned trace output.
> What I think is that for some reason there is a race/deadlock that
> finally prevents new processes to really start (which in turns produces
> the high load).
> 
> What seems suspect in the aforementioned trace is:
>  *) lot of processes stacktrace ends in __mod_timer+0xc3/0xd3
> which seems to be this line from kernel/timer.c
> 
> 415	timer->expires = expires;
> 416	internal_add_timer(base, timer);
> -->	spin_unlock_irqrestore(&base->lock, flags);
> 
> 419	return ret;
> 420  }
> 
>  *) lot of processes stacktrace ends in __mutex_lock_slowpath and/or zone_statistics

There are also lots of processes in D state (usually waiting
for I/O to complete).  And jbd is in their stack traces.

How is/are the ext3 filesystems mounted?  I mean what data=xyz
mode?  data=journal (the heaviest duty mode) has at least one
known deadlock.  If you are using data=journal, you could try
switching to data=ordered...


> Anyway, I will soon reboot to a 2.6.23.x to see if that symptom
> persists.
> More information (config, server specs) are available on request.
> 
> I'm not subscribed to the list, so please CC: me for any anwser.
> Many thanks,


---
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/