Date: Sun, 4 Apr 2004 15:17:56 +0300
From: Oleg Drokin
To: linux-kernel@vger.kernel.org
Subject: [2.4] NMI WD detected lockup during page alloc
Message-ID: <20040404121756.GA8854@linuxhacker.ru>

Hello!

One of my servers started to experience mysterious hangs after an upgrade
to dual P4 Xeon (HT is enabled now; before that it was running a UP kernel).
So I enabled the NMI watchdog, and it finally triggered recently.

The kernel is 2.4.25+ (pulled from the 2.4 BitKeeper tree on XX/XX, but it
seems the related files in mm/ have not changed since at least January 2004
anyway).

The hardware is a dual P4 Xeon on some Intel-branded server (E7501-based or
something), with 2G of E?? RAM (highmem enabled).

This is what I got on the serial console:

NMI Watchdog detected LOCKUP on CPU2, eip c013b527, registers:
CPU:    2
EIP:    0010:[]    Not tainted
EFLAGS: 00000086
eax: 00000000   ebx: c02dca38   ecx: 000048ce   edx: c02dca38
esi: c02dca74   edi: 00000000   ebp: d34b1e5c   esp: d34b1e30
ds: 0018   es: 0018   ss: 0018
Process mrtg (pid: 14663, stackpage=d34b1000)
Stack: 00038000 00000282 00000000 00015006 00015006 00000286 00000000 c02dca38
       c02dca38 c02dcb38 00000002 d34b1ea0 c013adfa c0139395 d34b1ea0 00000202
       c02dcaec 32353530 d34b1e7c c02dca38 c02dca38 c02dcb34 00000000 000001d2
Call Trace: [] [] [] [] [ ] [] [] [] [] [] [ ] []
Code: f3 90 7e f9 e9 11 f4 ff ff 80 3f 00 f3 90 7e f9 e9 8e fd ff

>>EIP; c013b527 <.text.lock.page_alloc+f/28>   <=====
Trace; c013adfa <__alloc_pages+6a/270>
Trace; c0139395
Trace; c012dc0d
Trace; c012e6d7
Trace; c0119330
Trace; c014bca5
Trace; c0159301
Trace; c014ee56
Trace; c014bd1b
Trace; c0153b3b
Trace; c0118f70
Trace; c01076b0

So it seems the CPU was stuck trying to take zone->lock in
mm/page_alloc.c::rmqueue(). The actual call trace seems to be the following
(there appear to be lots of stale entries on the stack):

rmqueue
__alloc_pages+6a
do_wp_page+6d
handle_mm_fault+f7 (this is in fact handle_pte_fault())
do_page_fault+3c0
error_code+34

I fail to see a path where we could take the lock on the same zone twice on
the same CPU, so maybe the zone structure was somehow corrupted (I do not
have spinlock debugging enabled yet). I do not think there are memory
problems in that box that might explain this, either.

The probability of hangs varies over time: I got the first one the day after
the upgrade (I am not even sure it was the same as this one, since I had no
traces from it), but this second one happened after 2-3 weeks of uptime.

Maybe this will help someone find out what is happening.
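
To make the "same lock taken twice on the same CPU" scenario concrete, here is a minimal user-space sketch in C of a spinlock that records which CPU holds it. It only illustrates the kind of owner check a debugging spinlock can perform. It is not the 2.4 kernel's spinlock code or its debugging option, and every name in it (dbg_spinlock, dbg_spin_init, dbg_spin_lock, dbg_spin_unlock, NO_OWNER) is hypothetical. The CPU id is passed in by the caller, which is also an assumption made for the sake of the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical user-space model of a lock; NOT the kernel's spinlock_t. */
struct dbg_spinlock {
    atomic_int locked;  /* 1 while held, 0 while free */
    atomic_int owner;   /* id of the CPU holding the lock, or NO_OWNER */
};

void dbg_spin_init(struct dbg_spinlock *l)
{
    atomic_init(&l->locked, 0);
    atomic_init(&l->owner, NO_OWNER);
}

/*
 * Try to take the lock on behalf of `cpu`.
 * Returns false if `cpu` already holds it: a plain spinlock would spin
 * forever in that case, which is the kind of lockup an NMI watchdog reports.
 * Otherwise busy-waits until the lock is free and returns true.
 */
bool dbg_spin_lock(struct dbg_spinlock *l, int cpu)
{
    if (atomic_load(&l->owner) == cpu)
        return false;               /* recursive acquire detected */

    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;                           /* spin until the holder releases it */

    atomic_store(&l->owner, cpu);
    return true;
}

void dbg_spin_unlock(struct dbg_spinlock *l, int cpu)
{
    /* Releasing a lock we do not hold would indicate corruption or a bug. */
    assert(atomic_load(&l->owner) == cpu);
    atomic_store(&l->owner, NO_OWNER);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

With such a check, a second dbg_spin_lock() call from the CPU that already holds the lock returns false immediately instead of hanging, so a double acquire can be told apart from a lock word that was corrupted by something else.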

Bye,
    Oleg