Message-ID: <19f34abd0807180245l2a633644n1a8d91cb3587d9e4@mail.gmail.com>
Date: Fri, 18 Jul 2008 11:45:08 +0200
From: "Vegard Nossum"
To: linux-ext4@vger.kernel.org
Subject: latest -git: A peculiar case of a stuck process (ext3/sched-related?)
Cc: sct@redhat.com, akpm@linux-foundation.org, adilger@sun.com, "Ingo Molnar", "Peter Zijlstra", "Linux Kernel Mailing List"

Hi,

I was running a test which corrupts ext3 filesystem images on purpose. After quite a long time, I have ended up with a grep that runs at 98% CPU and is unkillable even though it is in state R:

root      6573 98.6  0.0   4008   820 pts/0    R    11:17  15:48 grep -r . mnt

It doesn't go away with kill -9 either.
A sysrq-t shows this info:

grep          R running   5704  6573   6552
       f4ff3c3c c0747b19 00000000 f4ff3bd4 c01507ba ffffffff 00000000 f4ff3bf0
       f5992fd0 f4ff3c4c 01597000 00000000 c09cd080 f312afd0 f312b248 c1fb2f80
       00000001 00000002 00000000 f312afd0 f312afd0 f4ff3c24 c015ab70 00000000
Call Trace:
 [] ? schedule+0x459/0x960
 [] ? atomic_notifier_call_chain+0x1a/0x20
 [] ? mark_held_locks+0x40/0x80
 [] ? trace_hardirqs_on+0xb/0x10
 [] ? trace_hardirqs_on_caller+0x116/0x170
 [] preempt_schedule_irq+0x3e/0x70
 [] need_resched+0x1f/0x23
 [] ? ext3_find_entry+0x401/0x6f0
 [] ? __lock_acquire+0x2c9/0x1110
 [] ? slab_pad_check+0x3c/0x120
 [] ? trace_hardirqs_on_caller+0x116/0x170
 [] ? trace_hardirqs_off+0xb/0x10
 [] ext3_lookup+0x3a/0xd0
 [] ? d_alloc+0x133/0x190
 [] do_lookup+0x160/0x1b0
 [] __link_path_walk+0x208/0xdc0
 [] ? lock_release_holdtime+0x83/0x120
 [] ? mnt_want_write+0x4e/0xb0
 [] __link_path_walk+0x8f7/0xdc0
 [] ? trace_hardirqs_off+0xb/0x10
 [] path_walk+0x54/0xb0
 [] do_path_lookup+0x85/0x230
 [] __user_walk_fd+0x38/0x50
 [] vfs_stat_fd+0x21/0x50
 [] ? put_lock_stats+0xd/0x30
 [] ? mntput_no_expire+0x1d/0x110
 [] vfs_stat+0x11/0x20
 [] sys_stat64+0x14/0x30
 [] ? fput+0x1f/0x30
 [] ? trace_hardirqs_on_thunk+0xc/0x10
 [] ? trace_hardirqs_on_caller+0x116/0x170
 [] ? trace_hardirqs_on_thunk+0xc/0x10
 [] sysenter_past_esp+0x78/0xc5
 =======================

..so it's clearly related to the corrupted ext3 filesystem. The strange thing, in my opinion, is this stack frame:

 [] ext3_lookup+0x3a/0xd0

..but this address corresponds to fs/ext3/namei.c:1039:

        bh = ext3_find_entry(dentry, &de);
        inode = NULL;
        if (bh) { /* <--- here */
                unsigned long ino = le32_to_cpu(de->inode);
                brelse (bh);

What happened? Did the scheduler get stuck? Softlockup detection and NMI watchdog are both enabled, but neither of them is triggering.

Trying to strace the problem doesn't really help either:

# strace -p 6573
Process 6573 attached - interrupt to quit
^C^C^C^C

(and hangs unkillably too.)

See full log at: http://folk.uio.no/vegardno/linux/log-1216370788.txt

The machine is still running in the same state and CPU0 is still usable. What more info can I provide to help debug this?


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036