From: "Murata, Dennis"
Subject: RE: Kernel oops, RHEL 4
Date: Fri, 1 Feb 2008 13:41:16 -0800
Message-ID: <56D01C1B776AEE4385D29F25E83E515C0D4069F0@0599-its-exmb02.us.saic.com>
References: <56D01C1B776AEE4385D29F25E83E515C0D3166A1@0599-its-exmb02.us.saic.com> <479F4EE3.8060708@RedHat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Cc:
To: "Steve Dickson"
Return-path:
Received: from cpmx2.mail.saic.com ([139.121.17.172]:50165 "EHLO cpmx2.mail.saic.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755366AbYBAVzs convert rfc822-to-8bit (ORCPT ); Fri, 1 Feb 2008 16:55:48 -0500
In-Reply-To: <479F4EE3.8060708-AfCzQyP5zfLQT0dZR+AlfA@public.gmane.org>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

> -----Original Message-----
> From: Steve Dickson [mailto:SteveD@redhat.com]
> Sent: Tuesday, January 29, 2008 10:06 AM
> To: Murata, Dennis
> Cc: linux-nfs@vger.kernel.org
> Subject: Re: Kernel oops, RHEL 4
>
> Murata, Dennis wrote:
> > We have had two system crashes in the past two weeks of a RHEL 4U2 nfs
> > server. The server is running with 128 nfsd daemons, has 6GB of
> > memory, kernel is 2.6.9-22.Elsmp on Dell 2850 4 cpu server. When the
> > kernel oops occurs, the system must be rebooted from the DRAC. The
> > server has approximately 600 clients. Something very curious to me is
> > the crashes both occurred on a Sunday, when there was little or no
> > client activity.
> >
> > I am enclosing part of the output from crash, we do have diskdump
> > enabled. I haven't looked at the dump myself, but am enclosing
> > comments from a fellow admin:
> >
> > Here's what I found from the core dump. The panic was caused by nfsd,
> > but it's hard to tell exactly what triggered it. The next call in the
> > stack was to ext3, so it could be a combination of ext3 and NFS.
> > That's just speculation, but we may see improvement with a newer kernel.
> > Shawn
> > crash> sys
> >       KERNEL: /usr/lib/debug/lib/modules/2.6.9-22.ELsmp/vmlinux
> >     DUMPFILE: vmcore
> >         CPUS: 4
> >         DATE: Sun Jan 27 10:12:10 2008
> >       UPTIME: 13 days, 23:40:25
> > LOAD AVERAGE: 1.13, 1.16, 1.13
> >        TASKS: 268
> >     NODENAME: cis2
> >      RELEASE: 2.6.9-22.ELsmp
> >      VERSION: #1 SMP Mon Sep 19 18:00:54 EDT 2005
> >      MACHINE: x86_64 (3591 Mhz)
> >       MEMORY: 7 GB
> >        PANIC: "Oops: 0000 [1] SMP " (check log for details)
> > crash> log
> > [shortened for brevity]
> > Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> > {rb_insert_color+30}
> > PML4 66da5067 PGD 193413067 PMD 0
> > Oops: 0000 [1] SMP
> > CPU 2
> > Modules linked in: scsi_dump diskdump nfs nfsd exportfs lockd md5 ipv6
> > autofs4 sunrpc ds yenta_socket pcmcia_core dm_mirror dm_mod joydev
> > button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 floppy sg
> > ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
> > Pid: 12055, comm: nfsd Not tainted 2.6.9-22.ELsmp
> > RIP: 0010:[] {rb_insert_color+30}
> > RSP: 0018:00000101b9d7d870 EFLAGS: 00010246
> > RAX: 00000000f1927c6e RBX: 00000101ba07e508 RCX: 0000000000000000
> > RDX: 00000101ba07e500 RSI: 00000101bda91880 RDI: 00000101bd374188
> > RBP: 0000000000000000 R08: 00000101bd374180 R09: 00000000de3f5426
> > R10: 0000000007070707 R11: 0000000007070707 R12: 00000101bd374188
> > R13: 00000101bda91880 R14: 00000101bda91880 R15: 00000000f1e84300
> > FS: 0000002a9589fb00(0000) GS:ffffffff804d3200(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000000018 CR3: 00000000bff3e000 CR4: 00000000000006e0
> > Process nfsd (pid: 12055, threadinfo 00000101b9d7c000, task 00000101b9d517f0)
> > Stack: 000001008463a18c 0000000000000040 00000101ba07e518 00000101ba07e508
> >        ffffffffa004f894 de3f5426ba0234a8 000001008463a18c 00000101ba0234a8
> >        00000101b9d7d968 000001008463aff8
> > Call Trace:{:ext3:ext3_htree_store_dirent+274}
> >        {:ext3:htree_dirblock_to_tree+144}
> >        {:ext3:ext3_htree_fill_tree+119}
> >        {cfq_next_request+59}
> >        {:exportfs:filldir_one+0}
> >        {:ext3:ext3_readdir+371} {iput+77}
> >        {:exportfs:filldir_one+0}
> >        {:ext3:ext3_get_parent+148}
> >        {:exportfs:filldir_one+0}
> >        {vfs_readdir+155}
> >        {:exportfs:get_name+190}
> >        {:exportfs:find_exported_dentry+859}
> >        {:nfsd:nfsd_acceptable+0}
> >        {qdisc_restart+30}
> >        {dev_queue_xmit+525}
> >        {ip_finish_output+356}
> >        {ip_push_pending_frames+833}
> >        {recalc_task_prio+337}
> >        {udp_push_pending_frames+548}
> >        {release_sock+16}
> >        {activate_task+124}
> >        {try_to_wake_up+734}
> >        {:nfsd:svc_expkey_lookup+623}
> >        {set_current_groups+376}
> >        {:exportfs:export_decode_fh+87}
> >        {:nfsd:fh_verify+1049}
> >        {:nfsd:nfsd3_proc_getattr+133}
> >        {:nfsd:nfsd_dispatch+219}
> >        {:sunrpc:svc_process+1160}
> >        {default_wake_function+0}
> >        {:nfsd:nfsd+0} {:nfsd:nfsd+568}
> >        {child_rip+8} {:nfsd:nfsd+0}
> >        {:nfsd:nfsd+0} {child_rip+0}
> > Code: 48 8b 45 18 48 39 c3 75 44 48 8b 45 10 48 85 c0 74 06 83 78
> > RIP {rb_insert_color+30} RSP <00000101b9d7d870>
> > CR2: 0000000000000018
> > crash> bt
> > PID: 12055 TASK: 101b9d517f0 CPU: 2 COMMAND: "nfsd"
> > #0 [101b9d7d6a0] start_disk_dump at ffffffffa023828f
> > #1 [101b9d7d6d0] try_crashdump at ffffffff8014a8f2
> > #2 [101b9d7d6e0] do_page_fault at ffffffff80123572
> > #3 [101b9d7d740] thread_return at ffffffff80303358
> > #4 [101b9d7d7c0] error_exit at ffffffff80110aed
> > RIP: ffffffff801e729c RSP: 00000101b9d7d870 RFLAGS: 00010246
> > RAX: 00000000f1927c6e RBX: 00000101ba07e508 RCX: 0000000000000000
> > RDX: 00000101ba07e500 RSI: 00000101bda91880 RDI: 00000101bd374188
> > RBP: 0000000000000000 R8: 00000101bd374180 R9: 00000000de3f5426
> > R10: 0000000007070707 R11: 0000000007070707 R12: 00000101bd374188
> > R13: 00000101bda91880 R14: 00000101bda91880 R15: 00000000f1e84300
> > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > #5 [101b9d7d890] ext3_htree_store_dirent at ffffffffa004f894
> > #6 [101b9d7d8d0] htree_dirblock_to_tree at ffffffffa005539e
> > #7 [101b9d7d920] ext3_htree_fill_tree at ffffffffa0055460
> > #8 [101b9d7d980] cfq_next_request at ffffffff802501d7
> > #9 [101b9d7d9c0] ext3_readdir at ffffffffa004faba
> > #10 [101b9d7d9e0] iput at ffffffff8018e923
> > #11 [101b9d7da20] ext3_get_parent at ffffffffa0055dc1
> > #12 [101b9d7dac0] vfs_readdir at ffffffff80188723
> > #13 [101b9d7daf0] get_name at ffffffffa01b872d
> > #14 [101b9d7db40] find_exported_dentry at ffffffffa01b835b
> > #15 [101b9d7db90] qdisc_restart at ffffffff802b8258
> > #16 [101b9d7dbd0] dev_queue_xmit at ffffffff802a9ab7
> > #17 [101b9d7dbf0] ip_finish_output at ffffffff802c5555
> > #18 [101b9d7dc20] ip_push_pending_frames at ffffffff802c75f7
> > #19 [101b9d7dc60] recalc_task_prio at ffffffff801313f5
> > #20 [101b9d7dc70] udp_push_pending_frames at ffffffff802e2043
> > #21 [101b9d7dc90] release_sock at ffffffff802a3798
> > #22 [101b9d7dcd0] activate_task at ffffffff80131483
> > #23 [101b9d7dd00] try_to_wake_up at ffffffff80131931
> > #24 [101b9d7dd10] svc_expkey_lookup at ffffffffa01c1a9b
> > #25 [101b9d7dd70] set_current_groups at ffffffff80145092
> > #26 [101b9d7ddb0] export_decode_fh at ffffffffa01b88f6
> > #27 [101b9d7ddc0] fh_verify at ffffffffa01bdd43
> > #28 [101b9d7de30] nfsd3_proc_getattr at ffffffffa01c64fc
> > #29 [101b9d7de60] nfsd_dispatch at ffffffffa01bb7af
> > #30 [101b9d7de90] svc_process at ffffffffa012d240
> > #31 [101b9d7def0] nfsd at ffffffffa01bb534
> > #32 [101b9d7df50] kernel_thread at ffffffff80110ca3
> >
> > We have many identical servers at different sites that don't seem to
> > have this problem. The only real difference is transport, we are the
> > only site using udp rather than tcp.
> > Is the kernel oops caused by nfsd? Would a system/kernel upgrade fix
> > this. We are looking at upgrading to RHEL 4 U6.
> IMHO... this clearly looks like an ext3 problem to me.
> The fact that only one of your identical servers is seeing this
> problem is just good luck or bad luck depending on how you
> look at it... ;-) Maybe the disk on the one server might be
> having problems... I would look for other errors in
> /var/log/messages prior to this crash.
>
> It's always a good thing to keep updated to the latest
> released kernel, but without searching bugzilla.redhat.com,
> this problem may or may not be fixed...
>
> steved.
>

We have looked at all the logs we have available; the only errors are the
ones from diskdump. The server has mirrored disks for the OS and a separate
RAID array for the data. If there is an error on the data disks, it should
not cause a kernel oops, should it? I really didn't see anything in bugzilla
that seemed to be specifically about ext3. Does this imply the OS should be
reloaded? I will search for an ext3 mailing list.

Thanks.
Wayne
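
For reference, the log check suggested in the quoted reply (look for other
errors in the system log prior to the crash) could be scripted roughly as
below. This is only a sketch under assumptions: the function name
errors_before and the keyword list are illustrative and not from this thread,
and the log is assumed to use the classic syslog "Mon DD HH:MM:SS" prefix,
which carries no year (so the year is borrowed from the crash time and a log
spanning a year boundary would be misdated).

```python
from datetime import datetime

# Hypothetical keyword list: substrings that often accompany disk or
# filesystem trouble in kernel logs. Adjust for the system in question.
KEYWORDS = ("EXT3-fs error", "I/O error", "SCSI error")


def errors_before(lines, crash_time, keywords=KEYWORDS):
    """Return log lines stamped before crash_time that mention a keyword.

    Lines without a parsable syslog timestamp are skipped.
    """
    hits = []
    for line in lines:
        try:
            # Syslog timestamps have no year, so prepend the crash year.
            stamp = datetime.strptime(
                f"{crash_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped syslog line
        lowered = line.lower()
        if stamp < crash_time and any(k.lower() in lowered for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits
```

With the crash time reported by "crash> sys" above, one would pass
datetime(2008, 1, 27, 10, 12, 10) as crash_time and an open handle on the
system log as lines. An empty result only means none of the assumed keywords
appeared; it does not rule out a disk or filesystem problem.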