From: "Weathers, Norman R."
Subject: Possible NFS failure with late kernel versions
Date: Wed, 20 May 2009 11:50:02 -0500
Message-ID: <0122F800A3B64C449565A9E8C297701005E6DE4B@hoexmb9.conoco.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To:
Return-path:
Received: from mailman2.ppco.com ([138.32.41.14]:40260 "EHLO mailman2.ppco.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1754178AbZETQuG convert rfc822-to-8bit (ORCPT );
	Wed, 20 May 2009 12:50:06 -0400
Received: from bvlextrd2.conoco.net (bvlextrd2.conoco.net [138.32.41.13])
	by mailman2.ppco.com (Switch-3.3.0/Switch-3.3.0) with ESMTP id n4KGnpjb003082 for ;
	Wed, 20 May 2009 11:49:52 -0500
Received: from bvlexbh5.conoco.net (bvlexbh5.conoco.net [158.139.203.26])
	by mail1.ppco.com (Switch-3.3.0/Switch-3.2.7) with ESMTP id n4KGnDPO009723 for ;
	Wed, 20 May 2009 11:49:16 -0500
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

Hello, list.

I have run across some weird failures as of late. The following is a
kernel bug output from one kernel (2.6.27.24):

------------[ cut here ]------------
WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0xb5/0xf0()
Modules linked in: nfsd lockd nfs_acl exportfs autofs4 sunrpc scsi_dh_emc
ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs uinput
iTCO_wdt iTCO_vendor_support ipmi_si iw_nes qla2xxx ipmi_msghandler bnx2
serio_raw pcspkr joydev ib_core i5000_edac hpwdt scsi_transport_fc hpilo
edac_core scsi_tgt libcrc32c dm_round_robin dm_multipath shpchp cciss
[last unloaded: freq_table]
Pid: 3094, comm: nfsd Not tainted 2.6.27.24 #1

Call Trace:
 [] warn_on_slowpath+0x5f/0x90
 [] ? local_bh_enable_ip+0x8c/0xf0
 [] ? _read_unlock_bh+0x10/0x20
 [] ? ipt_do_table+0x1d4/0x550
 [] ? nf_conntrack_in+0x236/0x5d0
 [] ? destroy_conntrack+0xaa/0x110
 [] local_bh_enable_ip+0xb5/0xf0
 [] _spin_unlock_bh+0xf/0x20
 [] destroy_conntrack+0xaa/0x110
 [] nf_conntrack_destroy+0x12/0x20
 [] skb_release_all+0xc5/0x100
 [] __kfree_skb+0x11/0xa0
 [] kfree_skb+0x17/0x40
 [] nes_nic_send+0x408/0x4b0 [iw_nes]
 [] ? neigh_resolve_output+0x10c/0x2d0
 [] nes_netdev_start_xmit+0x109/0xa60 [iw_nes]
 [] ? __nf_ct_refresh_acct+0x99/0x190
 [] ? tcp_packet+0xa42/0xeb0
 [] ? ip_queue_xmit+0x1e4/0x3b0
 [] ? ipt_do_table+0x1d4/0x550
 [] ? local_bh_enable_ip+0x8c/0xf0
 [] ? _read_unlock_bh+0x10/0x20
 [] ? ipt_do_table+0x1d4/0x550
 [] ? nf_conntrack_in+0x236/0x5d0
 [] dev_hard_start_xmit+0x21d/0x2a0
 [] __qdisc_run+0x1ee/0x230
 [] dev_queue_xmit+0x2f8/0x580
 [] neigh_resolve_output+0x10c/0x2d0
 [] ip_finish_output+0x1cc/0x2f0
 [] ip_output+0x65/0xb0
 [] ip_local_out+0x20/0x30
 [] ip_queue_xmit+0x1e4/0x3b0
 [] tcp_transmit_skb+0x4eb/0x760
 [] tcp_send_ack+0xd7/0x110
 [] __tcp_ack_snd_check+0x5c/0xc0
 [] tcp_rcv_established+0x6e9/0x9e0
 [] tcp_v4_do_rcv+0x2c0/0x410
 [] ? lock_sock_nested+0xbc/0xd0
 [] release_sock+0x65/0xd0
 [] tcp_ioctl+0xc1/0x190
 [] inet_ioctl+0x27/0xc0
 [] kernel_sock_ioctl+0x3a/0x60
 [] svc_tcp_recvfrom+0x11d/0x450 [sunrpc]
 [] svc_recv+0x560/0x850 [sunrpc]
 [] ? default_wake_function+0x0/0x10
 [] nfsd+0xdd/0x2d0 [nfsd]
 [] ? nfsd+0x0/0x2d0 [nfsd]
 [] ? nfsd+0x0/0x2d0 [nfsd]
 [] kthread+0x49/0x90
 [] child_rip+0xa/0x11
 [] ? restore_args+0x0/0x30
 [] ? kthread+0x0/0x90
 [] ? child_rip+0x0/0x11

---[ end trace 7decf549249f3f2a ]---

I have used 2.6.28.10 and 2.6.29 and they all have this same bug. The
end result is that under heavy load, these servers crash within a few
minutes of emitting this trace.

Hardware:
HP Proliant Server, Dual 3.0 GHz Intel CPUs, 16 GB memory.
Storage: Qlogic QLA2xxx 4 Gb fibre card to EMC CX3-80 (Multipath)
Network: Intel / NetEffect 10 Gb iWarp NE20 (fibre)
OS: Fedora 10

Clients: CentOS 5.2
10 Gb nodes / 10 Gb switches, so a very fast network.

Any assistance would be greatly appreciated. If need be, I can restart
the server under the different kernels and see if I can get the error
from those as well.

Thanks,

Norman Weathers