From: Stuart Anderson Subject: Re: soft lockup Date: Sat, 30 Jun 2007 20:58:34 -0700 Message-ID: <20070701035833.GA20145@ligo.caltech.edu> References: <20070628235249.GM32531@ligo.caltech.edu> <1183088891.6163.119.camel@heimdal.trondhjem.org> <20070629155831.GA14875@ligo.caltech.edu> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: nfs@lists.sourceforge.net To: Trond Myklebust Return-path: Received: from sc8-sf-mx1-b.sourceforge.net ([10.3.1.91] helo=mail.sourceforge.net) by sc8-sf-list2-new.sourceforge.net with esmtp (Exim 4.43) id 1I4qaF-0002zg-0S for nfs@lists.sourceforge.net; Sat, 30 Jun 2007 20:58:47 -0700 Received: from acrux.ligo.caltech.edu ([131.215.115.14]) by mail.sourceforge.net with esmtps (TLSv1:AES256-SHA:256) (Exim 4.44) id 1I4qaH-0007L0-1u for nfs@lists.sourceforge.net; Sat, 30 Jun 2007 20:58:50 -0700 In-Reply-To: <20070629155831.GA14875@ligo.caltech.edu> List-Id: "Discussion of NFS under Linux development, interoperability, and testing." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: nfs-bounces@lists.sourceforge.net Errors-To: nfs-bounces@lists.sourceforge.net Here is another soft lockup after running dmesg -s 90000 that shows a deeper call trace, but still has EOI. The 19 occurances after this one are back to the shallower call trace. Do I have to recompile the kernel to get a larger kernel ring buffer for dmesg to query? Thanks. kernel: BUG: soft lockup detected on CPU#0! kernel: kernel: Call Trace: kernel: [] softlockup_tick+0xfc/0x140 kernel: [] update_process_times+0x57/0x90 kernel: [] smp_local_timer_interrupt+0x34/0x60 kernel: [] smp_apic_timer_interrupt+0x59/0x80 kernel: [] apic_timer_interrupt+0x66/0x70 kernel: [] ip_finish_output+0x0/0x1b0 kernel: [] dev_queue_xmit+0x102/0x2b0 kernel: [] ip_output+0x208/0x260 kernel: [] ip_queue_xmit+0x446/0x4a0 kernel: [] :ip_conntrack:__ip_ct_refresh_acct+0x73/0xa0 kernel: [] :ip_conntrack:tcp_packet+0xa3f/0xa90 kernel: [] tcp_transmit_skb+0x6d0/0x710 kernel: [] __alloc_skb+0x8f/0x160 kernel: [] tcp_rcv_established+0x631/0x9b0 kernel: [] tcp_v4_do_rcv+0x3f/0x330 kernel: [] nf_iterate+0x5c/0xa0 kernel: [] tcp_v4_rcv+0x8ce/0x940 kernel: [] ip_local_deliver+0x131/0x1d0 kernel: [] ip_rcv+0x4de/0x550 kernel: [] process_backlog+0xa6/0x150 kernel: [] net_rx_action+0x84/0x180 kernel: [] __do_softirq+0x63/0xd0 kernel: [] :sunrpc:rpc_async_schedule+0x0/0x10 kernel: [] call_softirq+0x1c/0x30 kernel: [] do_softirq+0x2c/0x90 kernel: [] do_IRQ+0xd9/0x100 kernel: [] ret_from_intr+0x0/0xa kernel: [] :sunrpc:unx_validate+0x0/0x70 kernel: [] _raw_spin_lock+0xb1/0x150 kernel: [] :nfs:nfs3_xdr_readres+0x0/0x170 kernel: [] lock_kernel+0x1d/0x30 kernel: [] :sunrpc:call_decode+0x5ba/0x660 kernel: [] :sunrpc:rpc_async_schedule+0x0/0x10 kernel: [] :sunrpc:__rpc_execute+0x8e/0x280 kernel: [] run_workqueue+0xae/0x160 kernel: [] worker_thread+0x0/0x190 kernel: [] keventd_create_kthread+0x0/0x90 kernel: [] worker_thread+0x151/0x190 kernel: [] default_wake_function+0x0/0x10 kernel: [] worker_thread+0x0/0x190 kernel: [] kthread+0xd9/0x120 kernel: [] child_rip+0xa/0x12 kernel: [] keventd_create_kthread+0x0/0x90 kernel: [] kthread+0x0/0x120 kernel: [] child_rip+0x0/0x12 kernel: On Fri, Jun 29, 2007 at 08:58:31AM -0700, Stuart Anderson wrote: > On Thu, Jun 28, 2007 at 11:48:11PM -0400, Trond Myklebust wrote: > > On Thu, 2007-06-28 at 16:52 -0700, Stuart Anderson wrote: > > > I am currently getting a soft lockup every few minutes on a Sun X4600M2 > > > (8 x Opteron 8218) running 2.6.20.14 plus Tronds revalidate-the-fsid patch > > > and Jeff's O_EXCL-OPEN patch. The machine is still usable but a bit slow and > > > currently the rpciod/0 thread is being reported as using 100%CPU by top. > > > > > > Any ideas? > > > > > > Thanks. > > > > > > kernel: BUG: soft lockup detected on CPU#0! > > > kernel: > > > kernel: Call Trace: > > > kernel: [] softlockup_tick+0xfc/0x140 > > > kernel: [] update_process_times+0x57/0x90 > > > kernel: [] smp_local_timer_interrupt+0x34/0x60 > > > kernel: [] smp_apic_timer_interrupt+0x59/0x80 > > > kernel: [] apic_timer_interrupt+0x66/0x70 > > > kernel: [] :nfs:nfs3_xdr_readres+0x0/0x170 > > > kernel: [] _raw_spin_lock+0xb1/0x150 > > > kernel: [] :nfs:nfs3_xdr_readres+0x0/0x170 > > > kernel: [] lock_kernel+0x1d/0x30 > > > kernel: [] :sunrpc:rpc_exit_task+0x1f/0x90 > > > kernel: [] :sunrpc:__rpc_execute+0x8e/0x280 > > > kernel: [] run_workqueue+0xae/0x160 > > > kernel: [] worker_thread+0x0/0x190 > > > kernel: [] keventd_create_kthread+0x0/0x90 > > > kernel: [] worker_thread+0x151/0x190 > > > kernel: [] default_wake_function+0x0/0x10 > > > kernel: [] worker_thread+0x0/0x190 > > > kernel: [] kthread+0xd9/0x120 > > > kernel: [] child_rip+0xa/0x12 > > > kernel: [] keventd_create_kthread+0x0/0x90 > > > kernel: [] kthread+0x0/0x120 > > > kernel: [] child_rip+0x0/0x12 > > > kernel: > > > > I don't see how anything could lock up in nfs3_xdr_readres(). There > > should be no loops in that routine at all. > > > > What is that "" in the above dump? Is that syslog failing to record > > something? > > syslog is working as far as I can tell. What does EOI indicate (End Of Input?) > > > If so, could you please try using 'dmesg -s 90000' in order to recover > > the missing info? > > I have increased the buffer for the next occurance. However, after getting > 268 soft lockups yesterday they are no longer occuring today. The rpciod/0 > kernel thread is now using "just" ~70% of one cpu compared to 100% when we > where getting the lockups yesterday. > > Only about half of the soft lockups refer to nfs3_xdr_readres. They all > have an and all but one has "sunrpc:__rpc_execute" > > Here is example that does not refer to nfs3_xdr_readres. Perhaps this > is more general than NFS? though we did not see these until we started > using NFSV4 client mounts. The other change we made when adding NFSV4 > mounts was updating nfs-utils to 1.0.9-16 (CentOS 5 patches) to avoid > some of the known memory leaks/spin loops in rpc.idmapd and rpc.mountd. > > kernel: BUG: soft lockup detected on CPU#0! > kernel: > kernel: Call Trace: > kernel: [] softlockup_tick+0xfc/0x140 > kernel: [] update_process_times+0x57/0x90 > kernel: [] smp_local_timer_interrupt+0x34/0x60 > kernel: [] smp_apic_timer_interrupt+0x59/0x80 > kernel: [] apic_timer_interrupt+0x66/0x70 > kernel: [] _raw_spin_lock+0xb1/0x150 > kernel: [] lock_kernel+0x1d/0x30 > kernel: [] :sunrpc:rpc_release_calldata+0x20/0x50 > kernel: [] :sunrpc:__rpc_execute+0x26f/0x280 > kernel: [] run_workqueue+0xae/0x160 > kernel: [] worker_thread+0x0/0x190 > kernel: [] keventd_create_kthread+0x0/0x90 > kernel: [] worker_thread+0x151/0x190 > kernel: [] default_wake_function+0x0/0x10 > kernel: [] worker_thread+0x0/0x190 > kernel: [] kthread+0xd9/0x120 > kernel: [] child_rip+0xa/0x12 > kernel: [] keventd_create_kthread+0x0/0x90 > kernel: [] kthread+0x0/0x120 > kernel: [] child_rip+0x0/0x12 > kernel: > -- Stuart Anderson anderson@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs