From: Chuck Lever
Subject: lost interrupt after a signal?
Date: Thu, 22 May 2008 10:57:35 -0400
Message-ID: <2A43EAAA-8AEC-4EA1-AAA6-1AE1C750DB4C@oracle.com>
Mime-Version: 1.0 (Apple Message framework v919.2)
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Cc: Linux NFS Mailing List, Matthew Wilcox
To: Trond Myklebust
Return-path:
Received: from agminet01.oracle.com ([141.146.126.228]:12082 "EHLO agminet01.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752141AbYEVPBZ (ORCPT ); Thu, 22 May 2008 11:01:25 -0400
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

We've been running some tests to understand how the 2.6.25 "intr/nointr" behavior affects signal handling during I/O on NFS mounts.

While running an Oracle database workload, we signal the database (this is a normal way administrative tools control database activity). Subsequently all of the I/O threads block on the inode mutex in nfs_invalidate_mapping() except this one:

INFO: task oracle:27214 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
oracle        D f6d85e84  1592 27214      1
       c93d2920 00200086 00000001 f6d85e84 c04a0080 c04a0080 c04a0080 c93d2b84
       c93d2b84 c4021f80 00000001 cc072000 f341c900 f6d85e7c 10a1a042 f6d85e7c
       cc072ddc c4021f80 03b7e000 cc072ddc c40082b4 c036e21c cc072dd4 00000001
Call Trace:
 [] io_schedule+0x4c/0x90
 [] sync_page+0x2c/0x40
 [] __wait_on_bit_lock+0x45/0x70
 [] sync_page+0x0/0x40
 [] __lock_page+0x73/0x80
 [] wake_bit_function+0x0/0x80
 [] invalidate_inode_pages2_range+0xb8/0x200
 [] nfs_writepages+0x68/0x90 [nfs]
 [] nfs_invalidate_mapping_nolock+0x1f/0xd0 [nfs]
 [] nfs_invalidate_mapping+0x5a/0x60 [nfs]
 [] nfs_file_read+0x85/0x120 [nfs]
 [] do_sync_read+0xd5/0x120
 [] __do_fault+0x1ca/0x400
 [] __update_rq_clock+0x27/0x180
 [] autoremove_wake_function+0x0/0x50
 [] k_getrusage+0x1f5/0x200
 [] security_file_permission+0xc/0x10
 [] rw_verify_area+0x66/0xd0
 [] getrusage+0x22/0x40
 [] vfs_read+0xa1/0x140
 [] do_sync_read+0x0/0x120
 [] sys_pread64+0x6a/0x70
 [] syscall_call+0x7/0xb

I haven't looked too closely at this, but maybe the signal caused a lost I/O interrupt? What would be the next steps to troubleshoot this further?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com