2003-09-24 11:38:52

by David Jeffery

[permalink] [raw]
Subject: nfs client process/rpciod deadlock

Please CC: me as I am not subscribed to this list.

I have a problem with processes hanging in D state on a Linux NFS
client. Both the Linux client and the server run stock kernel.org 2.4.22
kernels with no extra drivers or patches. The problem is not new; it
also occurred on older kernel.org and Red Hat kernels I have used.

The full setup is an SMP Linux NFS server, the Linux NFS client, and a
few other Unix clients. Both Linux systems run kernels without highmem.
The problem occurs with both SMP and UP kernels on the client. Under
load, the client periodically gets processes stuck in D state: one or
more worker processes plus rpciod.

A sysrq-T task dump shows the deadlocked processes waiting on a locked
page in ___wait_on_page. (I have the full dump if someone wants to
see it.)


rpciod D F7FBF0A0 4468 749 1 777 750 (L-TLB)
Call Trace:
[___wait_on_page+158/192]
[truncate_list_pages+387/464]
[e100:e100_manage_adaptive_ifs+753/816]
[truncate_inode_pages+94/112]
[iput+201/544]
[nfs3_xdr_commitres+173/224]
[nfs_commit_done+550/1072]
[nfs3_xdr_commitres+0/224]
[__rpc_execute+554/688]
[schedule+756/800]
[__rpc_schedule+179/288]
[rpciod+184/496]
[arch_kernel_thread+38/64]
[rpciod+0/496]


javac D F33D5D40 0 3830 3829 3833 (NOTLB)
Call Trace:
[___wait_on_page+158/192]
[do_generic_file_read+756/1088]
[generic_file_read+137/352]
[file_read_actor+0/176]
[nfs_file_read+146/160]
[sys_read+152/240]
[system_call+51/64]

cp D F33D5DC0 0 3915 3525 (NOTLB)
Call Trace:
[___wait_on_page+158/192]
[do_generic_file_read+756/1088]
[generic_file_read+137/352]
[file_read_actor+0/176]
[nfs_file_read+146/160]
[sys_read+152/240]
[system_call+51/64]

Is this related to the comment in fs/nfs/write.c or is this a different
race condition?

/*
* Update attributes as result of writeback.
* FIXME: There is an inherent race with invalidate_inode_pages and
* writebacks since the page->count is kept > 1 for as long
* as the page has a write request pending.
*/

I'd be happy to test patches. It can take up to a week for the problem
to occur but it has become more frequent with the loads we're putting on
the machine.

David Jeffery



-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2003-09-27 09:47:56

by David Jeffery

[permalink] [raw]
Subject: RE: nfs client process/rpciod deadlock

On Thu, 2003-09-25 at 12:24, Lever, Charles wrote:
> these processes are not deadlocked; they are merely waiting
> for I/O to complete. there may be a deadlock elsewhere, or
> they may be waiting for a reply that never arrived.
>
> what are your NFS mount options? output from "nfsstat"?
> as an experiment, if you use the "soft" mount option, does
> the problem go away?
>

I should have included more information. *sigh*. Never hurry bug
reports.

The mount options are "defaults,wsize=8192,rsize=8192,intr". The hung
processes might "go away" with a soft mount, but only because little
would get done amid all the I/O errors. I periodically see these
messages in the logs:

nfs: server rtp-buildnfs not responding, still trying
nfs: server rtp-buildnfs OK
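
For reference, those options correspond to a mount invocation like the
following (the server name is taken from the log messages above; the
export and mount-point paths are hypothetical):

```
# rsize/wsize/intr as in the report; hard (the default) rather than soft
mount -t nfs -o rw,hard,intr,rsize=8192,wsize=8192 \
    rtp-buildnfs:/export/build /build
```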

The network gets hit hard at night while these jobs run, so latency
spikes and some dropped packets are expected. The "still trying" and
"OK" messages always pair up.

I left out what I now think are some important parts of the sysrq-T
output. While rpciod and a few others are stuck in D state, many other
processes, including kupdated, keep waiting in S state in
nfs_wait_on_request().

kupdated S F33D5EC0 4088 6 1 7 5 (L-TLB)
Call Trace:
[nfs_wait_on_request+156/320]
[nfs_wait_on_requests+92/160]
[nfs_sync_file+72/144]
[nfs_writepage+53/208]
[filemap_fdatasync+117/176]
[nfs_writepage+0/208]
[sync_unlocked_inodes+147/384]
[schedule_timeout+132/160]
[process_timeout+0/80]
[sync_old_buffers+8/64]
[kupdate+228/256]

The others are getting to nfs_sync_file() by calling nfs_flush_file()
when trying to close a file.

I have a theory (and a patch!) about what is going on. But I'm no
expert on the Linux NFS or VFS code, so feel free to tear my theory
apart.

My theory is that there is a nasty race between nfs_writepage_async(),
nfs_sync_file(), and an inode no longer referenced by the dcache:

1. The first write of a page goes through the nfs_writepage_async()
   path, which bumps the inode refcount by one.
2. The page is re-dirtied, and every inode reference other than the
   one held by nfs_writepage_async() goes away.
3. The system tries to write the re-dirtied page. This second write
   locks the page and then waits for the first write's request in
   nfs_sync_file() -> nfs_wait_on_request().
4. The first write completes and calls iput() from
   nfs_inode_remove_request(), dropping the inode refcount to 0.
   iput() finds its way to truncate_inode_pages(), which waits on the
   page locked by the second write.

But nfs_sync_file() will never return and drop the page lock, because
nfs_inode_remove_request() called iput() before unlocking the nfs
request that nfs_sync_file() is waiting on. So the first write needs
the page unlocked to complete its iput() call, while the second write
can't unlock the page until the first write unlocks its nfs request.
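
In call-graph form, the circular wait I'm describing looks like this (a
sketch, not actual code; the frame names match the traces above):

```
second write (kupdated, or a process flushing on close):
    nfs_writepage()
        lock_page()                   <- takes the page lock
        nfs_sync_file()
            nfs_wait_on_request()     <- sleeps until the first write's
                                         request is unlocked

first write (rpciod completing the request):
    nfs_commit_done()
        nfs_inode_remove_request()
            iput()                    <- inode refcount hits 0
                truncate_inode_pages()
                    ___wait_on_page() <- sleeps on the page lock held
                                         by the second write
```

Each side sleeps on a resource held by the other, so neither can make
progress.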

My patch moves the iput() to after the request is unlocked in
nfs_writeback_done() or nfs_commit_done(). It hasn't been properly
tested yet, so we'll see how it goes.
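
The ordering change, in rough pseudocode (a sketch of the idea, not the
actual patch; the exact 2.4.22 structure and function names may
differ):

```
/* before: request-completion path */
remove request from the inode's lists;
iput(inode);        /* may drop refcount to 0 and block in
                       truncate_inode_pages() on a locked page */
unlock request;     /* never reached if iput() blocks above */

/* after: request-completion path */
remove request from the inode's lists;
unlock request;     /* wakes any nfs_wait_on_request() waiters first */
iput(inode);        /* the waiter can now finish and unlock its page */
```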

Now, a normal sys_write() would have a filp holding the dentry and
inode, keeping their refcounts from dropping to 0 during
nfs_sync_file(). But an mmapped write can be delayed until after the
inode references held via the filp and dentry are dropped, correct?
I never reproduced this with the tests I ran, but they never used
mmapped writes, and I believe the nightly workload does use mmapped
writes in places.

This has become a rather long email, and I hope my theory made some
sense. If not, it's still been an educational experience trying to
learn the NFS/VFS code in a couple of days.

David Jeffery


Attachments:
nfsfix.patch (2.91 kB)

2003-09-25 16:25:15

by Lever, Charles

[permalink] [raw]
Subject: RE: nfs client process/rpciod deadlock

these processes are not deadlocked; they are merely waiting
for I/O to complete. there may be a deadlock elsewhere, or
they may be waiting for a reply that never arrived.

what are your NFS mount options? output from "nfsstat"?
as an experiment, if you use the "soft" mount option, does
the problem go away?



> [original report quoted in full; snipped, see the first message above]

