2004-06-27 05:12:15

by Garrick Staples

Subject: nfsd threads locked, 2.6.7 & ia64

Well here's something I've not seen yet. All 512 nfsd threads are stuck in IO
wait (state D in ps). All clients are hung on that mount. The actual
filesystem on the server seems fine.

It started at 8am with no messages at all.

Jun 26 08:06:58 hpc-master nagios: SERVICE ALERT: hpc-fs3;nfs;CRITICAL;SOFT;1;CRITICAL: RPC program nfs version 3 udp is not running
Jun 26 08:07:21 hpc934-e0 kernel: nfs: server hpc-fs3 not responding, still trying
Jun 26 08:09:12 hpc972-e0 kernel: nfs: server hpc-fs3 not responding, still trying
Jun 26 08:09:13 hpc941-e0 kernel: nfs: server hpc-fs3 not responding, still trying
...

I can't find anything otherwise wrong with the machine at all, just that nfsd
threads are stuck. rpcinfo and showmount still work fine. Nothing in dmesg or
messages. I'm stuck too.

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California


2004-06-27 19:33:16

by J. Bruce Fields

Subject: Re: nfsd threads locked, 2.6.7 & ia64

On Sat, Jun 26, 2004 at 10:11:29PM -0700, Garrick Staples wrote:
> Well here's something I've not seen yet. All 512 nfsd threads are stuck in IO
> wait (state D in ps). All clients are hung on that mount. The actual
> filesystem on the server seems fine.
>
> It started at 8am with no messages at all.
>
> Jun 26 08:06:58 hpc-master nagios: SERVICE ALERT: hpc-fs3;nfs;CRITICAL;SOFT;1;CRITICAL: RPC program nfs version 3 udp is not running
> Jun 26 08:07:21 hpc934-e0 kernel: nfs: server hpc-fs3 not responding, still trying
> Jun 26 08:09:12 hpc972-e0 kernel: nfs: server hpc-fs3 not responding, still trying
> Jun 26 08:09:13 hpc941-e0 kernel: nfs: server hpc-fs3 not responding, still trying
> ...
>
> I can't find anything otherwise wrong with the machine at all, just that nfsd
> threads are stuck. rpcinfo and showmount still work fine. Nothing in dmesg or
> messages. I'm stuck too.

Does Sysrq-T give you any idea where they're stuck?

--Bruce Fields


2004-06-28 00:46:48

by Garrick Staples

Subject: Re: nfsd threads locked, 2.6.7 & ia64

On Sun, Jun 27, 2004 at 03:33:14PM -0400, J. Bruce Fields alleged:
> On Sat, Jun 26, 2004 at 10:11:29PM -0700, Garrick Staples wrote:
> > Well here's something I've not seen yet. All 512 nfsd threads are stuck in IO
> > wait (state D in ps). All clients are hung on that mount. The actual
> > filesystem on the server seems fine.
> >
> > It started at 8am with no messages at all.
> >
> > Jun 26 08:06:58 hpc-master nagios: SERVICE ALERT: hpc-fs3;nfs;CRITICAL;SOFT;1;CRITICAL: RPC program nfs version 3 udp is not running
> > Jun 26 08:07:21 hpc934-e0 kernel: nfs: server hpc-fs3 not responding, still trying
> > Jun 26 08:09:12 hpc972-e0 kernel: nfs: server hpc-fs3 not responding, still trying
> > Jun 26 08:09:13 hpc941-e0 kernel: nfs: server hpc-fs3 not responding, still trying
> > ...
> >
> > I can't find anything otherwise wrong with the machine at all, just that nfsd
> > threads are stuck. rpcinfo and showmount still work fine. Nothing in dmesg or
> > messages. I'm stuck too.
>
> Does Sysrq-T give you any idea where they're stuck?

I won't know until I get into the machine room tomorrow. Any info I can give
you from a remote shell?

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California


2004-06-28 01:04:14

by NeilBrown

Subject: Re: nfsd threads locked, 2.6.7 & ia64

On Sunday June 27, [email protected] wrote:
> On Sun, Jun 27, 2004 at 03:33:14PM -0400, J. Bruce Fields alleged:
> >
> > Does Sysrq-T give you any idea where they're stuck?
>
> I won't know until I get into the machine room tomorrow. Any info I can give
> you from a remote shell?
>

# echo t > /proc/sysrq-trigger
# dmesg

NeilBrown

> --
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California


2004-06-28 01:48:23

by Garrick Staples

Subject: Re: nfsd threads locked, 2.6.7 & ia64

On Mon, Jun 28, 2004 at 11:04:08AM +1000, Neil Brown alleged:
> On Sunday June 27, [email protected] wrote:
> > On Sun, Jun 27, 2004 at 03:33:14PM -0400, J. Bruce Fields alleged:
> > >
> > > Does Sysrq-T give you any idea where they're stuck?
> >
> > I won't know until I get into the machine room tomorrow. Any info I can give
> > you from a remote shell?
> >
>
> # echo t > /proc/sysrq-trigger
> # dmesg

Learn something new every day :)

As you may remember, I have a pair of these ia64 machines. The one with the
stuck threads is currently only serving 1 xfs filesystem. The other machine
currently has 8 ext3 filesystems.

Below are 3 different traces of nfsd; all 512 threads seem to follow one of
those 3 patterns.


nfsd D a000000100560810 0 2462 1 2406 2404 (L-TLB)

Call Trace:
[<a000000100561a00>] schedule+0xd20/0x12a0
sp=e0000000059cfaa0 bsp=e0000000059c9290
[<a000000100560810>] __down+0x210/0x320
sp=e0000000059cfab0 bsp=e0000000059c9230
[<a000000200723ef0>] linvfs_writev+0x290/0x320 [xfs]
sp=e0000000059cfae0 bsp=e0000000059c91d8
[<a000000100126110>] do_readv_writev+0x330/0x500
sp=e0000000059cfc10 bsp=e0000000059c9170
[<a0000002003e8be0>] nfsd_write+0x1c0/0x7e0 [nfsd]
sp=e0000000059cfc90 bsp=e0000000059c90f8
[<a0000002003fc360>] nfsd3_proc_write+0x180/0x260 [nfsd]
sp=e0000000059cfdf0 bsp=e0000000059c90a8
[<a0000002003df030>] nfsd_dispatch+0x290/0x540 [nfsd]
sp=e0000000059cfdf0 bsp=e0000000059c9058
[<a00000020039bde0>] svc_process+0x10a0/0x1380 [sunrpc]
sp=e0000000059cfdf0 bsp=e0000000059c8fe8
[<a0000002003de8e0>] nfsd+0x500/0x9c0 [nfsd]
sp=e0000000059cfe00 bsp=e0000000059c8ee8
[<a00000010001b380>] kernel_thread_helper+0xe0/0x100
sp=e0000000059cfe30 bsp=e0000000059c8ec0
[<a000000100009080>] start_kernel_thread+0x20/0x40
sp=e0000000059cfe30 bsp=e0000000059c8ec0

nfsd D a000000100560810 0 2400 1 2458 2455 (L-TLB)

Call Trace:
[<a000000100561a00>] schedule+0xd20/0x12a0
sp=e00000003e13fcb0 bsp=e00000003e1391b8
[<a000000100560810>] __down+0x210/0x320
sp=e00000003e13fcc0 bsp=e00000003e139158
[<a0000002003e7cc0>] nfsd_sync+0x240/0x280 [nfsd]
sp=e00000003e13fcf0 bsp=e00000003e139118
[<a0000002003e9360>] nfsd_commit+0x160/0x180 [nfsd]
sp=e00000003e13fcf0 bsp=e00000003e1390e8
[<a0000002003fe720>] nfsd3_proc_commit+0x180/0x220 [nfsd]
sp=e00000003e13fdf0 bsp=e00000003e1390a8
[<a0000002003df030>] nfsd_dispatch+0x290/0x540 [nfsd]
sp=e00000003e13fdf0 bsp=e00000003e139058
[<a00000020039bde0>] svc_process+0x10a0/0x1380 [sunrpc]
sp=e00000003e13fdf0 bsp=e00000003e138fe8
[<a0000002003de8e0>] nfsd+0x500/0x9c0 [nfsd]
sp=e00000003e13fe00 bsp=e00000003e138ee8
[<a00000010001b380>] kernel_thread_helper+0xe0/0x100
sp=e00000003e13fe30 bsp=e00000003e138ec0
[<a000000100009080>] start_kernel_thread+0x20/0x40
sp=e00000003e13fe30 bsp=e00000003e138ec0


nfsd D a000000100562bd0 0 2455 1 2400 2399 (L-TLB)

Call Trace:
[<a000000100561a00>] schedule+0xd20/0x12a0
sp=e000000005aa76c0 bsp=e000000005aa18b0
[<a000000100562bd0>] io_schedule+0x70/0xa0
sp=e000000005aa76d0 bsp=e000000005aa1898
[<a0000001000dbf00>] __lock_page+0x260/0x2e0
sp=e000000005aa76d0 bsp=e000000005aa1860
[<a000000100178730>] mpage_writepages+0x290/0x700
sp=e000000005aa7750 bsp=e000000005aa1780
[<a0000001000e8c00>] do_writepages+0xe0/0x100
sp=e000000005aa7800 bsp=e000000005aa1758
[<a0000001000daf00>] __filemap_fdatawrite+0x160/0x180
sp=e000000005aa7800 bsp=e000000005aa1738
[<a00000020072eb60>] xfs_flush_inode+0x40/0x60 [xfs]
sp=e000000005aa7880 bsp=e000000005aa1718
[<a0000002006da680>] xfs_flush_space+0x1c0/0x200 [xfs]
sp=e000000005aa7880 bsp=e000000005aa16f0
[<a0000002006db2a0>] xfs_iomap_write_delay+0x560/0x760 [xfs]
sp=e000000005aa7880 bsp=e000000005aa1618
[<a0000002006da110>] xfs_iomap+0x450/0x800 [xfs]
sp=e000000005aa7930 bsp=e000000005aa15a0
[<a00000020072e060>] xfs_bmap+0x40/0x60 [xfs]
sp=e000000005aa7970 bsp=e000000005aa1558
[<a00000020071c160>] linvfs_get_block_core+0xe0/0x5c0 [xfs]
sp=e000000005aa7970 bsp=e000000005aa14f0
[<a00000010012e790>] __block_prepare_write+0x5f0/0xa60
sp=e000000005aa79b0 bsp=e000000005aa1468
[<a00000010012fc80>] block_prepare_write+0x40/0xa0
sp=e000000005aa79e0 bsp=e000000005aa1438
[<a00000020071d1d0>] linvfs_prepare_write+0x90/0xc0 [xfs]
sp=e000000005aa79e0 bsp=e000000005aa1400
[<a0000001000dfbf0>] generic_file_aio_write_nolock+0x810/0x12e0
sp=e000000005aa79e0 bsp=e000000005aa1308
[<a00000020072d540>] xfs_write+0x3e0/0xda0 [xfs]
sp=e000000005aa7ab0 bsp=e000000005aa1230
[<a000000200723e10>] linvfs_writev+0x1b0/0x320 [xfs]
sp=e000000005aa7ae0 bsp=e000000005aa11d8
[<a000000100126110>] do_readv_writev+0x330/0x500
sp=e000000005aa7c10 bsp=e000000005aa1170
[<a0000002003e8be0>] nfsd_write+0x1c0/0x7e0 [nfsd]
sp=e000000005aa7c90 bsp=e000000005aa10f8
[<a0000002003fc360>] nfsd3_proc_write+0x180/0x260 [nfsd]
sp=e000000005aa7df0 bsp=e000000005aa10a8
[<a0000002003df030>] nfsd_dispatch+0x290/0x540 [nfsd]
sp=e000000005aa7df0 bsp=e000000005aa1058
[<a00000020039bde0>] svc_process+0x10a0/0x1380 [sunrpc]
sp=e000000005aa7df0 bsp=e000000005aa0fe8
[<a0000002003de8e0>] nfsd+0x500/0x9c0 [nfsd]
sp=e000000005aa7e00 bsp=e000000005aa0ee8
[<a00000010001b380>] kernel_thread_helper+0xe0/0x100
sp=e000000005aa7e30 bsp=e000000005aa0ec0
[<a000000100009080>] start_kernel_thread+0x20/0x40
sp=e000000005aa7e30 bsp=e000000005aa0ec0

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California


2004-06-28 02:09:45

by Garrick Staples

Subject: Re: nfsd threads locked, 2.6.7 & ia64

On Sat, Jun 26, 2004 at 10:11:29PM -0700, Garrick Staples alleged:
> Well here's something I've not seen yet. All 512 nfsd threads are stuck in IO
> wait (state D in ps). All clients are hung on that mount. The actual
> filesystem on the server seems fine.

I just ran a few 'find's through the 4TB filesystem and one of them got stuck.
I guess the filesystem hosed itself. Guess I'll go back to reiser.

*sigh* I just found this patch for random data corruption over NFS.
http://linus.bkbits.net:8080/linux-2.5/cset@40d263bcB5wLHA2ALqKxBm1mzuGiFw?nav=index.html|tags|[email protected]..

But it doesn't mention anything about getting stuck in IO wait.

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California


2004-06-28 02:19:15

by NeilBrown

Subject: Re: nfsd threads locked, 2.6.7 & ia64

On Sunday June 27, [email protected] wrote:
>
> As you may remember, I have a pair of these ia64 machines. The one with the
> stuck threads is currently only serving 1 xfs filesystem. The other machine
> currently has 8 ext3 filesystems.

Sounds like the finger is pointing at xfs ....

>
> Below are 3 different traces of nfsd; all 512 threads seem to follow one of
> those 3 patterns.
>
>
> nfsd D a000000100560810 0 2462 1 2406 2404 (L-TLB)
>
> Call Trace:
> [<a000000100561a00>] schedule+0xd20/0x12a0
> sp=e0000000059cfaa0 bsp=e0000000059c9290
> [<a000000100560810>] __down+0x210/0x320
> sp=e0000000059cfab0 bsp=e0000000059c9230
> [<a000000200723ef0>] linvfs_writev+0x290/0x320 [xfs]
> sp=e0000000059cfae0 bsp=e0000000059c91d8

So this is waiting to "down" inode->i_sem. Someone must be holding
that semaphore already...

>
> nfsd D a000000100560810 0 2400 1 2458 2455 (L-TLB)
>
> Call Trace:
> [<a000000100561a00>] schedule+0xd20/0x12a0
> sp=e00000003e13fcb0 bsp=e00000003e1391b8
> [<a000000100560810>] __down+0x210/0x320
> sp=e00000003e13fcc0 bsp=e00000003e139158
> [<a0000002003e7cc0>] nfsd_sync+0x240/0x280 [nfsd]
> sp=e00000003e13fcf0 bsp=e00000003e139118

This is waiting on the same semaphore....

>
>
> nfsd D a000000100562bd0 0 2455 1 2400 2399 (L-TLB)
>

This one is holding the semaphore (below)....
> Call Trace:
> [<a000000100561a00>] schedule+0xd20/0x12a0
> sp=e000000005aa76c0 bsp=e000000005aa18b0
> [<a000000100562bd0>] io_schedule+0x70/0xa0
> sp=e000000005aa76d0 bsp=e000000005aa1898
> [<a0000001000dbf00>] __lock_page+0x260/0x2e0
> sp=e000000005aa76d0 bsp=e000000005aa1860

... and it seems to be trying to lock a page. I wonder why it can't.
Neither of the other threads should be holding a lock on this page.
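
For illustration, here is a minimal userspace analogue of the dependency
described above. This is a hypothetical sketch, not the actual kernel code:
"i_sem" stands in for the per-inode semaphore and "page_lock" for the page
lock that never comes free. One thread takes i_sem and then blocks forever on
the page lock; every other thread queues up behind i_sem, which matches the
three trace patterns.

/* deadlock_sketch.c - illustrative only; names are made up, not kernel APIs */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t i_sem     = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

/* Like the third trace: holds i_sem, then blocks waiting for the page lock. */
static void *writer_holding_i_sem(void *arg)
{
    pthread_mutex_lock(&i_sem);        /* taken on entry to the write path */
    printf("writer: holding i_sem, waiting for page lock...\n");
    pthread_mutex_lock(&page_lock);    /* never acquired -> stuck in "D"   */
    pthread_mutex_unlock(&page_lock);
    pthread_mutex_unlock(&i_sem);
    return NULL;
}

/* Like the first two traces: blocked in __down() on the same i_sem. */
static void *writer_waiting_on_i_sem(void *arg)
{
    printf("nfsd %ld: waiting for i_sem...\n", (long)arg);
    pthread_mutex_lock(&i_sem);        /* queues behind the holder          */
    pthread_mutex_unlock(&i_sem);
    return NULL;
}

int main(void)
{
    pthread_t holder, waiters[3];

    /* Simulate whatever is keeping the page locked and never unlocking it. */
    pthread_mutex_lock(&page_lock);

    pthread_create(&holder, NULL, writer_holding_i_sem, NULL);
    sleep(1);
    for (long i = 0; i < 3; i++)
        pthread_create(&waiters[i], NULL, writer_waiting_on_i_sem, (void *)i);

    /* Everything is now blocked, mirroring the 512 stuck nfsd threads. */
    pthread_join(holder, NULL);        /* never returns */
    return 0;
}

Built with something like "gcc -pthread deadlock_sketch.c", it prints the
waiting messages and then hangs, which is essentially what the stuck nfsd
threads are doing; the open question is why the page lock is never released.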

Maybe put the traces of all the threads on a website somewhere, and send mail
to [email protected] suggesting that you might have hit an XFS
problem.

NeilBrown

> [<a000000100178730>] mpage_writepages+0x290/0x700
> sp=e000000005aa7750 bsp=e000000005aa1780
> [<a0000001000e8c00>] do_writepages+0xe0/0x100
> sp=e000000005aa7800 bsp=e000000005aa1758
> [<a0000001000daf00>] __filemap_fdatawrite+0x160/0x180
> sp=e000000005aa7800 bsp=e000000005aa1738
> [<a00000020072eb60>] xfs_flush_inode+0x40/0x60 [xfs]
> sp=e000000005aa7880 bsp=e000000005aa1718
> [<a0000002006da680>] xfs_flush_space+0x1c0/0x200 [xfs]
> sp=e000000005aa7880 bsp=e000000005aa16f0
> [<a0000002006db2a0>] xfs_iomap_write_delay+0x560/0x760 [xfs]
> sp=e000000005aa7880 bsp=e000000005aa1618
> [<a0000002006da110>] xfs_iomap+0x450/0x800 [xfs]
> sp=e000000005aa7930 bsp=e000000005aa15a0
> [<a00000020072e060>] xfs_bmap+0x40/0x60 [xfs]
> sp=e000000005aa7970 bsp=e000000005aa1558
> [<a00000020071c160>] linvfs_get_block_core+0xe0/0x5c0 [xfs]
> sp=e000000005aa7970 bsp=e000000005aa14f0
> [<a00000010012e790>] __block_prepare_write+0x5f0/0xa60
> sp=e000000005aa79b0 bsp=e000000005aa1468
> [<a00000010012fc80>] block_prepare_write+0x40/0xa0
> sp=e000000005aa79e0 bsp=e000000005aa1438
> [<a00000020071d1d0>] linvfs_prepare_write+0x90/0xc0 [xfs]
> sp=e000000005aa79e0 bsp=e000000005aa1400
> [<a0000001000dfbf0>] generic_file_aio_write_nolock+0x810/0x12e0
> sp=e000000005aa79e0 bsp=e000000005aa1308
> [<a00000020072d540>] xfs_write+0x3e0/0xda0 [xfs]
> sp=e000000005aa7ab0 bsp=e000000005aa1230
> [<a000000200723e10>] linvfs_writev+0x1b0/0x320 [xfs]
> sp=e000000005aa7ae0 bsp=e000000005aa11d8
*** This is where the sem is being held ***
> [<a000000100126110>] do_readv_writev+0x330/0x500
> sp=e000000005aa7c10 bsp=e000000005aa1170
> [<a0000002003e8be0>] nfsd_write+0x1c0/0x7e0 [nfsd]
> sp=e000000005aa7c90 bsp=e000000005aa10f8
> [<a0000002003fc360>] nfsd3_proc_write+0x180/0x260 [nfsd]
> sp=e000000005aa7df0 bsp=e000000005aa10a8
> [<a0000002003df030>] nfsd_dispatch+0x290/0x540 [nfsd]
> sp=e000000005aa7df0 bsp=e000000005aa1058
> [<a00000020039bde0>] svc_process+0x10a0/0x1380 [sunrpc]
> sp=e000000005aa7df0 bsp=e000000005aa0fe8
> [<a0000002003de8e0>] nfsd+0x500/0x9c0 [nfsd]
> sp=e000000005aa7e00 bsp=e000000005aa0ee8
> [<a00000010001b380>] kernel_thread_helper+0xe0/0x100
> sp=e000000005aa7e30 bsp=e000000005aa0ec0
> [<a000000100009080>] start_kernel_thread+0x20/0x40
> sp=e000000005aa7e30 bsp=e000000005aa0ec0
>
> --
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California

