2009-04-18 05:17:46

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Tue, Mar 03, 2009 at 09:16:07AM -0500, Trond Myklebust wrote:
> On Tue, 2009-03-03 at 13:08 +0100, Kasparek Tomas wrote:
> > On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
> > > A binary wireshark dump of the traffic between one such client and the
> > > server would help.
> >
> > I was able to finally got the tcpdump. I got it from 2.6.27.19 client but
> > after several weeks without problems. I include the file and place it on
> > http://merlin.fit.vutbr.cz/tmp/nfs/dump_kas2_mat.dump_small (have over 1GB
> > of dump, but it's all the time the same SYN+RST packets). The packet rate
> > maxed at 260000pps from two clients.
> >
> > This dump is taken from server after reset (the server does not respond
> > even to keyboard) before clients are disconnected/rebooted. To remind it - all
> > clients seems to work well with reversed
> > e06799f958bf7f9f8fae15f0c6f519953fb0257c
>
> Yes. I saw that behaviour when testing at Connectathon last week. When
> one of the servers I was testing against crashed and later came up
> again, the patched client went into that same SYN+RST frenzy. I'm
> planning to look at this now that I'm back at home.

Hi, I got a bit more data today, as I got to the client early, before it
became unresponsive.

: BUG: soft lockup - CPU#5 stuck for 61s! [rpciod/5:2730]
: Modules linked in: nfsd auth_rpcgss
exportfs i2c_dev i2c_core nfs lockd nfs_acl sunrpc ipv6 xfs dm_mirror
dm_log dm_mod pci_slot fan snd_hda_intel snd_seq_dummy thermal snd_seq_oss
snd_seq_midi_event snd_seq processor igb 8250_pnp sg firewire_ohci
firewire_core crc_itu_t thermal_sys snd_seq_device snd_pcm_oss snd_mixer_oss
evdev snd_pcm hwmon 3w_9xxx inet_lro button snd_timer sr_mod cdrom 8250
serial_core rtc_cmos rtc_core rtc_lib ehci_hcd uhci_hcd snd soundcore snd_page_alloc usbcore
:
: Pid: 2730, comm: rpciod/5 Not tainted (2.6.27.21 #1)
: EIP: 0060:[<c02972c8>] EFLAGS: 00000202 CPU: 5
: EIP is at tcp_connect+0x213/0x2e6
: EAX: c55cf700 EBX: f67b7b40 ECX: 00000002 EDX: ed451d8c
: ESI: c5409380 EDI: 00000000 EBP: 00000001 ESP: f67cfe78
: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
: CR0: 8005003b CR2: b7f7b9c8 CR3: 003b5000 CR4: 000006d0
: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
: DR6: ffff0ff0 DR7: 00000400
: [<c0299c5d>] ? tcp_v4_connect+0x3b2/0x40a
: [<c02a312e>] ? inet_stream_connect+0x87/0x20b
: [<f8b41a88>] ? rpc_wake_up_status+0x33/0x57 [sunrpc]
: [<c026417a>] ? kernel_connect+0xb/0xe
: [<f8b401b3>] ? xs_tcp_finish_connecting+0xe4/0xea [sunrpc]
: [<f8b41145>] ? xs_tcp_connect_worker4+0x0/0x15a [sunrpc]
: [<f8b41221>] ? xs_tcp_connect_worker4+0xdc/0x15a [sunrpc]
: [<c012854e>] ? run_workqueue+0x6a/0xe1
: [<c0128c77>] ? worker_thread+0x0/0x8a
: [<c0128cf6>] ? worker_thread+0x7f/0x8a
: [<c012aeac>] ? autoremove_wake_function+0x0/0x2b
: [<c0128c77>] ? worker_thread+0x0/0x8a
: [<c012ade8>] ? kthread+0x38/0x60
: [<c012adb0>] ? kthread+0x0/0x60
: [<c010371b>] ? kernel_thread_helper+0x7/0x10
: =======================
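
To make the sunrpc part of the trace easier to pick out, here is a minimal
sketch in Python that extracts the symbol and module from each frame. The
regular expression and the helper names (parse_frames, symbols_in_module) are
illustrative assumptions, not part of any kernel tooling, and the sample is
just a few frames copied from the trace above.

```python
import re

# Matches frames such as
# ": [<f8b41221>] ? xs_tcp_connect_worker4+0xdc/0x15a [sunrpc]"
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+\??\s*"
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(trace_text):
    """Return (symbol, module) pairs; frames without a module tag get 'kernel'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("symbol"), match.group("module") or "kernel"))
    return frames


def symbols_in_module(trace_text, module):
    """Return the unique symbols belonging to one module, in trace order."""
    seen = []
    for symbol, mod in parse_frames(trace_text):
        if mod == module and symbol not in seen:
            seen.append(symbol)
    return seen


# A few frames copied from the trace above.
SAMPLE = """\
: [<c0299c5d>] ? tcp_v4_connect+0x3b2/0x40a
: [<c02a312e>] ? inet_stream_connect+0x87/0x20b
: [<f8b41a88>] ? rpc_wake_up_status+0x33/0x57 [sunrpc]
: [<c026417a>] ? kernel_connect+0xb/0xe
: [<f8b401b3>] ? xs_tcp_finish_connecting+0xe4/0xea [sunrpc]
: [<f8b41145>] ? xs_tcp_connect_worker4+0x0/0x15a [sunrpc]
: [<f8b41221>] ? xs_tcp_connect_worker4+0xdc/0x15a [sunrpc]
"""

if __name__ == "__main__":
    print(symbols_in_module(SAMPLE, "sunrpc"))
```

For the frames shown, the sunrpc symbols are rpc_wake_up_status,
xs_tcp_finish_connecting and xs_tcp_connect_worker4, which is consistent with
rpciod being busy in the TCP reconnect path.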

The lockup may be because I disconnected the cable from that client to stop
the packet storm, but the backtrace may still be useful.

Is there anything else I can do that would help with this problem?

Thanks in advance

--

Tomas Kasparek, PhD student E-mail: [email protected]
CVT FIT VUT Brno, L127 Web: http://www.fit.vutbr.cz/~kasparek
Bozetechova 1, 612 66 Fax: +420 54114-1270
Brno, Czech Republic Phone: +420 54114-1220

jabber: [email protected]
GPG: 2F1E 1AAF FD3B CFA3 1537 63BD DCBE 18FF A035 53BC



2009-04-29 12:12:38

by Steve Dickson

Subject: Re: NFS client packet storm on 2.6.27.x



Kasparek Tomas wrote:
> On Sat, Apr 18, 2009 at 07:17:39AM +0200, Kasparek Tomas wrote:
>> On Tue, Mar 03, 2009 at 09:16:07AM -0500, Trond Myklebust wrote:
>>> On Tue, 2009-03-03 at 13:08 +0100, Kasparek Tomas wrote:
>>>> On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
>>>>> A binary wireshark dump of the traffic between one such client and the
>>>>> server would help.
>>>> I was able to finally got the tcpdump. I got it from 2.6.27.19 client but
>>>> after several weeks without problems. I include the file and place it on
>>>> http://merlin.fit.vutbr.cz/tmp/nfs/dump_kas2_mat.dump_small (have over 1GB
>>>> of dump, but it's all the time the same SYN+RST packets). The packet rate
>>>> maxed at 260000pps from two clients.
>>>>
>>>> This dump is taken from server after reset (the server does not respond
>>>> even to keyboard) before clients are disconnected/rebooted. To remind it - all
>>>> clients seems to work well with reversed
>>>> e06799f958bf7f9f8fae15f0c6f519953fb0257c
>>> Yes. I saw that behaviour when testing at Connectathon last week. When
>>> one of the servers I was testing against crashed and later came up
>>> again, the patched client went into that same SYN+RST frenzy. I'm
>>> planning to look at this now that I'm back at home.
>> Hi, got a bit more data today as I get to the client early before it become
>> unresponsive.
>>
>>
>> The lockup may be because I disconnected the cable from that client to stop
>> the packet storm, but still the backtrace may be useful.
>>
>> Is there anything else I can do, that will help with this problem?
>
> Hi,
>
> (I changed the SUBJ to be more descriptive for current problem)
>
> I got another client lockup today. It was a desktop so I have some more
> dmesg warnings about soft lockup caused probably by network cable unplug
> (but hopefully still showing what happens in rpciod) on
>
> http://merlin.fit.vutbr.cz/tmp/nfs/pckas-dmesg
>
> I can check with top, that rpciod was using 100% cpu. I limited the flow
> from client to server with firewall so I was able to save the server and
> get some tcpdump -s0 data (actually RPC null with ERR response from server)
>
> Just to remind, the client is 2.6.27.21 (i386), the server is 2.6.16.62
> (x86_64).
>
> Please let me know if I can do anything more, this is really painful for
> me.

Try commenting out the tcp6/udp6 entries in /etc/netconfig.
This has helped in other places...
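
A minimal sketch of what commenting out those entries could look like, written
in Python and operating on the file contents as a string. The function name
and the sample lines are assumptions for illustration; the sample only
imitates the usual column layout of /etc/netconfig, so check the real file
(and keep a backup) before changing it.

```python
def comment_out_netids(netconfig_text, netids=("tcp6", "udp6")):
    """Prefix '#' to every entry whose first field is one of the given netids.

    Lines that are already comments, blank lines and other entries are
    returned unchanged.
    """
    result = []
    for line in netconfig_text.splitlines():
        fields = line.split()
        if fields and fields[0] in netids:
            result.append("#" + line)
        else:
            result.append(line)
    return "\n".join(result) + "\n"


# Illustrative sample only; the real /etc/netconfig may differ.
SAMPLE_NETCONFIG = """\
# network id, semantics, flags, protocol family, protocol, device, libraries
udp        tpi_clts      v     inet     udp     -       -
tcp        tpi_cots_ord  v     inet     tcp     -       -
udp6       tpi_clts      v     inet6    udp     -       -
tcp6       tpi_cots_ord  v     inet6    tcp     -       -
"""

if __name__ == "__main__":
    print(comment_out_netids(SAMPLE_NETCONFIG), end="")
```

Running the function a second time leaves the text unchanged, because an
already commented entry no longer starts with the bare tcp6 or udp6 field.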

steved.

2009-04-22 17:27:09

by Kasparek Tomas

Subject: NFS client packet storm on 2.6.27.x

On Sat, Apr 18, 2009 at 07:17:39AM +0200, Kasparek Tomas wrote:
> On Tue, Mar 03, 2009 at 09:16:07AM -0500, Trond Myklebust wrote:
> > On Tue, 2009-03-03 at 13:08 +0100, Kasparek Tomas wrote:
> > > On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
> > > > A binary wireshark dump of the traffic between one such client and the
> > > > server would help.
> > >
> > > I was able to finally got the tcpdump. I got it from 2.6.27.19 client but
> > > after several weeks without problems. I include the file and place it on
> > > http://merlin.fit.vutbr.cz/tmp/nfs/dump_kas2_mat.dump_small (have over 1GB
> > > of dump, but it's all the time the same SYN+RST packets). The packet rate
> > > maxed at 260000pps from two clients.
> > >
> > > This dump is taken from server after reset (the server does not respond
> > > even to keyboard) before clients are disconnected/rebooted. To remind it - all
> > > clients seems to work well with reversed
> > > e06799f958bf7f9f8fae15f0c6f519953fb0257c
> >
> > Yes. I saw that behaviour when testing at Connectathon last week. When
> > one of the servers I was testing against crashed and later came up
> > again, the patched client went into that same SYN+RST frenzy. I'm
> > planning to look at this now that I'm back at home.
>
> Hi, got a bit more data today as I get to the client early before it become
> unresponsive.
>
>
> The lockup may be because I disconnected the cable from that client to stop
> the packet storm, but still the backtrace may be useful.
>
> Is there anything else I can do, that will help with this problem?

Hi,

(I changed the subject to be more descriptive of the current problem.)

I got another client lockup today. It was a desktop, so I have some more
dmesg warnings about soft lockups, probably caused by unplugging the network
cable (but hopefully still showing what happens in rpciod), at

http://merlin.fit.vutbr.cz/tmp/nfs/pckas-dmesg

I could see with top that rpciod was using 100% CPU. I limited the traffic
from the client to the server with a firewall, so I was able to save the
server and capture some tcpdump -s0 data (actually RPC NULL calls with an ERR
response from the server).

As a reminder, the client is 2.6.27.21 (i386) and the server is 2.6.16.62
(x86_64).

Please let me know if I can do anything more; this is really painful for
me.

--

Tomas Kasparek, PhD student E-mail: [email protected]
CVT FIT VUT Brno, L127 Web: http://www.fit.vutbr.cz/~kasparek
Bozetechova 1, 612 66 Fax: +420 54114-1270
Brno, Czech Republic Phone: +420 54114-1220

jabber: [email protected]
GPG: 2F1E 1AAF FD3B CFA3 1537 63BD DCBE 18FF A035 53BC


2009-06-25 06:10:14

by Kasparek Tomas

Subject: Re: NFS client packet storm on 2.6.27.x

On Wed, Apr 22, 2009 at 07:27:07PM +0200, Kasparek Tomas wrote:
> I got another client lockup today. It was a desktop so I have some more
> dmesg warnings about soft lockup caused probably by network cable unplug
> (but hopefully still showing what happens in rpciod) on
>
> http://merlin.fit.vutbr.cz/tmp/nfs/pckas-dmesg
>
> I can check with top, that rpciod was using 100% cpu. I limited the flow
> from client to server with firewall so I was able to save the server and
> get some tcpdump -s0 data (actually RPC null with ERR response from server)
>
> Just to remind, the client is 2.6.27.21 (i386), the server is 2.6.16.62
> (x86_64).

Hi, I have been experimenting with the patches from

http://www.linux-nfs.org/Linux-2.6.x/2.6.27/

and found that

.../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
.../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif

change the locking behaviour from long or endless lockups to lockups of 1-2
seconds, and there seem to be fewer situations in which it locks up.

The packet storms have not recurred so far since I switched to the 2.6.27.24
(and .25) kernels, so they may also have been fixed by some other patch in .24.

Together with the tcp_linger patch, this seems to improve the situation a lot,
to the point where it is possible for me to use 2.6.27.x kernels.

Trond, would it be possible to get tcp_linger and the two patches above into
the 2.6.27.x stable queue so that others get these fixes?

Many thanks to all for your help.

--

Tomas Kasparek, PhD student E-mail: [email protected]
CVT FIT VUT Brno, L127 Web: http://www.fit.vutbr.cz/~kasparek
Bozetechova 1, 612 66 Fax: +420 54114-1270
Brno, Czech Republic Phone: +420 54114-1220

jabber: [email protected]
GPG: 2F1E 1AAF FD3B CFA3 1537 63BD DCBE 18FF A035 53BC


2009-07-13 17:40:24

by Trond Myklebust

Subject: Re: [stable] NFS client packet storm on 2.6.27.x

On Mon, 2009-07-13 at 10:20 -0700, Greg KH wrote:
> On Mon, Jul 13, 2009 at 01:12:15PM +0200, Kasparek Tomas wrote:
> > On Thu, Jun 25, 2009 at 07:55:32AM +0200, Kasparek Tomas wrote:
> > > http://www.linux-nfs.org/Linux-2.6.x/2.6.27/
> > >
> > > .../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
> > > .../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif
> > >
> > > ...
> > > Together with tcp_linger patch it seems to improve the situation a lot to
> > > state when it is possible for me to use 2.6.27.x kernels.
> > >
> > > Trond, will it be possible to get tcp_linger and the upper two patches to
> > > 2.6.27.x stable queue so others get these fixes?
> >
> > I got no response so far, is there someone who could advise me what to do
> > to get above patches to 2.6.27.x stable? Or is there some reason, why this
> > is not possible at all?
>
> Can you backport them and send them to [email protected] with the git
> commit ids of the original patches?

When testing at a customer site, we recently found that we actually need
4 patches in order to completely fix the storming problem in 2.6.27.y,
and ensure a timely socket reconnect.

commit 15f081ca8ddfe150fb639c591b18944a539da0fc (SUNRPC: Avoid an
unnecessary task reschedule on ENOTCONN)

commit 670f94573104b4a25525d3fcdcd6496c678df172 (SUNRPC: Ensure we set
XPRT_CLOSING only after we've sent a tcp FIN...)

commit 40d2549db5f515e415894def98b49db7d4c56714 (SUNRPC: Don't
disconnect if a connection is still in progress.)

and finally

commit f75e6745aa3084124ae1434fd7629853bdaf6798 (SUNRPC: Fix the problem
of EADDRNOTAVAIL syslog floods on reconnect)

The first three need to be applied to kernels 2.6.27.y to 2.6.29.y,
while the last needs to be applied to 2.6.27.y to 2.6.30.y.
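
The applicability rules above can be written down as a small lookup, sketched
here in Python. The commit ids and titles are the ones listed above (the
elided title is kept as written); the table layout, the helper names and the
use of "git cherry-pick -x" as a backport workflow are assumptions made for
illustration, not a description of how the patches were actually queued.

```python
# (commit id, title, last stable series that still needs the patch)
PATCHES = [
    ("15f081ca8ddfe150fb639c591b18944a539da0fc",
     "SUNRPC: Avoid an unnecessary task reschedule on ENOTCONN", (2, 6, 29)),
    ("670f94573104b4a25525d3fcdcd6496c678df172",
     "SUNRPC: Ensure we set XPRT_CLOSING only after we've sent a tcp FIN...",
     (2, 6, 29)),
    ("40d2549db5f515e415894def98b49db7d4c56714",
     "SUNRPC: Don't disconnect if a connection is still in progress.",
     (2, 6, 29)),
    ("f75e6745aa3084124ae1434fd7629853bdaf6798",
     "SUNRPC: Fix the problem of EADDRNOTAVAIL syslog floods on reconnect",
     (2, 6, 30)),
]

# Earliest series named in the message above.
FIRST_AFFECTED = (2, 6, 27)


def parse_series(series):
    """Turn '2.6.27.y' or '2.6.27.24' into the tuple (2, 6, 27)."""
    return tuple(int(part) for part in series.split(".")[:3])


def patches_for(series):
    """Return the commit ids needed for a stable series, in the listed order."""
    version = parse_series(series)
    return [sha for sha, _title, last in PATCHES
            if FIRST_AFFECTED <= version <= last]


def cherry_pick_commands(series):
    """Format one 'git cherry-pick -x' command per needed commit.

    The -x option records the original commit id in the new commit message.
    """
    return ["git cherry-pick -x " + sha for sha in patches_for(series)]


if __name__ == "__main__":
    for command in cherry_pick_commands("2.6.27.y"):
        print(command)
```

With this table, 2.6.27.y through 2.6.29.y get all four commits, 2.6.30.y gets
only the last one, and later series get none. A real backport may still need
manual conflict resolution for each series.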

I'll ensure the patches get sent to [email protected]...

Trond


2009-07-13 11:13:15

by Kasparek Tomas

Subject: Re: NFS client packet storm on 2.6.27.x

On Thu, Jun 25, 2009 at 07:55:32AM +0200, Kasparek Tomas wrote:
> http://www.linux-nfs.org/Linux-2.6.x/2.6.27/
>
> .../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
> .../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif
>
> ...
> Together with tcp_linger patch it seems to improve the situation a lot to
> state when it is possible for me to use 2.6.27.x kernels.
>
> Trond, will it be possible to get tcp_linger and the upper two patches to
> 2.6.27.x stable queue so others get these fixes?

I have had no response so far. Is there someone who could advise me on what to
do to get the above patches into 2.6.27.x stable? Or is there some reason why
this is not possible at all?

Thanks for any advice or suggestions.

--

Tomas Kasparek, PhD student E-mail: [email protected]
CVT FIT VUT Brno, L127 Web: http://www.fit.vutbr.cz/~kasparek
Bozetechova 1, 612 66 Fax: +420 54114-1270
Brno, Czech Republic Phone: +420 54114-1220

jabber: [email protected]
GPG: 2F1E 1AAF FD3B CFA3 1537 63BD DCBE 18FF A035 53BC


2009-07-13 17:23:31

by Greg KH

Subject: Re: [stable] NFS client packet storm on 2.6.27.x

On Mon, Jul 13, 2009 at 01:12:15PM +0200, Kasparek Tomas wrote:
> On Thu, Jun 25, 2009 at 07:55:32AM +0200, Kasparek Tomas wrote:
> > http://www.linux-nfs.org/Linux-2.6.x/2.6.27/
> >
> > .../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
> > .../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif
> >
> > ...
> > Together with tcp_linger patch it seems to improve the situation a lot to
> > state when it is possible for me to use 2.6.27.x kernels.
> >
> > Trond, will it be possible to get tcp_linger and the upper two patches to
> > 2.6.27.x stable queue so others get these fixes?
>
> I got no response so far, is there someone who could advise me what to do
> to get above patches to 2.6.27.x stable? Or is there some reason, why this
> is not possible at all?

Can you backport them and send them to [email protected] with the git
commit ids of the original patches?

thanks,

greg k-h