2012-05-28 22:30:03

by Hans de Bruin

[permalink] [raw]
Subject: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
some trouble. My laptop, a nfsroot client, will not run firefox and
thunderbird anymore. When I start these programs from an xterm, the
cursor goes to the next line and waits indefinitely.

I do not know if there is any order is lsof's output. A lsof | grep
firefox or thunderbird shows ......./.parentlock as the last line.

It does not matter whether the client is running a 3.4.0 or a 3.3.0
kernel, or if the server is running on top of xen or not.

There is some noise in the servers dmesg:

[ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
[ 241.256691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 241.256698] kworker/u:2 D 000000000000000c 0 801 2
0x00000000
[ 241.256710] ffff8801390e5cf0 0000000000000046 0000000000012d00
0000000000012d00
[ 241.256721] 0000000000012d00 ffff880139f8bd50 0000000000012d00
ffff8801390e5fd8
[ 241.256732] ffff8801390e5fd8 0000000000012d00 ffff880139ce4420
ffff880139f8bd50
[ 241.256743] Call Trace:
[ 241.256759] [<ffffffff8158733e>] schedule+0x64/0x66
[ 241.256769] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
[ 241.256780] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
[ 241.256789] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
[ 241.256798] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
[ 241.256807] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
[ 241.256818] [<ffffffff8106a06b>] ? need_resched+0x1e/0x28
[ 241.256826] [<ffffffff8158725d>] ? __schedule+0x49d/0x4b5
[ 241.256835] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
[ 241.256844] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
[ 241.256854] [<ffffffff8105ca67>] worker_thread+0x105/0x189
[ 241.256862] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[ 241.256872] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
[ 241.256881] [<ffffffff810604b0>] kthread+0x8a/0x92
[ 241.256891] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
[ 241.256900] [<ffffffff81060426>] ?
kthread_freezable_should_stop+0x47/0x47
[ 241.256909] [<ffffffff815899a0>] ? gs_change+0x13/0x13

or xenified:


[ 240.563448] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 240.563458] kworker/u:2 D ffff88007fc12d00 0 808 2
0x00000000
[ 240.563479] ffff88007532fcf0 0000000000000246 0000000000012d00
0000000000012d00
[ 240.563504] 0000000000012d00 ffff880075f7caf0 0000000000012d00
ffff88007532ffd8
[ 240.563530] ffff88007532ffd8 0000000000012d00 ffffffff81813020
ffff880075f7caf0
[ 240.563555] Call Trace:
[ 240.563578] [<ffffffff8158733e>] schedule+0x64/0x66
[ 240.563594] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
[ 240.563611] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
[ 240.563625] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
[ 240.563640] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
[ 240.563654] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
[ 240.563670] [<ffffffff8100d085>] ? xen_spin_unlock+0x12/0x30
[ 240.563685] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
[ 240.563700] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
[ 240.563714] [<ffffffff8100d337>] ? xen_spin_lock+0xb/0xd
[ 240.563729] [<ffffffff8105ca67>] worker_thread+0x105/0x189
[ 240.563743] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[ 240.563758] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
[ 240.563772] [<ffffffff810604b0>] kthread+0x8a/0x92
[ 240.563787] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
[ 240.563802] [<ffffffff81587f38>] ? retint_restore_args+0x5/0x6
[ 240.563816] [<ffffffff815899a0>] ? gs_change+0x13/0x13


--
Hans



2012-06-14 01:37:18

by Jeff Layton

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On Wed, 13 Jun 2012 20:18:57 +0200
Hans de Bruin <[email protected]> wrote:

> On 06/11/2012 01:11 PM, Jeff Layton wrote:
> > On Sun, 10 Jun 2012 15:56:11 +0200
> > Hans de Bruin<[email protected]> wrote:
> >
> >> On 06/10/2012 11:52 AM, Jeff Layton wrote:
> >>> On Tue, 29 May 2012 00:19:34 +0200
> >>> Hans de Bruin<[email protected]> wrote:
> >>>
> >>>> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
> >>>> some trouble. My laptop, a nfsroot client, will not run firefox and
> >>>> thunderbird anymore. When I start these programs from an xterm, the
> >>>> cursor goes to the next line and waits indefinitely.
> >>>>
> >>>> I do not know if there is any order is lsof's output. A lsof | grep
> >>>> firefox or thunderbird shows ......./.parentlock as the last line.
> >>>>
> >>>> It does not matter whether the client is running a 3.4.0 or a 3.3.0
> >>>> kernel, or if the server is running on top of xen or not.
> >>>>
> >>>> There is some noise in the servers dmesg:
> >>>>
> >>>> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
> >>>> [ 241.256691] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
> >>>> disables this message.
> >>>> [ 241.256698] kworker/u:2 D 000000000000000c 0 801 2
> >>>> 0x00000000
> >>>> [ 241.256710] ffff8801390e5cf0 0000000000000046 0000000000012d00
> >>>> 0000000000012d00
> >>>> [ 241.256721] 0000000000012d00 ffff880139f8bd50 0000000000012d00
> >>>> ffff8801390e5fd8
> >>>> [ 241.256732] ffff8801390e5fd8 0000000000012d00 ffff880139ce4420
> >>>> ffff880139f8bd50
> >>>> [ 241.256743] Call Trace:
> >>>> [ 241.256759] [<ffffffff8158733e>] schedule+0x64/0x66
> >>>> [ 241.256769] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> >>>> [ 241.256780] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >>>> [ 241.256789] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> >>>> [ 241.256798] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> >>>> [ 241.256807] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> >>>> [ 241.256818] [<ffffffff8106a06b>] ? need_resched+0x1e/0x28
> >>>> [ 241.256826] [<ffffffff8158725d>] ? __schedule+0x49d/0x4b5
> >>>> [ 241.256835] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >>>> [ 241.256844] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> >>>> [ 241.256854] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> >>>> [ 241.256862] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> >>>> [ 241.256872] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> >>>> [ 241.256881] [<ffffffff810604b0>] kthread+0x8a/0x92
> >>>> [ 241.256891] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> >>>> [ 241.256900] [<ffffffff81060426>] ?
> >>>> kthread_freezable_should_stop+0x47/0x47
> >>>> [ 241.256909] [<ffffffff815899a0>] ? gs_change+0x13/0x13
> >>>>
> >>>> or xenified:
> >>>>
> >>>>
> >>>> [ 240.563448] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
> >>>> disables this message.
> >>>> [ 240.563458] kworker/u:2 D ffff88007fc12d00 0 808 2
> >>>> 0x00000000
> >>>> [ 240.563479] ffff88007532fcf0 0000000000000246 0000000000012d00
> >>>> 0000000000012d00
> >>>> [ 240.563504] 0000000000012d00 ffff880075f7caf0 0000000000012d00
> >>>> ffff88007532ffd8
> >>>> [ 240.563530] ffff88007532ffd8 0000000000012d00 ffffffff81813020
> >>>> ffff880075f7caf0
> >>>> [ 240.563555] Call Trace:
> >>>> [ 240.563578] [<ffffffff8158733e>] schedule+0x64/0x66
> >>>> [ 240.563594] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> >>>> [ 240.563611] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >>>> [ 240.563625] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> >>>> [ 240.563640] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> >>>> [ 240.563654] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> >>>> [ 240.563670] [<ffffffff8100d085>] ? xen_spin_unlock+0x12/0x30
> >>>> [ 240.563685] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >>>> [ 240.563700] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> >>>> [ 240.563714] [<ffffffff8100d337>] ? xen_spin_lock+0xb/0xd
> >>>> [ 240.563729] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> >>>> [ 240.563743] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> >>>> [ 240.563758] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> >>>> [ 240.563772] [<ffffffff810604b0>] kthread+0x8a/0x92
> >>>> [ 240.563787] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> >>>> [ 240.563802] [<ffffffff81587f38>] ? retint_restore_args+0x5/0x6
> >>>> [ 240.563816] [<ffffffff815899a0>] ? gs_change+0x13/0x13
> >>>>
> >>>>
> >>>
> >>> It sounds like you're not running the new nfsdcld daemon on the server,
> >>> and /var/lib/nfs/v4recovery does not exist. Is that correct?
> >>>
> >>
> >> Yes that correct. When I create the /var/lib/nfs/v4recovery directory my
> >> problems are gone.
> >>
> >> On the server and the client v3 and v4 are compiled into the kernel. Al
> >> mounts are v3. Should I either remove v4 from the kernels or upgrade
> >> nfs-utils-1.2.3 to something newer?
> >>
> >> (a search on the linux-nfs wiki for nfsdcld does not return any hits)
> >>
> >
> > Here's what I see on my machine when I remove the v4recovery dir and
> > don't run nfsdcld:
> >
> > [ 109.715080] NFSD: starting 90-second grace period
> > [ 229.984220] NFSD: Unable to end grace period: -110
> >
> > ...and -110 is -ETIMEDOUT. So, what happens is that nfsd kicks off the
> > laundromat job to run 90s after it starts. When it runs, it attempts to
> > talk to nfsdcld because the v4recovery dir isn't present. That fails
> > 30s later.
> >
> > Now, how that works is all rather convoluted. When we do the upcall, we
> > put the message on a rpc_pipefs pipe queue, and then set a delayed
> > workqueue job to walk that queue later and look for stuff that's been
> > sitting there for more than 30s. Anything that has will get the status
> > set to -ETIMEDOUT, and the task that submitted the upcall gets woken back up.
> >
> > It sounds like on your machine that the queue_timeout job isn't getting
> > run for some reason. That would leave the upcalling tasks sitting in
> > schedule for longer than 30s. Perhaps your workqueues are all blocking
> > and the cmwq infrastructure isn't able to spawn new ones?
> >
> > You may want to look at /proc/<pid>/stack for all of the kworker tasks
> > and see what they're doing...
> >
>
> Attached is the output of this script which ran at the server:
>
> #!/bin/bash
> cd /proc
> while :
> do
> echo >> /root/log
> date >> /root/log
> echo >> /root/log
> dmesg | tail -n 30 >> /root/log
> echo >> /root/log
> `ps ax|grep '[k]worker'| sed 's/ *\([0-9]*\) ?.*/tail \1\/stack/g'`
> >> /root/log
> sleep 20
> done
>
>
> It dumps /proc/'pid'/stack for al the worker threads.
>

I thought I had cc'ed you on the other thread, but I think we know what
the problem is now. This patch should fix it:

[PATCH] rpc_pipefs: allow rpc_purge_list to take a NULL waitq pointer

--
Jeff Layton <[email protected]>

2012-06-10 13:56:18

by Hans de Bruin

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On 06/10/2012 11:52 AM, Jeff Layton wrote:
> On Tue, 29 May 2012 00:19:34 +0200
> Hans de Bruin<[email protected]> wrote:
>
>> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
>> some trouble. My laptop, a nfsroot client, will not run firefox and
>> thunderbird anymore. When I start these programs from an xterm, the
>> cursor goes to the next line and waits indefinitely.
>>
>> I do not know if there is any order is lsof's output. A lsof | grep
>> firefox or thunderbird shows ......./.parentlock as the last line.
>>
>> It does not matter whether the client is running a 3.4.0 or a 3.3.0
>> kernel, or if the server is running on top of xen or not.
>>
>> There is some noise in the servers dmesg:
>>
>> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
>> [ 241.256691] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 241.256698] kworker/u:2 D 000000000000000c 0 801 2
>> 0x00000000
>> [ 241.256710] ffff8801390e5cf0 0000000000000046 0000000000012d00
>> 0000000000012d00
>> [ 241.256721] 0000000000012d00 ffff880139f8bd50 0000000000012d00
>> ffff8801390e5fd8
>> [ 241.256732] ffff8801390e5fd8 0000000000012d00 ffff880139ce4420
>> ffff880139f8bd50
>> [ 241.256743] Call Trace:
>> [ 241.256759] [<ffffffff8158733e>] schedule+0x64/0x66
>> [ 241.256769] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
>> [ 241.256780] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>> [ 241.256789] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
>> [ 241.256798] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
>> [ 241.256807] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
>> [ 241.256818] [<ffffffff8106a06b>] ? need_resched+0x1e/0x28
>> [ 241.256826] [<ffffffff8158725d>] ? __schedule+0x49d/0x4b5
>> [ 241.256835] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>> [ 241.256844] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
>> [ 241.256854] [<ffffffff8105ca67>] worker_thread+0x105/0x189
>> [ 241.256862] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
>> [ 241.256872] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
>> [ 241.256881] [<ffffffff810604b0>] kthread+0x8a/0x92
>> [ 241.256891] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
>> [ 241.256900] [<ffffffff81060426>] ?
>> kthread_freezable_should_stop+0x47/0x47
>> [ 241.256909] [<ffffffff815899a0>] ? gs_change+0x13/0x13
>>
>> or xenified:
>>
>>
>> [ 240.563448] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 240.563458] kworker/u:2 D ffff88007fc12d00 0 808 2
>> 0x00000000
>> [ 240.563479] ffff88007532fcf0 0000000000000246 0000000000012d00
>> 0000000000012d00
>> [ 240.563504] 0000000000012d00 ffff880075f7caf0 0000000000012d00
>> ffff88007532ffd8
>> [ 240.563530] ffff88007532ffd8 0000000000012d00 ffffffff81813020
>> ffff880075f7caf0
>> [ 240.563555] Call Trace:
>> [ 240.563578] [<ffffffff8158733e>] schedule+0x64/0x66
>> [ 240.563594] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
>> [ 240.563611] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>> [ 240.563625] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
>> [ 240.563640] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
>> [ 240.563654] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
>> [ 240.563670] [<ffffffff8100d085>] ? xen_spin_unlock+0x12/0x30
>> [ 240.563685] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>> [ 240.563700] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
>> [ 240.563714] [<ffffffff8100d337>] ? xen_spin_lock+0xb/0xd
>> [ 240.563729] [<ffffffff8105ca67>] worker_thread+0x105/0x189
>> [ 240.563743] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
>> [ 240.563758] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
>> [ 240.563772] [<ffffffff810604b0>] kthread+0x8a/0x92
>> [ 240.563787] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
>> [ 240.563802] [<ffffffff81587f38>] ? retint_restore_args+0x5/0x6
>> [ 240.563816] [<ffffffff815899a0>] ? gs_change+0x13/0x13
>>
>>
>
> It sounds like you're not running the new nfsdcld daemon on the server,
> and /var/lib/nfs/v4recovery does not exist. Is that correct?
>

Yes that correct. When I create the /var/lib/nfs/v4recovery directory my
problems are gone.

On the server and the client v3 and v4 are compiled into the kernel. Al
mounts are v3. Should I either remove v4 from the kernels or upgrade
nfs-utils-1.2.3 to something newer?

(a search on the linux-nfs wiki for nfsdcld does not return any hits)

--
Hans

2012-06-04 10:31:19

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

Hans de Bruin wrote:
> On 06/01/2012 09:11 PM, Hans de Bruin wrote:
>> On 05/29/2012 12:19 AM, Hans de Bruin wrote:
>>> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
>>> some trouble. My laptop, a nfsroot client, will not run firefox and
>>> thunderbird anymore. When I start these programs from an xterm, the
>>> cursor goes to the next line and waits indefinitely.
>>>
>>> I do not know if there is any order is lsof's output. A lsof | grep
>>> firefox or thunderbird shows ......./.parentlock as the last line.
>>>
>>> It does not matter whether the client is running a 3.4.0 or a 3.3.0
>>> kernel, or if the server is running on top of xen or not.
>>>
>>> There is some noise in the servers dmesg:
>>>
>>> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120
>>> seconds.
>>> [ 241.256691] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
>>
>> ...
>>
>> On a almost identical testsystem firefox en thunderbird segfault after
>> upgrading to 3.4.0. I would have been nice if it would behave exaclty
>> like my home server. I bisected the segfault to:
>>
>> commit 0fc9d1040313047edf6a39fd4d7c7defdca97c62
>> Author: Konstantin Khlebnikov<[email protected]>
>> Date: Wed Mar 28 14:42:54 2012 -0700
>>
>> radix-tree: use iterators in find_get_pages* functions
>>
>>
>> When I revert that on top of 3.4.0 the segfaults are gone but both
>> firefox en thunderbird go in the lets wait indefinitely mode like the
>> homeserver.
>>
>> I am going to make a bit-wise copy from from my homeserver to my
>> testserver and try again.
>>
>
> The bit-wise copy also segfaults firefox and thunderbird at the same
> commit.
>

I think bug somewhere in NFS, that patch only highlighted it.
Please, try to run it with debug patch from attachment.


Attachments:
mm-debug-fing_get_pages-speculative-restart (1.75 kB)

2012-06-13 18:19:20

by Hans de Bruin

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On 06/11/2012 01:11 PM, Jeff Layton wrote:
> On Sun, 10 Jun 2012 15:56:11 +0200
> Hans de Bruin<[email protected]> wrote:
>
>> On 06/10/2012 11:52 AM, Jeff Layton wrote:
>>> On Tue, 29 May 2012 00:19:34 +0200
>>> Hans de Bruin<[email protected]> wrote:
>>>
>>>> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
>>>> some trouble. My laptop, a nfsroot client, will not run firefox and
>>>> thunderbird anymore. When I start these programs from an xterm, the
>>>> cursor goes to the next line and waits indefinitely.
>>>>
>>>> I do not know if there is any order is lsof's output. A lsof | grep
>>>> firefox or thunderbird shows ......./.parentlock as the last line.
>>>>
>>>> It does not matter whether the client is running a 3.4.0 or a 3.3.0
>>>> kernel, or if the server is running on top of xen or not.
>>>>
>>>> There is some noise in the servers dmesg:
>>>>
>>>> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
>>>> [ 241.256691] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
>>>> disables this message.
>>>> [ 241.256698] kworker/u:2 D 000000000000000c 0 801 2
>>>> 0x00000000
>>>> [ 241.256710] ffff8801390e5cf0 0000000000000046 0000000000012d00
>>>> 0000000000012d00
>>>> [ 241.256721] 0000000000012d00 ffff880139f8bd50 0000000000012d00
>>>> ffff8801390e5fd8
>>>> [ 241.256732] ffff8801390e5fd8 0000000000012d00 ffff880139ce4420
>>>> ffff880139f8bd50
>>>> [ 241.256743] Call Trace:
>>>> [ 241.256759] [<ffffffff8158733e>] schedule+0x64/0x66
>>>> [ 241.256769] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
>>>> [ 241.256780] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>>>> [ 241.256789] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
>>>> [ 241.256798] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
>>>> [ 241.256807] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
>>>> [ 241.256818] [<ffffffff8106a06b>] ? need_resched+0x1e/0x28
>>>> [ 241.256826] [<ffffffff8158725d>] ? __schedule+0x49d/0x4b5
>>>> [ 241.256835] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>>>> [ 241.256844] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
>>>> [ 241.256854] [<ffffffff8105ca67>] worker_thread+0x105/0x189
>>>> [ 241.256862] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
>>>> [ 241.256872] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
>>>> [ 241.256881] [<ffffffff810604b0>] kthread+0x8a/0x92
>>>> [ 241.256891] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
>>>> [ 241.256900] [<ffffffff81060426>] ?
>>>> kthread_freezable_should_stop+0x47/0x47
>>>> [ 241.256909] [<ffffffff815899a0>] ? gs_change+0x13/0x13
>>>>
>>>> or xenified:
>>>>
>>>>
>>>> [ 240.563448] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
>>>> disables this message.
>>>> [ 240.563458] kworker/u:2 D ffff88007fc12d00 0 808 2
>>>> 0x00000000
>>>> [ 240.563479] ffff88007532fcf0 0000000000000246 0000000000012d00
>>>> 0000000000012d00
>>>> [ 240.563504] 0000000000012d00 ffff880075f7caf0 0000000000012d00
>>>> ffff88007532ffd8
>>>> [ 240.563530] ffff88007532ffd8 0000000000012d00 ffffffff81813020
>>>> ffff880075f7caf0
>>>> [ 240.563555] Call Trace:
>>>> [ 240.563578] [<ffffffff8158733e>] schedule+0x64/0x66
>>>> [ 240.563594] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
>>>> [ 240.563611] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>>>> [ 240.563625] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
>>>> [ 240.563640] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
>>>> [ 240.563654] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
>>>> [ 240.563670] [<ffffffff8100d085>] ? xen_spin_unlock+0x12/0x30
>>>> [ 240.563685] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
>>>> [ 240.563700] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
>>>> [ 240.563714] [<ffffffff8100d337>] ? xen_spin_lock+0xb/0xd
>>>> [ 240.563729] [<ffffffff8105ca67>] worker_thread+0x105/0x189
>>>> [ 240.563743] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
>>>> [ 240.563758] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
>>>> [ 240.563772] [<ffffffff810604b0>] kthread+0x8a/0x92
>>>> [ 240.563787] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
>>>> [ 240.563802] [<ffffffff81587f38>] ? retint_restore_args+0x5/0x6
>>>> [ 240.563816] [<ffffffff815899a0>] ? gs_change+0x13/0x13
>>>>
>>>>
>>>
>>> It sounds like you're not running the new nfsdcld daemon on the server,
>>> and /var/lib/nfs/v4recovery does not exist. Is that correct?
>>>
>>
>> Yes that correct. When I create the /var/lib/nfs/v4recovery directory my
>> problems are gone.
>>
>> On the server and the client v3 and v4 are compiled into the kernel. Al
>> mounts are v3. Should I either remove v4 from the kernels or upgrade
>> nfs-utils-1.2.3 to something newer?
>>
>> (a search on the linux-nfs wiki for nfsdcld does not return any hits)
>>
>
> Here's what I see on my machine when I remove the v4recovery dir and
> don't run nfsdcld:
>
> [ 109.715080] NFSD: starting 90-second grace period
> [ 229.984220] NFSD: Unable to end grace period: -110
>
> ...and -110 is -ETIMEDOUT. So, what happens is that nfsd kicks off the
> laundromat job to run 90s after it starts. When it runs, it attempts to
> talk to nfsdcld because the v4recovery dir isn't present. That fails
> 30s later.
>
> Now, how that works is all rather convoluted. When we do the upcall, we
> put the message on a rpc_pipefs pipe queue, and then set a delayed
> workqueue job to walk that queue later and look for stuff that's been
> sitting there for more than 30s. Anything that has will get the status
> set to -ETIMEDOUT, and the task that submitted the upcall gets woken back up.
>
> It sounds like on your machine that the queue_timeout job isn't getting
> run for some reason. That would leave the upcalling tasks sitting in
> schedule for longer than 30s. Perhaps your workqueues are all blocking
> and the cmwq infrastructure isn't able to spawn new ones?
>
> You may want to look at /proc/<pid>/stack for all of the kworker tasks
> and see what they're doing...
>

Attached is the output of this script which ran at the server:

#!/bin/bash
cd /proc
while :
do
echo >> /root/log
date >> /root/log
echo >> /root/log
dmesg | tail -n 30 >> /root/log
echo >> /root/log
`ps ax|grep '[k]worker'| sed 's/ *\([0-9]*\) ?.*/tail \1\/stack/g'`
>> /root/log
sleep 20
done


It dumps /proc/'pid'/stack for al the worker threads.

--
Hans




Attachments:
log (88.20 kB)

2012-06-11 11:11:46

by Jeff Layton

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On Sun, 10 Jun 2012 15:56:11 +0200
Hans de Bruin <[email protected]> wrote:

> On 06/10/2012 11:52 AM, Jeff Layton wrote:
> > On Tue, 29 May 2012 00:19:34 +0200
> > Hans de Bruin<[email protected]> wrote:
> >
> >> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
> >> some trouble. My laptop, a nfsroot client, will not run firefox and
> >> thunderbird anymore. When I start these programs from an xterm, the
> >> cursor goes to the next line and waits indefinitely.
> >>
> >> I do not know if there is any order is lsof's output. A lsof | grep
> >> firefox or thunderbird shows ......./.parentlock as the last line.
> >>
> >> It does not matter whether the client is running a 3.4.0 or a 3.3.0
> >> kernel, or if the server is running on top of xen or not.
> >>
> >> There is some noise in the servers dmesg:
> >>
> >> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
> >> [ 241.256691] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [ 241.256698] kworker/u:2 D 000000000000000c 0 801 2
> >> 0x00000000
> >> [ 241.256710] ffff8801390e5cf0 0000000000000046 0000000000012d00
> >> 0000000000012d00
> >> [ 241.256721] 0000000000012d00 ffff880139f8bd50 0000000000012d00
> >> ffff8801390e5fd8
> >> [ 241.256732] ffff8801390e5fd8 0000000000012d00 ffff880139ce4420
> >> ffff880139f8bd50
> >> [ 241.256743] Call Trace:
> >> [ 241.256759] [<ffffffff8158733e>] schedule+0x64/0x66
> >> [ 241.256769] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> >> [ 241.256780] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 241.256789] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> >> [ 241.256798] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> >> [ 241.256807] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> >> [ 241.256818] [<ffffffff8106a06b>] ? need_resched+0x1e/0x28
> >> [ 241.256826] [<ffffffff8158725d>] ? __schedule+0x49d/0x4b5
> >> [ 241.256835] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 241.256844] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> >> [ 241.256854] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> >> [ 241.256862] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> >> [ 241.256872] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> >> [ 241.256881] [<ffffffff810604b0>] kthread+0x8a/0x92
> >> [ 241.256891] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> >> [ 241.256900] [<ffffffff81060426>] ?
> >> kthread_freezable_should_stop+0x47/0x47
> >> [ 241.256909] [<ffffffff815899a0>] ? gs_change+0x13/0x13
> >>
> >> or xenified:
> >>
> >>
> >> [ 240.563448] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [ 240.563458] kworker/u:2 D ffff88007fc12d00 0 808 2
> >> 0x00000000
> >> [ 240.563479] ffff88007532fcf0 0000000000000246 0000000000012d00
> >> 0000000000012d00
> >> [ 240.563504] 0000000000012d00 ffff880075f7caf0 0000000000012d00
> >> ffff88007532ffd8
> >> [ 240.563530] ffff88007532ffd8 0000000000012d00 ffffffff81813020
> >> ffff880075f7caf0
> >> [ 240.563555] Call Trace:
> >> [ 240.563578] [<ffffffff8158733e>] schedule+0x64/0x66
> >> [ 240.563594] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> >> [ 240.563611] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 240.563625] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> >> [ 240.563640] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> >> [ 240.563654] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> >> [ 240.563670] [<ffffffff8100d085>] ? xen_spin_unlock+0x12/0x30
> >> [ 240.563685] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 240.563700] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> >> [ 240.563714] [<ffffffff8100d337>] ? xen_spin_lock+0xb/0xd
> >> [ 240.563729] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> >> [ 240.563743] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> >> [ 240.563758] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> >> [ 240.563772] [<ffffffff810604b0>] kthread+0x8a/0x92
> >> [ 240.563787] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> >> [ 240.563802] [<ffffffff81587f38>] ? retint_restore_args+0x5/0x6
> >> [ 240.563816] [<ffffffff815899a0>] ? gs_change+0x13/0x13
> >>
> >>
> >
> > It sounds like you're not running the new nfsdcld daemon on the server,
> > and /var/lib/nfs/v4recovery does not exist. Is that correct?
> >
>
> Yes that correct. When I create the /var/lib/nfs/v4recovery directory my
> problems are gone.
>
> On the server and the client v3 and v4 are compiled into the kernel. Al
> mounts are v3. Should I either remove v4 from the kernels or upgrade
> nfs-utils-1.2.3 to something newer?
>
> (a search on the linux-nfs wiki for nfsdcld does not return any hits)
>

Here's what I see on my machine when I remove the v4recovery dir and
don't run nfsdcld:

[ 109.715080] NFSD: starting 90-second grace period
[ 229.984220] NFSD: Unable to end grace period: -110

...and -110 is -ETIMEDOUT. So, what happens is that nfsd kicks off the
laundromat job to run 90s after it starts. When it runs, it attempts to
talk to nfsdcld because the v4recovery dir isn't present. That fails
30s later.

Now, how that works is all rather convoluted. When we do the upcall, we
put the message on a rpc_pipefs pipe queue, and then set a delayed
workqueue job to walk that queue later and look for stuff that's been
sitting there for more than 30s. Anything that has will get the status
set to -ETIMEDOUT, and the task that submitted the upcall gets woken back up.

It sounds like on your machine that the queue_timeout job isn't getting
run for some reason. That would leave the upcalling tasks sitting in
schedule for longer than 30s. Perhaps your workqueues are all blocking
and the cmwq infrastructure isn't able to spawn new ones?

You may want to look at /proc/<pid>/stack for all of the kworker tasks
and see what they're doing...

--
Jeff Layton <[email protected]>

2012-06-11 10:22:51

by Jeff Layton

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On Sun, 10 Jun 2012 15:56:11 +0200
Hans de Bruin <[email protected]> wrote:

> On 06/10/2012 11:52 AM, Jeff Layton wrote:
> > On Tue, 29 May 2012 00:19:34 +0200
> > Hans de Bruin<[email protected]> wrote:
> >
> >> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
> >> some trouble. My laptop, a nfsroot client, will not run firefox and
> >> thunderbird anymore. When I start these programs from an xterm, the
> >> cursor goes to the next line and waits indefinitely.
> >>
> >> I do not know if there is any order is lsof's output. A lsof | grep
> >> firefox or thunderbird shows ......./.parentlock as the last line.
> >>
> >> It does not matter whether the client is running a 3.4.0 or a 3.3.0
> >> kernel, or if the server is running on top of xen or not.
> >>
> >> There is some noise in the servers dmesg:
> >>
> >> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
> >> [ 241.256691] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [ 241.256698] kworker/u:2 D 000000000000000c 0 801 2
> >> 0x00000000
> >> [ 241.256710] ffff8801390e5cf0 0000000000000046 0000000000012d00
> >> 0000000000012d00
> >> [ 241.256721] 0000000000012d00 ffff880139f8bd50 0000000000012d00
> >> ffff8801390e5fd8
> >> [ 241.256732] ffff8801390e5fd8 0000000000012d00 ffff880139ce4420
> >> ffff880139f8bd50
> >> [ 241.256743] Call Trace:
> >> [ 241.256759] [<ffffffff8158733e>] schedule+0x64/0x66
> >> [ 241.256769] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> >> [ 241.256780] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 241.256789] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> >> [ 241.256798] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> >> [ 241.256807] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> >> [ 241.256818] [<ffffffff8106a06b>] ? need_resched+0x1e/0x28
> >> [ 241.256826] [<ffffffff8158725d>] ? __schedule+0x49d/0x4b5
> >> [ 241.256835] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 241.256844] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> >> [ 241.256854] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> >> [ 241.256862] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> >> [ 241.256872] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> >> [ 241.256881] [<ffffffff810604b0>] kthread+0x8a/0x92
> >> [ 241.256891] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> >> [ 241.256900] [<ffffffff81060426>] ?
> >> kthread_freezable_should_stop+0x47/0x47
> >> [ 241.256909] [<ffffffff815899a0>] ? gs_change+0x13/0x13
> >>
> >> or xenified:
> >>
> >>
> >> [ 240.563448] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [ 240.563458] kworker/u:2 D ffff88007fc12d00 0 808 2
> >> 0x00000000
> >> [ 240.563479] ffff88007532fcf0 0000000000000246 0000000000012d00
> >> 0000000000012d00
> >> [ 240.563504] 0000000000012d00 ffff880075f7caf0 0000000000012d00
> >> ffff88007532ffd8
> >> [ 240.563530] ffff88007532ffd8 0000000000012d00 ffffffff81813020
> >> ffff880075f7caf0
> >> [ 240.563555] Call Trace:
> >> [ 240.563578] [<ffffffff8158733e>] schedule+0x64/0x66
> >> [ 240.563594] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> >> [ 240.563611] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 240.563625] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> >> [ 240.563640] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> >> [ 240.563654] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> >> [ 240.563670] [<ffffffff8100d085>] ? xen_spin_unlock+0x12/0x30
> >> [ 240.563685] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> >> [ 240.563700] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> >> [ 240.563714] [<ffffffff8100d337>] ? xen_spin_lock+0xb/0xd
> >> [ 240.563729] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> >> [ 240.563743] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> >> [ 240.563758] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> >> [ 240.563772] [<ffffffff810604b0>] kthread+0x8a/0x92
> >> [ 240.563787] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> >> [ 240.563802] [<ffffffff81587f38>] ? retint_restore_args+0x5/0x6
> >> [ 240.563816] [<ffffffff815899a0>] ? gs_change+0x13/0x13
> >>
> >>
> >
> > It sounds like you're not running the new nfsdcld daemon on the server,
> > and /var/lib/nfs/v4recovery does not exist. Is that correct?
> >
>
> Yes that correct. When I create the /var/lib/nfs/v4recovery directory my
> problems are gone.
>
> On the server and the client v3 and v4 are compiled into the kernel. Al
> mounts are v3. Should I either remove v4 from the kernels or upgrade
> nfs-utils-1.2.3 to something newer?
>
> (a search on the linux-nfs wiki for nfsdcld does not return any hits)
>


Just creating the legacy state tracking directory should be enough to
work around the problem for now. Eventually you'll want to move to a
newer nfs-utils (1.2.5 or later) and run nfsdcld to handle the v4
client tracking. Alternately, you can just build the kernel without v4
support, but that's really not necessary.

Now, that said...this upcall should be timing out in 30s, so it's not
clear to me why the daemon is hanging for 120s. I'll have to see if I
can reproduce this and track down the problem.

--
Jeff Layton <[email protected]>

2012-06-10 09:52:13

by Jeff Layton

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On Tue, 29 May 2012 00:19:34 +0200
Hans de Bruin <[email protected]> wrote:

> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
> some trouble. My laptop, a nfsroot client, will not run firefox and
> thunderbird anymore. When I start these programs from an xterm, the
> cursor goes to the next line and waits indefinitely.
>
> I do not know if there is any order is lsof's output. A lsof | grep
> firefox or thunderbird shows ......./.parentlock as the last line.
>
> It does not matter whether the client is running a 3.4.0 or a 3.3.0
> kernel, or if the server is running on top of xen or not.
>
> There is some noise in the servers dmesg:
>
> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
> [ 241.256691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 241.256698] kworker/u:2 D 000000000000000c 0 801 2
> 0x00000000
> [ 241.256710] ffff8801390e5cf0 0000000000000046 0000000000012d00
> 0000000000012d00
> [ 241.256721] 0000000000012d00 ffff880139f8bd50 0000000000012d00
> ffff8801390e5fd8
> [ 241.256732] ffff8801390e5fd8 0000000000012d00 ffff880139ce4420
> ffff880139f8bd50
> [ 241.256743] Call Trace:
> [ 241.256759] [<ffffffff8158733e>] schedule+0x64/0x66
> [ 241.256769] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> [ 241.256780] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> [ 241.256789] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> [ 241.256798] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> [ 241.256807] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> [ 241.256818] [<ffffffff8106a06b>] ? need_resched+0x1e/0x28
> [ 241.256826] [<ffffffff8158725d>] ? __schedule+0x49d/0x4b5
> [ 241.256835] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> [ 241.256844] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> [ 241.256854] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> [ 241.256862] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> [ 241.256872] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> [ 241.256881] [<ffffffff810604b0>] kthread+0x8a/0x92
> [ 241.256891] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> [ 241.256900] [<ffffffff81060426>] ?
> kthread_freezable_should_stop+0x47/0x47
> [ 241.256909] [<ffffffff815899a0>] ? gs_change+0x13/0x13
>
> or xenified:
>
>
> [ 240.563448] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 240.563458] kworker/u:2 D ffff88007fc12d00 0 808 2
> 0x00000000
> [ 240.563479] ffff88007532fcf0 0000000000000246 0000000000012d00
> 0000000000012d00
> [ 240.563504] 0000000000012d00 ffff880075f7caf0 0000000000012d00
> ffff88007532ffd8
> [ 240.563530] ffff88007532ffd8 0000000000012d00 ffffffff81813020
> ffff880075f7caf0
> [ 240.563555] Call Trace:
> [ 240.563578] [<ffffffff8158733e>] schedule+0x64/0x66
> [ 240.563594] [<ffffffff8120184e>] cld_pipe_upcall+0x95/0xd1
> [ 240.563611] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> [ 240.563625] [<ffffffff81201d06>] nfsd4_cld_grace_done+0x50/0x8a
> [ 240.563640] [<ffffffff81202233>] nfsd4_record_grace_done+0x18/0x1a
> [ 240.563654] [<ffffffff811fbdd7>] laundromat_main+0x4a/0x213
> [ 240.563670] [<ffffffff8100d085>] ? xen_spin_unlock+0x12/0x30
> [ 240.563685] [<ffffffff811fbd8d>] ? nfsd4_exchange_id+0x23e/0x23e
> [ 240.563700] [<ffffffff8105be2d>] process_one_work+0x190/0x28d
> [ 240.563714] [<ffffffff8100d337>] ? xen_spin_lock+0xb/0xd
> [ 240.563729] [<ffffffff8105ca67>] worker_thread+0x105/0x189
> [ 240.563743] [<ffffffff81587b8d>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
> [ 240.563758] [<ffffffff8105c962>] ? manage_workers.clone.17+0x173/0x173
> [ 240.563772] [<ffffffff810604b0>] kthread+0x8a/0x92
> [ 240.563787] [<ffffffff815899a4>] kernel_thread_helper+0x4/0x10
> [ 240.563802] [<ffffffff81587f38>] ? retint_restore_args+0x5/0x6
> [ 240.563816] [<ffffffff815899a0>] ? gs_change+0x13/0x13
>
>

It sounds like you're not running the new nfsdcld daemon on the server,
and /var/lib/nfs/v4recovery does not exist. Is that correct?

--
Jeff Layton <[email protected]>

2012-06-01 19:12:01

by Hans de Bruin

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On 05/29/2012 12:19 AM, Hans de Bruin wrote:
> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
> some trouble. My laptop, a nfsroot client, will not run firefox and
> thunderbird anymore. When I start these programs from an xterm, the
> cursor goes to the next line and waits indefinitely.
>
> I do not know if there is any order is lsof's output. A lsof | grep
> firefox or thunderbird shows ......./.parentlock as the last line.
>
> It does not matter whether the client is running a 3.4.0 or a 3.3.0
> kernel, or if the server is running on top of xen or not.
>
> There is some noise in the servers dmesg:
>
> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120 seconds.
> [ 241.256691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"

...

On a almost identical testsystem firefox en thunderbird segfault after
upgrading to 3.4.0. I would have been nice if it would behave exaclty
like my home server. I bisected the segfault to:

commit 0fc9d1040313047edf6a39fd4d7c7defdca97c62
Author: Konstantin Khlebnikov <[email protected]>
Date: Wed Mar 28 14:42:54 2012 -0700

radix-tree: use iterators in find_get_pages* functions


When I revert that on top of 3.4.0 the segfaults are gone but both
firefox en thunderbird go in the lets wait indefinitely mode like the
homeserver.

I am going to make a bit-wise copy from from my homeserver to my
testserver and try again.

--
Hans






2012-06-03 15:00:42

by Hans de Bruin

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On 06/01/2012 09:11 PM, Hans de Bruin wrote:
> On 05/29/2012 12:19 AM, Hans de Bruin wrote:
>> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
>> some trouble. My laptop, a nfsroot client, will not run firefox and
>> thunderbird anymore. When I start these programs from an xterm, the
>> cursor goes to the next line and waits indefinitely.
>>
>> I do not know if there is any order is lsof's output. A lsof | grep
>> firefox or thunderbird shows ......./.parentlock as the last line.
>>
>> It does not matter whether the client is running a 3.4.0 or a 3.3.0
>> kernel, or if the server is running on top of xen or not.
>>
>> There is some noise in the servers dmesg:
>>
>> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120
>> seconds.
>> [ 241.256691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>
> ...
>
> On a almost identical testsystem firefox en thunderbird segfault after
> upgrading to 3.4.0. I would have been nice if it would behave exaclty
> like my home server. I bisected the segfault to:
>
> commit 0fc9d1040313047edf6a39fd4d7c7defdca97c62
> Author: Konstantin Khlebnikov <[email protected]>
> Date: Wed Mar 28 14:42:54 2012 -0700
>
> radix-tree: use iterators in find_get_pages* functions
>
>
> When I revert that on top of 3.4.0 the segfaults are gone but both
> firefox en thunderbird go in the lets wait indefinitely mode like the
> homeserver.
>
> I am going to make a bit-wise copy from from my homeserver to my
> testserver and try again.
>

The bit-wise copy also segfaults firefox and thunderbird at the same
commit.

--
Hans

2012-06-08 20:51:59

by Hans de Bruin

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On 05/29/2012 12:19 AM, Hans de Bruin wrote:
> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
> some trouble. My laptop, a nfsroot client, will not run firefox and
> thunderbird anymore. When I start these programs from an xterm, the
> cursor goes to the next line and waits indefinitely.
>

I have bisected this to


commit f3f8014862d813cca81a597c83bd1dbf0fb2b8f6
Author: Jeff Layton <[email protected]>
Date: Wed Mar 21 09:52:07 2012 -0400

nfsd: add the infrastructure to handle the cld upcall

...and add a mechanism for switching between the "legacy" tracker and
the new one. The decision is made by looking to see whether the
v4recoverydir exists. If it does, then the legacy client tracker is
used.

There are two buts. First git will not revert this, so I can not confirm
this as the bad commit. Seceond, the previous step in the bisect
resulted in the client not being able to start X at all:

The bisect step before that:

commit 813fd320c16691eac508fe350b4ee7362c6c4a56
Author: Jeff Layton <[email protected]>
Date: Wed Mar 21 09:52:08 2012 -0400

nfsd: add notifier to handle mount/unmount of rpc_pipefs sb

In the event that rpc_pipefs isn't mounted when nfsd starts, we
must register a notifier to handle creating the dentry once it
is mounted, and to remove the dentry on unmount.



The tail of my bisect log

# bad: [cc27e0d407021a278d08c1952f5af4ab38c49eda]
nfsd: don't allow legacy client tracker init for anything but init_net
git bisect bad cc27e0d407021a278d08c1952f5af4ab38c49eda
# good: [ab4684d1560f8d77f6ce82bd3f1f82937070d397]
NFSD: Fix nfs4_verifier memory alignment
git bisect good ab4684d1560f8d77f6ce82bd3f1f82937070d397
# good: [b3537c35c21f0e6750aa8bd786949b55509c6d0d]
sunrpc: create nfsd dir in rpc_pipefs
git bisect good b3537c35c21f0e6750aa8bd786949b55509c6d0d
# good: [0ab628d856a63d63b47307b09851d1e955c706ac]
nfsd: add a header describing upcall to nfsdcld
git bisect good 0ab628d856a63d63b47307b09851d1e955c706ac
# bad: [813fd320c16691eac508fe350b4ee7362c6c4a56]
nfsd: add notifier to handle mount/unmount of rpc_pipefs sb
git bisect bad 813fd320c16691eac508fe350b4ee7362c6c4a56
# bad: [f3f8014862d813cca81a597c83bd1dbf0fb2b8f6]
nfsd: add the infrastructure to handle the cld upcall
git bisect bad f3f8014862d813cca81a597c83bd1dbf0fb2b8f6


--
Hans

2012-06-04 20:13:48

by Hans de Bruin

[permalink] [raw]
Subject: Re: nfsroot client will not start firefox or thunderbird from 3.4.0 nfsserver

On 06/04/2012 12:31 PM, Konstantin Khlebnikov wrote:
> Hans de Bruin wrote:
>> On 06/01/2012 09:11 PM, Hans de Bruin wrote:
>>> On 05/29/2012 12:19 AM, Hans de Bruin wrote:
>>>> I just upgraded my home server from kernel 3.3.5 to 3.4.0 and ran into
>>>> some trouble. My laptop, a nfsroot client, will not run firefox and
>>>> thunderbird anymore. When I start these programs from an xterm, the
>>>> cursor goes to the next line and waits indefinitely.
>>>>
>>>> I do not know if there is any order is lsof's output. A lsof | grep
>>>> firefox or thunderbird shows ......./.parentlock as the last line.
>>>>
>>>> It does not matter whether the client is running a 3.4.0 or a 3.3.0
>>>> kernel, or if the server is running on top of xen or not.
>>>>
>>>> There is some noise in the servers dmesg:
>>>>
>>>> [ 241.256684] INFO: task kworker/u:2:801 blocked for more than 120
>>>> seconds.
>>>> [ 241.256691] "echo 0> /proc/sys/kernel/hung_task_timeout_secs"
>>>
>>> ...
>>>
>>> On a almost identical testsystem firefox en thunderbird segfault after
>>> upgrading to 3.4.0. I would have been nice if it would behave exaclty
>>> like my home server. I bisected the segfault to:
>>>
>>> commit 0fc9d1040313047edf6a39fd4d7c7defdca97c62
>>> Author: Konstantin Khlebnikov<[email protected]>
>>> Date: Wed Mar 28 14:42:54 2012 -0700
>>>
>>> radix-tree: use iterators in find_get_pages* functions
>>>
>>>
>>> When I revert that on top of 3.4.0 the segfaults are gone but both
>>> firefox en thunderbird go in the lets wait indefinitely mode like the
>>> homeserver.
>>>
>>> I am going to make a bit-wise copy from from my homeserver to my
>>> testserver and try again.
>>>
>>
>> The bit-wise copy also segfaults firefox and thunderbird at the same
>> commit.
>>
>
> I think bug somewhere in NFS, that patch only highlighted it.
> Please, try to run it with debug patch from attachment.

Before I can start firefox from an xterm the lines below are shown on
the server:

[ 241.260076] INFO: task kworker/u:2:791 blocked for more than 120 seconds.
[ 241.260084] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 241.260090] kworker/u:2 D 000000000000000c 0 791 2
0x00000000
[ 241.260102] ffff8801390b1cf0 0000000000000046 0000000000012d00
0000000000012d00
[ 241.260113] 0000000000012d00 ffff880139141470 0000000000012d00
ffff8801390b1fd8
[ 241.260124] ffff8801390b1fd8 0000000000012d00 ffff880139cdc420
ffff880139141470
[ 241.260135] Call Trace:
[ 241.260152] [<ffffffff81513116>] schedule+0x64/0x66
[ 241.260162] [<ffffffff812005a6>] cld_pipe_upcall+0x95/0xd1
[ 241.260173] [<ffffffff811faae5>] ? nfsd4_exchange_id+0x23e/0x23e
[ 241.260182] [<ffffffff81200a5e>] nfsd4_cld_grace_done+0x50/0x8a
[ 241.260191] [<ffffffff81200f8b>] nfsd4_record_grace_done+0x18/0x1a
[ 241.260200] [<ffffffff811fab2f>] laundromat_main+0x4a/0x213
[ 241.260210] [<ffffffff81069aeb>] ? need_resched+0x1e/0x28
[ 241.260218] [<ffffffff81513035>] ? __schedule+0x49d/0x4b5
[ 241.260227] [<ffffffff811faae5>] ? nfsd4_exchange_id+0x23e/0x23e
[ 241.260237] [<ffffffff8105b8ad>] process_one_work+0x190/0x28d
[ 241.260248] [<ffffffff8105c4e7>] worker_thread+0x105/0x189
[ 241.260260] [<ffffffff81513a75>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[ 241.260274] [<ffffffff8105c3e2>] ? manage_workers.clone.17+0x173/0x173
[ 241.260287] [<ffffffff8105ff30>] kthread+0x8a/0x92
[ 241.260325] [<ffffffff815158a4>] kernel_thread_helper+0x4/0x10
[ 241.260335] [<ffffffff8105fea6>] ?
kthread_freezable_should_stop+0x47/0x47
[ 241.260343] [<ffffffff815158a0>] ? gs_change+0x13/0x13
[ 361.260025] INFO: task kworker/u:2:791 blocked for more than 120 seconds.
[ 361.260032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 361.260039] kworker/u:2 D 000000000000000c 0 791 2
0x00000000
[ 361.260051] ffff8801390b1cf0 0000000000000046 0000000000012d00
0000000000012d00
[ 361.260062] 0000000000012d00 ffff880139141470 0000000000012d00
ffff8801390b1fd8
[ 361.260072] ffff8801390b1fd8 0000000000012d00 ffff880139cdc420
ffff880139141470
[ 361.260084] Call Trace:
[ 361.260099] [<ffffffff81513116>] schedule+0x64/0x66
[ 361.260110] [<ffffffff812005a6>] cld_pipe_upcall+0x95/0xd1
[ 361.260121] [<ffffffff811faae5>] ? nfsd4_exchange_id+0x23e/0x23e
[ 361.260130] [<ffffffff81200a5e>] nfsd4_cld_grace_done+0x50/0x8a
[ 361.260139] [<ffffffff81200f8b>] nfsd4_record_grace_done+0x18/0x1a
[ 361.260148] [<ffffffff811fab2f>] laundromat_main+0x4a/0x213
[ 361.260158] [<ffffffff81069aeb>] ? need_resched+0x1e/0x28
[ 361.260166] [<ffffffff81513035>] ? __schedule+0x49d/0x4b5
[ 361.260175] [<ffffffff811faae5>] ? nfsd4_exchange_id+0x23e/0x23e
[ 361.260185] [<ffffffff8105b8ad>] process_one_work+0x190/0x28d
[ 361.260194] [<ffffffff8105c4e7>] worker_thread+0x105/0x189
[ 361.260203] [<ffffffff81513a75>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[ 361.260213] [<ffffffff8105c3e2>] ? manage_workers.clone.17+0x173/0x173
[ 361.260222] [<ffffffff8105ff30>] kthread+0x8a/0x92
[ 361.260231] [<ffffffff815158a4>] kernel_thread_helper+0x4/0x10
[ 361.260241] [<ffffffff8105fea6>] ?
kthread_freezable_should_stop+0x47/0x47
[ 361.260249] [<ffffffff815158a0>] ? gs_change+0x13/0x13
[ 481.260010] INFO: task kworker/u:2:791 blocked for more than 120 seconds.
[ 481.260019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 481.260028] kworker/u:2 D 000000000000000c 0 791 2
0x00000000
[ 481.260043] ffff8801390b1cf0 0000000000000046 0000000000012d00
0000000000012d00
[ 481.260058] 0000000000012d00 ffff880139141470 0000000000012d00
ffff8801390b1fd8
[ 481.260073] ffff8801390b1fd8 0000000000012d00 ffff880139cdc420
ffff880139141470
[ 481.260088] Call Trace:
[ 481.260107] [<ffffffff81513116>] schedule+0x64/0x66
[ 481.260120] [<ffffffff812005a6>] cld_pipe_upcall+0x95/0xd1
[ 481.260135] [<ffffffff811faae5>] ? nfsd4_exchange_id+0x23e/0x23e
[ 481.260147] [<ffffffff81200a5e>] nfsd4_cld_grace_done+0x50/0x8a
[ 481.260159] [<ffffffff81200f8b>] nfsd4_record_grace_done+0x18/0x1a
[ 481.260172] [<ffffffff811fab2f>] laundromat_main+0x4a/0x213
[ 481.260185] [<ffffffff81069aeb>] ? need_resched+0x1e/0x28
[ 481.260196] [<ffffffff81513035>] ? __schedule+0x49d/0x4b5
[ 481.260206] [<ffffffff811faae5>] ? nfsd4_exchange_id+0x23e/0x23e
[ 481.260215] [<ffffffff8105b8ad>] process_one_work+0x190/0x28d
[ 481.260225] [<ffffffff8105c4e7>] worker_thread+0x105/0x189
[ 481.260234] [<ffffffff81513a75>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[ 481.260243] [<ffffffff8105c3e2>] ? manage_workers.clone.17+0x173/0x173
[ 481.260252] [<ffffffff8105ff30>] kthread+0x8a/0x92
[ 481.260262] [<ffffffff815158a4>] kernel_thread_helper+0x4/0x10
[ 481.260271] [<ffffffff8105fea6>] ?
kthread_freezable_should_stop+0x47/0x47
[ 481.260279] [<ffffffff815158a0>] ? gs_change+0x13/0x13


dmesg on the client side:

[ 27.607606] gtk-query-immod[1976]: segfault at 1d2d1f30 ip b7734391
sp bfe3e984 error 4 in ld-2.13.so[b772b000+1d000]
[ 48.136763] start_kdeinit (2086): /proc/2086/oom_adj is deprecated,
please use /proc/2086/oom_score_adj instead.
[ 75.801804] blueman-applet[2150]: segfault at 1cf2cf30 ip b7741391 sp
bfb456b8 error 4 in ld-2.13.so[b7738000+1d000]
[ 140.226371] firefox[2175]: segfault at 1b065f30 ip b76f6391 sp
bfb15db8 error 4 in ld-2.13.so[b76ed000+1d000]


The firefox dump on client side produces no messages on server side.

md5sum's of ld-2.13.so are equal on server and client and across
kernlversions.



Did I miss the output off the debug patch?


--
Hans