2011-03-18 22:56:10

by Wolfgang Walter

Subject: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

Hello,

I have a problem with our NFS server (stable 2.6.32.33, but also with .31 or .32,
and probably older ones): sometimes one or more rpciod threads get stuck. I used

rpcdebug -m rpc -s all

I get messages like the following about every second:

Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:38 au kernel: [44641.906793] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:39 au kernel: [44642.906795] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:40 au kernel: [44643.906793] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:41 au kernel: [44644.906795] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks for client ffff88041c51de00

and I get this message:

Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none

My theory is this:

* this async task is runnable but never makes progress (its next step would be rpc_exit_task).
* that is because the same rpciod that would run this task is looping in
rpc_shutdown_client, waiting for this task to go away.
* and that, in turn, is because rpc_shutdown_client was itself called from an async RPC.

At the beginning it is always one or more tasks like the one above.
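
For reference, rpc_shutdown_client() in 2.6.32 looks roughly like this (reproduced
from memory of net/sunrpc/clnt.c, so details may differ); the 1*HZ
wait_event_timeout() in its loop would also explain why the "killing all tasks"
message repeats about once per second:

void rpc_shutdown_client(struct rpc_clnt *clnt)
{
        /*
         * Loop until every task of this client has gone away.  If the
         * caller happens to be the very rpciod worker that would have to
         * run the remaining task, this can never terminate.
         */
        while (!list_empty(&clnt->cl_tasks)) {
                rpc_killall_tasks(clnt);        /* logs "killing all tasks for client ..." */
                wait_event_timeout(destroy_wait,
                                   list_empty(&clnt->cl_tasks), 1*HZ);
        }

        rpc_release_client(clnt);
}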

Once an rpciod thread hangs, more and more other tasks hang forever:

Mar 18 22:45:57 au kernel: [86061.778809] -pid- flgs status -client- --rqstp- -timeout ---ops--
Mar 18 22:45:57 au kernel: [86061.778819] 300 0281 -13 ffff8801ef5d0600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.778823] 289 0281 0 ffff880142a49800 ffff8802a1dde000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778827] 286 0281 0 ffff880349f57e00 ffff88010affe000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778830] 283 0281 0 ffff88041d19ac00 ffff880418650000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778833] 280 0281 0 ffff880380561400 ffff8803d3b7a000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778837] 277 0281 0 ffff8803ef73ac00 ffff88010af50000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778840] 274 0281 0 ffff880302c31c00 ffff8803e92b6000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778844] 271 0281 0 ffff8803c44d0600 ffff88002687e000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778847] 234 0281 -110 ffff8801f414b000 ffff880005382000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
Mar 18 22:45:57 au kernel: [86061.778851] 233 0281 -110 ffff8802c9945800 ffff880050382000 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778855] 235 0281 -110 ffff8802c9945800 ffff880050382150 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778858] 236 0281 -110 ffff8802c9945800 ffff8800503822a0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778861] 237 0281 -110 ffff8802c9945800 ffff8800503823f0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778864] 238 0281 -110 ffff8802c9945800 ffff880050382540 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778867] 239 0281 -110 ffff8802c9945800 ffff880050382690 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778870] 240 0281 -110 ffff8802c9945800 ffff8800503827e0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778874] 241 0281 -110 ffff8802c9945800 ffff880050382930 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778877] 242 0281 -110 ffff8802c9945800 ffff880050382a80 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778880] 243 0281 -110 ffff8802c9945800 ffff880050382bd0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778884] 244 0281 -110 ffff8802c9945800 ffff880050382d20 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778887] 245 0281 -110 ffff8802c9945800 ffff880050382e70 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778890] 246 0281 -110 ffff8802c9945800 ffff880050382fc0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778893] 247 0281 -110 ffff8802c9945800 ffff880050383110 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778896] 248 0281 -110 ffff8802c9945800 ffff880050383260 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778899] 249 0281 -110 ffff8802c9945800 ffff8800503833b0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778903] 250 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778906] 251 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778909] 252 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778912] 253 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778915] 254 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778919] 255 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778922] 256 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778925] 257 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778928] 258 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778931] 259 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778935] 260 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778938] 261 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778941] 262 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778945] 263 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778948] 264 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778951] 265 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778954] 266 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778958] 267 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778961] 268 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778964] 269 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778967] 272 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778971] 275 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778974] 278 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778977] 281 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778980] 284 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778983] 287 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778987] 290 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778990] 232 0281 -110 ffff8803d3bc6000 ffff8803d5eaa000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
Mar 18 22:45:57 au kernel: [86061.778993] 230 0281 -13 ffff8801ef5d0200 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.778997] 180 0281 -13 ffff880341e36800 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.779000] 179 0281 -13 ffff88027dc01000 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.779003] 178 0281 -13 ffff8801bddac600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none

I think I saw hanging rpciods once in a while before, but since we started using
Kerberos it happens very often.


My theory may be completely bogus and there may be another reason; I'm not
familiar with the code.

Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


2011-03-21 23:29:05

by J. Bruce Fields

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> Hello,
>
> I have a problem with our nfs-server (stable 2.6.32.33 but also with
> .31 or .32 and probably older ones): sometimes
> one or more rpciod get stuck. I used
>
> rpcdebug -m rpc -s all
>
> I get messages as the following one about every second:
>
> Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:38 au kernel: [44641.906793] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:39 au kernel: [44642.906795] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:40 au kernel: [44643.906793] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:41 au kernel: [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks for client ffff88041c51de00
>
> and I get this messages:
>
> Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none
>
> My theorie is this one:
>
> * this async task is runnable but does not progress (calling rpc_exit_task).
> * this is because the same rpciod which handles this task loops in
> rpc_shutdown_client waiting for this task to go away.
> * because rpc_shutdown_client is called from an async rpc, too

Off hand I don't see any place where rpc_shutdown_client() is called
from rpciod; do you?

> At the beginning is is always one or more tasks as above.
>
> Once a rpciod hangs more an more other tasks hang forever:
>
> Mar 18 22:45:57 au kernel: [86061.778809] -pid- flgs status -client- --rqstp- -timeout ---ops--
> Mar 18 22:45:57 au kernel: [86061.778819] 300 0281 -13 ffff8801ef5d0600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.778823] 289 0281 0 ffff880142a49800 ffff8802a1dde000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778827] 286 0281 0 ffff880349f57e00 ffff88010affe000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778830] 283 0281 0 ffff88041d19ac00 ffff880418650000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none

There's a lot of these GETPORT calls. Is portmap/rpcbind down?

--b.

> Mar 18 22:45:57 au kernel: [86061.778833] 280 0281 0 ffff880380561400 ffff8803d3b7a000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778837] 277 0281 0 ffff8803ef73ac00 ffff88010af50000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778840] 274 0281 0 ffff880302c31c00 ffff8803e92b6000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778844] 271 0281 0 ffff8803c44d0600 ffff88002687e000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778847] 234 0281 -110 ffff8801f414b000 ffff880005382000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
> Mar 18 22:45:57 au kernel: [86061.778851] 233 0281 -110 ffff8802c9945800 ffff880050382000 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778855] 235 0281 -110 ffff8802c9945800 ffff880050382150 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778858] 236 0281 -110 ffff8802c9945800 ffff8800503822a0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778861] 237 0281 -110 ffff8802c9945800 ffff8800503823f0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778864] 238 0281 -110 ffff8802c9945800 ffff880050382540 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778867] 239 0281 -110 ffff8802c9945800 ffff880050382690 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778870] 240 0281 -110 ffff8802c9945800 ffff8800503827e0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778874] 241 0281 -110 ffff8802c9945800 ffff880050382930 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778877] 242 0281 -110 ffff8802c9945800 ffff880050382a80 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778880] 243 0281 -110 ffff8802c9945800 ffff880050382bd0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778884] 244 0281 -110 ffff8802c9945800 ffff880050382d20 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778887] 245 0281 -110 ffff8802c9945800 ffff880050382e70 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778890] 246 0281 -110 ffff8802c9945800 ffff880050382fc0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778893] 247 0281 -110 ffff8802c9945800 ffff880050383110 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778896] 248 0281 -110 ffff8802c9945800 ffff880050383260 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778899] 249 0281 -110 ffff8802c9945800 ffff8800503833b0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778903] 250 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778906] 251 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778909] 252 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778912] 253 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778915] 254 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778919] 255 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778922] 256 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778925] 257 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778928] 258 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778931] 259 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778935] 260 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778938] 261 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778941] 262 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778945] 263 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778948] 264 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778951] 265 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778954] 266 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778958] 267 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778961] 268 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778964] 269 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778967] 272 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778971] 275 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778974] 278 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778977] 281 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778980] 284 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778983] 287 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778987] 290 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778990] 232 0281 -110 ffff8803d3bc6000 ffff8803d5eaa000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
> Mar 18 22:45:57 au kernel: [86061.778993] 230 0281 -13 ffff8801ef5d0200 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.778997] 180 0281 -13 ffff880341e36800 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.779000] 179 0281 -13 ffff88027dc01000 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.779003] 178 0281 -13 ffff8801bddac600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none
>
> I think I saw hanging rpciods once in a while but since we use kerberos we
> have it very often.
>
>
> My theorie may be completely bogus and there is another reason, I'm not
> familiar with the code.
>
> Regards,
> --
> Wolfgang Walter
> Studentenwerk München
> Anstalt des öffentlichen Rechts

2011-03-21 23:35:43

by Myklebust, Trond

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Mon, 2011-03-21 at 19:28 -0400, J. Bruce Fields wrote:
> On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > Hello,
> >
> > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > .31 or .32 and probably older ones): sometimes
> > one or more rpciod get stuck. I used
> >
> > rpcdebug -m rpc -s all
> >
> > I get messages as the following one about every second:
> >
> > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:38 au kernel: [44641.906793] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:39 au kernel: [44642.906795] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:40 au kernel: [44643.906793] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:41 au kernel: [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks for client ffff88041c51de00
> >
> > and I get this messages:
> >
> > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none
> >
> > My theorie is this one:
> >
> > * this async task is runnable but does not progress (calling rpc_exit_task).
> > * this is because the same rpciod which handles this task loops in
> > rpc_shutdown_client waiting for this task to go away.
> > * because rpc_shutdown_client is called from an async rpc, too
>
> Off hand I don't see any place where rpc_shutdown_client() is called
> from rpciod; do you?

The only case I could think of would be if we're still calling mntput()
from some RPC callback. In principle we should only be doing that from
rpc_call_ops->rpc_callback(), which runs within the nfsiod thread rather
than rpciod.

Is it possible this might be another instance of the nfs_commit_inode()
busy-loop?
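
For reference, the async callback hooks look roughly like this in 2.6.32's
include/linux/sunrpc/sched.h (reproduced from memory, so details may differ); for
an async task the first two are invoked directly by an rpciod worker, so nothing
called from them may sleep waiting for rpciod itself:

struct rpc_call_ops {
        void (*rpc_call_prepare)(struct rpc_task *, void *);   /* rpciod context */
        void (*rpc_call_done)(struct rpc_task *, void *);      /* rpciod context */
        void (*rpc_release)(void *);                            /* final cleanup */
};

Anything that can end up in rpc_shutdown_client() or a final mntput() therefore
has to be deferred to another context such as nfsiod or a separate workqueue.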

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-03-22 14:52:27

by Wolfgang Walter

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > Hello,
> >
> > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > .31 or .32 and probably older ones): sometimes
> > one or more rpciod get stuck. I used
> >
> > rpcdebug -m rpc -s all
> >
> > I get messages as the following one about every second:
> >
> > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks
> > for client ffff88041c51de00 Mar 18 11:15:38 au kernel: [44641.906793]
> > RPC: killing all tasks for client ffff88041c51de00 Mar 18 11:15:39
> > au kernel: [44642.906795] RPC: killing all tasks for client
> > ffff88041c51de00 Mar 18 11:15:40 au kernel: [44643.906793] RPC:
> > killing all tasks for client ffff88041c51de00 Mar 18 11:15:41 au kernel:
> > [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks
> > for client ffff88041c51de00
> >
> > and I get this messages:
> >
> > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5
> > ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL
> > a:rpc_exit_task q:none
> >
> > My theorie is this one:
> >
> > * this async task is runnable but does not progress (calling
> > rpc_exit_task). * this is because the same rpciod which handles this task
> > loops in rpc_shutdown_client waiting for this task to go away.
> > * because rpc_shutdown_client is called from an async rpc, too
>
> Off hand I don't see any place where rpc_shutdown_client() is called
> from rpciod; do you?

I'm not familiar with the code.

But could it be that this is in fs/nfsd/nfs4state.c?

Just a guess, because 2.6.38 does not have this problem, and in 2.6.38 that code
seems to have a workqueue of its own.
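
A rough sketch of what I mean, with names taken from my reading of the 2.6.38
sources (so they may be off): fs/nfsd/nfs4callback.c sets up a dedicated
single-threaded workqueue, and the callback client setup/teardown runs there
instead of on rpciod:

static struct workqueue_struct *callback_wq;

int nfsd4_create_callback_queue(void)
{
        callback_wq = create_singlethread_workqueue("nfsd4_callbacks");
        if (!callback_wq)
                return -ENOMEM;
        return 0;
}

static void nfsd4_do_callback_rpc(struct work_struct *w)
{
        /* the callback client is created and torn down from here, in this
         * workqueue's own thread, never from inside an rpciod worker */
}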

>
> > At the beginning is is always one or more tasks as above.
> >
> > Once a rpciod hangs more an more other tasks hang forever:
> >
> > Mar 18 22:45:57 au kernel: [86061.778809] -pid- flgs status -client-
> > --rqstp- -timeout ---ops-- Mar 18 22:45:57 au kernel: [86061.778819]
> > 300 0281 -13 ffff8801ef5d0600 (null) 0 ffffffff817211a0
> > nfs4_cbv1 CB_NULL a:call_refreshresult q:none Mar 18 22:45:57 au kernel:
> > [86061.778823] 289 0281 0 ffff880142a49800 ffff8802a1dde000
> > 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none Mar 18 22:45:57
> > au kernel: [86061.778827] 286 0281 0 ffff880349f57e00
> > ffff88010affe000 0 ffffffff817a3fd0 rpcbindv2 GETPORT
> > a:call_status q:none Mar 18 22:45:57 au kernel: [86061.778830] 283 0281
> > 0 ffff88041d19ac00 ffff880418650000 0 ffffffff817a3fd0
> > rpcbindv2 GETPORT a:call_status q:none
>
> There's a lot of these GETPORT calls. Is portmap/rpcbind down?

No, it is running.

I think these GETPORT calls get scheduled as tasks on the hanging rpciod.


Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

2011-03-22 15:01:53

by J. Bruce Fields

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Tue, Mar 22, 2011 at 03:52:21PM +0100, Wolfgang Walter wrote:
> On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> > On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > > Hello,
> > >
> > > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > > .31 or .32 and probably older ones): sometimes
> > > one or more rpciod get stuck. I used
> > >
> > > rpcdebug -m rpc -s all
> > >
> > > I get messages as the following one about every second:
> > >
> > > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks
> > > for client ffff88041c51de00 Mar 18 11:15:38 au kernel: [44641.906793]
> > > RPC: killing all tasks for client ffff88041c51de00 Mar 18 11:15:39
> > > au kernel: [44642.906795] RPC: killing all tasks for client
> > > ffff88041c51de00 Mar 18 11:15:40 au kernel: [44643.906793] RPC:
> > > killing all tasks for client ffff88041c51de00 Mar 18 11:15:41 au kernel:
> > > [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> > > Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks
> > > for client ffff88041c51de00
> > >
> > > and I get this messages:
> > >
> > > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5
> > > ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL
> > > a:rpc_exit_task q:none
> > >
> > > My theorie is this one:
> > >
> > > * this async task is runnable but does not progress (calling
> > > rpc_exit_task). * this is because the same rpciod which handles this task
> > > loops in rpc_shutdown_client waiting for this task to go away.
> > > * because rpc_shutdown_client is called from an async rpc, too
> >
> > Off hand I don't see any place where rpc_shutdown_client() is called
> > from rpciod; do you?
>
> I'm not familiar with the code.
>
> But could it be that this is in fs/nfsd/nfs4state.c ?
>
> Just a guess because 2.6.38 does not have this problem and in 2.6.38 it seems
> to have a workqueue of its own.

Well spotted; yes, it's true that 2.6.32 called put_nfs4_client() from
an rpc_call_done callback, that put_nfs4_client() can end up calling
rpc_shutdown_client(), and that that has since been fixed....
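
To spell the deadlock out, roughly (a from-memory sketch of the 2.6.32 code, so
names and details may be slightly off):

/* fs/nfsd/nfs4callback.c -- .rpc_call_done of the CB_NULL probe,
 * i.e. this runs inside an rpciod worker: */
static void nfsd4_cb_probe_done(struct rpc_task *task, void *calldata)
{
        struct nfs4_client *clp = calldata;

        if (task->tk_status)
                warn_no_callback_path(clp, task->tk_status);
        put_nfs4_client(clp);           /* may drop the last reference */
}

/* fs/nfsd/nfs4state.c: */
void put_nfs4_client(struct nfs4_client *clp)
{
        if (atomic_dec_and_test(&clp->cl_count))
                free_client(clp);       /* -> shutdown_callback_client()
                                         *  -> rpc_shutdown_client(), which then
                                         * waits, on rpciod, for a task that only
                                         * rpciod can run */
}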

If someone wants to backport the fix to 2.6.32.y....

Actually I think it might be sufficient just to apply
147efd0dd702ce2f1ab44449bd70369405ef68fd? But I haven't tried.

--b.

commit 147efd0dd702ce2f1ab44449bd70369405ef68fd
Author: J. Bruce Fields <[email protected]>
Date: Sun Feb 21 17:41:19 2010 -0800

nfsd4: shutdown callbacks on expiry

Once we've expired the client, there's no further purpose to the
callbacks; go ahead and shut down the callback client rather than
waiting for the last reference to go.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index efef7f2..9ce5831 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -697,9 +697,6 @@ shutdown_callback_client(struct nfs4_client *clp)
static inline void
free_client(struct nfs4_client *clp)
{
- shutdown_callback_client(clp);
- if (clp->cl_cb_xprt)
- svc_xprt_put(clp->cl_cb_xprt);
if (clp->cl_cred.cr_group_info)
put_group_info(clp->cl_cred.cr_group_info);
kfree(clp->cl_principal);
@@ -752,6 +749,9 @@ expire_client(struct nfs4_client *clp)
se_perclnt);
release_session(ses);
}
+ shutdown_callback_client(clp);
+ if (clp->cl_cb_xprt)
+ svc_xprt_put(clp->cl_cb_xprt);
put_nfs4_client(clp);
}

2011-03-23 07:54:42

by Wolfgang Walter

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> On Tue, Mar 22, 2011 at 03:52:21PM +0100, Wolfgang Walter wrote:
> > On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> > > On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > > > Hello,
> > > >
> > > > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > > > .31 or .32 and probably older ones): sometimes
> > > > one or more rpciod get stuck. I used
> > > >
> > > > rpcdebug -m rpc -s all
> > > >
> > > > I get messages as the following one about every second:
> > > >
> > > > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all
> > > > tasks for client ffff88041c51de00 Mar 18 11:15:38 au kernel:
> > > > [44641.906793] RPC: killing all tasks for client
> > > > ffff88041c51de00 Mar 18 11:15:39 au kernel: [44642.906795] RPC:
> > > > killing all tasks for client ffff88041c51de00 Mar 18 11:15:40 au
> > > > kernel: [44643.906793] RPC: killing all tasks for client
> > > > ffff88041c51de00 Mar 18 11:15:41 au kernel: [44644.906795] RPC:
> > > > killing all tasks for client ffff88041c51de00 Mar 18 11:15:42 au
> > > > kernel: [44645.906794] RPC: killing all tasks for client
> > > > ffff88041c51de00
> > > >
> > > > and I get this messages:
> > > >
> > > > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5
> > > > ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL
> > > > a:rpc_exit_task q:none
> > > >
> > > > My theorie is this one:
> > > >
> > > > * this async task is runnable but does not progress (calling
> > > > rpc_exit_task). * this is because the same rpciod which handles this
> > > > task loops in rpc_shutdown_client waiting for this task to go away. *
> > > > because rpc_shutdown_client is called from an async rpc, too
> > >
> > > Off hand I don't see any place where rpc_shutdown_client() is called
> > > from rpciod; do you?
> >
> > I'm not familiar with the code.
> >
> > But could it be that this is in fs/nfsd/nfs4state.c ?
> >
> > Just a guess because 2.6.38 does not have this problem and in 2.6.38 it
> > seems to have a workqueue of its own.
>
> Well, spotted, yes it's true that 2.6.32 had called put_nfs4_client()
> from an rpc_call_done callback, that put_nfs4_client() can end up
> calling rpc_shutdown_client, and that that's since been fixed....
>
> If someone wants to backport the fix to 2.6.32.y....
>
> Actually I think it might be sufficient just to apply
> 147efd0dd702ce2f1ab44449bd70369405ef68fd ? But I haven't tried.
>
> --b.
>
> commit 147efd0dd702ce2f1ab44449bd70369405ef68fd
> Author: J. Bruce Fields <[email protected]>
> Date: Sun Feb 21 17:41:19 2010 -0800
>
> nfsd4: shutdown callbacks on expiry
>
> Once we've expired the client, there's no further purpose to the
> callbacks; go ahead and shut down the callback client rather than
> waiting for the last reference to go.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index efef7f2..9ce5831 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -697,9 +697,6 @@ shutdown_callback_client(struct nfs4_client *clp)
> static inline void
> free_client(struct nfs4_client *clp)
> {
> - shutdown_callback_client(clp);
> - if (clp->cl_cb_xprt)
> - svc_xprt_put(clp->cl_cb_xprt);
> if (clp->cl_cred.cr_group_info)
> put_group_info(clp->cl_cred.cr_group_info);
> kfree(clp->cl_principal);
> @@ -752,6 +749,9 @@ expire_client(struct nfs4_client *clp)
> se_perclnt);
> release_session(ses);
> }
> + shutdown_callback_client(clp);
> + if (clp->cl_cb_xprt)
> + svc_xprt_put(clp->cl_cb_xprt);
> put_nfs4_client(clp);
> }

I'll test it this weekend.

I use 2.6.38 on this server for now and will probably stay with it. But having
a working longterm kernel to fall back on is important to me :-).

Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts