2011-03-18 22:56:10

by Wolfgang Walter

Subject: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

Hello,

I have a problem with our NFS server (stable 2.6.32.33, but also with .31 or .32,
and probably older ones): sometimes one or more rpciod threads get stuck. I used

rpcdebug -m rpc -s all

I get messages like the following about every second:

Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:38 au kernel: [44641.906793] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:39 au kernel: [44642.906795] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:40 au kernel: [44643.906793] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:41 au kernel: [44644.906795] RPC: killing all tasks for client ffff88041c51de00
Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks for client ffff88041c51de00

and I get this message:

Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none

My theory is this:

* this async task is runnable but never makes progress (its next step would be rpc_exit_task).
* that is because the same rpciod that would run this task is looping in
rpc_shutdown_client, waiting for this task to go away.
* and that, in turn, is because rpc_shutdown_client was itself called from an async RPC.

At the beginning it is always one or more tasks like the one above.
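
For reference, rpc_shutdown_client() in 2.6.32 looks roughly like this (reproduced
from memory of net/sunrpc/clnt.c, so details may differ); the 1*HZ
wait_event_timeout() in its loop would also explain why the "killing all tasks"
message repeats about once per second:

void rpc_shutdown_client(struct rpc_clnt *clnt)
{
        /*
         * Loop until every task of this client has gone away.  If the
         * caller happens to be the very rpciod worker that would have to
         * run the remaining task, this can never terminate.
         */
        while (!list_empty(&clnt->cl_tasks)) {
                rpc_killall_tasks(clnt);        /* logs "killing all tasks for client ..." */
                wait_event_timeout(destroy_wait,
                                   list_empty(&clnt->cl_tasks), 1*HZ);
        }

        rpc_release_client(clnt);
}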

Once an rpciod thread hangs, more and more other tasks hang forever:

Mar 18 22:45:57 au kernel: [86061.778809] -pid- flgs status -client- --rqstp- -timeout ---ops--
Mar 18 22:45:57 au kernel: [86061.778819] 300 0281 -13 ffff8801ef5d0600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.778823] 289 0281 0 ffff880142a49800 ffff8802a1dde000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778827] 286 0281 0 ffff880349f57e00 ffff88010affe000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778830] 283 0281 0 ffff88041d19ac00 ffff880418650000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778833] 280 0281 0 ffff880380561400 ffff8803d3b7a000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778837] 277 0281 0 ffff8803ef73ac00 ffff88010af50000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778840] 274 0281 0 ffff880302c31c00 ffff8803e92b6000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778844] 271 0281 0 ffff8803c44d0600 ffff88002687e000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
Mar 18 22:45:57 au kernel: [86061.778847] 234 0281 -110 ffff8801f414b000 ffff880005382000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
Mar 18 22:45:57 au kernel: [86061.778851] 233 0281 -110 ffff8802c9945800 ffff880050382000 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778855] 235 0281 -110 ffff8802c9945800 ffff880050382150 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778858] 236 0281 -110 ffff8802c9945800 ffff8800503822a0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778861] 237 0281 -110 ffff8802c9945800 ffff8800503823f0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778864] 238 0281 -110 ffff8802c9945800 ffff880050382540 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778867] 239 0281 -110 ffff8802c9945800 ffff880050382690 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778870] 240 0281 -110 ffff8802c9945800 ffff8800503827e0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778874] 241 0281 -110 ffff8802c9945800 ffff880050382930 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778877] 242 0281 -110 ffff8802c9945800 ffff880050382a80 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778880] 243 0281 -110 ffff8802c9945800 ffff880050382bd0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778884] 244 0281 -110 ffff8802c9945800 ffff880050382d20 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778887] 245 0281 -110 ffff8802c9945800 ffff880050382e70 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778890] 246 0281 -110 ffff8802c9945800 ffff880050382fc0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778893] 247 0281 -110 ffff8802c9945800 ffff880050383110 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778896] 248 0281 -110 ffff8802c9945800 ffff880050383260 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778899] 249 0281 -110 ffff8802c9945800 ffff8800503833b0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
Mar 18 22:45:57 au kernel: [86061.778903] 250 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778906] 251 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778909] 252 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778912] 253 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778915] 254 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778919] 255 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778922] 256 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778925] 257 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778928] 258 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778931] 259 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778935] 260 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778938] 261 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778941] 262 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778945] 263 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778948] 264 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778951] 265 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778954] 266 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778958] 267 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778961] 268 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778964] 269 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778967] 272 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778971] 275 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778974] 278 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778977] 281 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778980] 284 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778983] 287 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778987] 290 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
Mar 18 22:45:57 au kernel: [86061.778990] 232 0281 -110 ffff8803d3bc6000 ffff8803d5eaa000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
Mar 18 22:45:57 au kernel: [86061.778993] 230 0281 -13 ffff8801ef5d0200 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.778997] 180 0281 -13 ffff880341e36800 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.779000] 179 0281 -13 ffff88027dc01000 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.779003] 178 0281 -13 ffff8801bddac600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none

I think I saw hanging rpciods once in a while before, but since we started using
Kerberos it happens very often.


My theory may be completely bogus and there may be another reason; I'm not
familiar with the code.

Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


2011-03-21 23:29:05

by J. Bruce Fields

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> Hello,
>
> I have a problem with our nfs-server (stable 2.6.32.33 but also with
> .31 or .32 and probably older ones): sometimes
> one or more rpciod get stuck. I used
>
> rpcdebug -m rpc -s all
>
> I get messages as the following one about every second:
>
> Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:38 au kernel: [44641.906793] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:39 au kernel: [44642.906795] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:40 au kernel: [44643.906793] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:41 au kernel: [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks for client ffff88041c51de00
>
> and I get this messages:
>
> Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none
>
> My theorie is this one:
>
> * this async task is runnable but does not progress (calling rpc_exit_task).
> * this is because the same rpciod which handles this task loops in
> rpc_shutdown_client waiting for this task to go away.
> * because rpc_shutdown_client is called from an async rpc, too

Off hand I don't see any place where rpc_shutdown_client() is called
from rpciod; do you?

> At the beginning is is always one or more tasks as above.
>
> Once a rpciod hangs more an more other tasks hang forever:
>
> Mar 18 22:45:57 au kernel: [86061.778809] -pid- flgs status -client- --rqstp- -timeout ---ops--
> Mar 18 22:45:57 au kernel: [86061.778819] 300 0281 -13 ffff8801ef5d0600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.778823] 289 0281 0 ffff880142a49800 ffff8802a1dde000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778827] 286 0281 0 ffff880349f57e00 ffff88010affe000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778830] 283 0281 0 ffff88041d19ac00 ffff880418650000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none

There's a lot of these GETPORT calls. Is portmap/rpcbind down?

--b.

> Mar 18 22:45:57 au kernel: [86061.778833] 280 0281 0 ffff880380561400 ffff8803d3b7a000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778837] 277 0281 0 ffff8803ef73ac00 ffff88010af50000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778840] 274 0281 0 ffff880302c31c00 ffff8803e92b6000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778844] 271 0281 0 ffff8803c44d0600 ffff88002687e000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none
> Mar 18 22:45:57 au kernel: [86061.778847] 234 0281 -110 ffff8801f414b000 ffff880005382000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
> Mar 18 22:45:57 au kernel: [86061.778851] 233 0281 -110 ffff8802c9945800 ffff880050382000 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778855] 235 0281 -110 ffff8802c9945800 ffff880050382150 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778858] 236 0281 -110 ffff8802c9945800 ffff8800503822a0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778861] 237 0281 -110 ffff8802c9945800 ffff8800503823f0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778864] 238 0281 -110 ffff8802c9945800 ffff880050382540 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778867] 239 0281 -110 ffff8802c9945800 ffff880050382690 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778870] 240 0281 -110 ffff8802c9945800 ffff8800503827e0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778874] 241 0281 -110 ffff8802c9945800 ffff880050382930 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778877] 242 0281 -110 ffff8802c9945800 ffff880050382a80 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778880] 243 0281 -110 ffff8802c9945800 ffff880050382bd0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778884] 244 0281 -110 ffff8802c9945800 ffff880050382d20 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778887] 245 0281 -110 ffff8802c9945800 ffff880050382e70 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778890] 246 0281 -110 ffff8802c9945800 ffff880050382fc0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778893] 247 0281 -110 ffff8802c9945800 ffff880050383110 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778896] 248 0281 -110 ffff8802c9945800 ffff880050383260 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778899] 249 0281 -110 ffff8802c9945800 ffff8800503833b0 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_bind_status q:none
> Mar 18 22:45:57 au kernel: [86061.778903] 250 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778906] 251 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778909] 252 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778912] 253 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778915] 254 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778919] 255 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778922] 256 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778925] 257 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778928] 258 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778931] 259 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778935] 260 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778938] 261 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778941] 262 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778945] 263 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778948] 264 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778951] 265 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778954] 266 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778958] 267 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778961] 268 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778964] 269 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 UNLOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778967] 272 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778971] 275 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778974] 278 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778977] 281 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778980] 284 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778983] 287 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778987] 290 0281 -11 ffff8802c9945800 (null) 0 ffffffff81721440 lockdv4 LOCK_RES a:call_reserveresult q:xprt_backlog
> Mar 18 22:45:57 au kernel: [86061.778990] 232 0281 -110 ffff8803d3bc6000 ffff8803d5eaa000 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_connect_status q:none
> Mar 18 22:45:57 au kernel: [86061.778993] 230 0281 -13 ffff8801ef5d0200 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.778997] 180 0281 -13 ffff880341e36800 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.779000] 179 0281 -13 ffff88027dc01000 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.779003] 178 0281 -13 ffff8801bddac600 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:call_refreshresult q:none
> Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none
>
> I think I saw hanging rpciods once in a while but since we use kerberos we
> have it very often.
>
>
> My theorie may be completely bogus and there is another reason, I'm not
> familiar with the code.
>
> Regards,
> --
> Wolfgang Walter
> Studentenwerk München
> Anstalt des öffentlichen Rechts

2011-03-21 23:35:43

by Myklebust, Trond

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Mon, 2011-03-21 at 19:28 -0400, J. Bruce Fields wrote:
> On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > Hello,
> >
> > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > .31 or .32 and probably older ones): sometimes
> > one or more rpciod get stuck. I used
> >
> > rpcdebug -m rpc -s all
> >
> > I get messages as the following one about every second:
> >
> > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:38 au kernel: [44641.906793] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:39 au kernel: [44642.906795] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:40 au kernel: [44643.906793] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:41 au kernel: [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks for client ffff88041c51de00
> >
> > and I get this messages:
> >
> > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5 ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL a:rpc_exit_task q:none
> >
> > My theorie is this one:
> >
> > * this async task is runnable but does not progress (calling rpc_exit_task).
> > * this is because the same rpciod which handles this task loops in
> > rpc_shutdown_client waiting for this task to go away.
> > * because rpc_shutdown_client is called from an async rpc, too
>
> Off hand I don't see any place where rpc_shutdown_client() is called
> from rpciod; do you?

The only case I could think of would be if we're still calling mntput()
from some RPC callback. In principle we should only be doing that from
rpc_call_ops->rpc_callback(), which runs within the nfsiod thread rather
than rpciod.

Is it possible this might be another instance of the nfs_commit_inode()
busy-loop?
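
For reference, the async callback hooks look roughly like this in 2.6.32's
include/linux/sunrpc/sched.h (reproduced from memory, so details may differ); for
an async task the first two are invoked directly by an rpciod worker, so nothing
called from them may sleep waiting for rpciod itself:

struct rpc_call_ops {
        void (*rpc_call_prepare)(struct rpc_task *, void *);   /* rpciod context */
        void (*rpc_call_done)(struct rpc_task *, void *);      /* rpciod context */
        void (*rpc_release)(void *);                            /* final cleanup */
};

Anything that can end up in rpc_shutdown_client() or a final mntput() therefore
has to be deferred to another context such as nfsiod or a separate workqueue.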

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-03-22 14:52:27

by Wolfgang Walter

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > Hello,
> >
> > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > .31 or .32 and probably older ones): sometimes
> > one or more rpciod get stuck. I used
> >
> > rpcdebug -m rpc -s all
> >
> > I get messages as the following one about every second:
> >
> > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks
> > for client ffff88041c51de00 Mar 18 11:15:38 au kernel: [44641.906793]
> > RPC: killing all tasks for client ffff88041c51de00 Mar 18 11:15:39
> > au kernel: [44642.906795] RPC: killing all tasks for client
> > ffff88041c51de00 Mar 18 11:15:40 au kernel: [44643.906793] RPC:
> > killing all tasks for client ffff88041c51de00 Mar 18 11:15:41 au kernel:
> > [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> > Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks
> > for client ffff88041c51de00
> >
> > and I get this messages:
> >
> > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5
> > ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL
> > a:rpc_exit_task q:none
> >
> > My theorie is this one:
> >
> > * this async task is runnable but does not progress (calling
> > rpc_exit_task). * this is because the same rpciod which handles this task
> > loops in rpc_shutdown_client waiting for this task to go away.
> > * because rpc_shutdown_client is called from an async rpc, too
>
> Off hand I don't see any place where rpc_shutdown_client() is called
> from rpciod; do you?

I'm not familiar with the code.

But could it be that this is in fs/nfsd/nfs4state.c?

Just a guess, because 2.6.38 does not have this problem, and in 2.6.38 that code
seems to have a workqueue of its own.
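
A rough sketch of what I mean, with names taken from my reading of the 2.6.38
sources (so they may be off): fs/nfsd/nfs4callback.c sets up a dedicated
single-threaded workqueue, and the callback client setup/teardown runs there
instead of on rpciod:

static struct workqueue_struct *callback_wq;

int nfsd4_create_callback_queue(void)
{
        callback_wq = create_singlethread_workqueue("nfsd4_callbacks");
        if (!callback_wq)
                return -ENOMEM;
        return 0;
}

static void nfsd4_do_callback_rpc(struct work_struct *w)
{
        /* the callback client is created and torn down from here, in this
         * workqueue's own thread, never from inside an rpciod worker */
}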

>
> > At the beginning is is always one or more tasks as above.
> >
> > Once a rpciod hangs more an more other tasks hang forever:
> >
> > Mar 18 22:45:57 au kernel: [86061.778809] -pid- flgs status -client-
> > --rqstp- -timeout ---ops-- Mar 18 22:45:57 au kernel: [86061.778819]
> > 300 0281 -13 ffff8801ef5d0600 (null) 0 ffffffff817211a0
> > nfs4_cbv1 CB_NULL a:call_refreshresult q:none Mar 18 22:45:57 au kernel:
> > [86061.778823] 289 0281 0 ffff880142a49800 ffff8802a1dde000
> > 0 ffffffff817a3fd0 rpcbindv2 GETPORT a:call_status q:none Mar 18 22:45:57
> > au kernel: [86061.778827] 286 0281 0 ffff880349f57e00
> > ffff88010affe000 0 ffffffff817a3fd0 rpcbindv2 GETPORT
> > a:call_status q:none Mar 18 22:45:57 au kernel: [86061.778830] 283 0281
> > 0 ffff88041d19ac00 ffff880418650000 0 ffffffff817a3fd0
> > rpcbindv2 GETPORT a:call_status q:none
>
> There's a lot of these GETPORT calls. Is portmap/rpcbind down?

No, it is running.

I think these GETPORT calls get scheduled as tasks on the hanging rpciod.


Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

2011-03-22 15:01:53

by J. Bruce Fields

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Tue, Mar 22, 2011 at 03:52:21PM +0100, Wolfgang Walter wrote:
> On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> > On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > > Hello,
> > >
> > > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > > .31 or .32 and probably older ones): sometimes
> > > one or more rpciod get stuck. I used
> > >
> > > rpcdebug -m rpc -s all
> > >
> > > I get messages as the following one about every second:
> > >
> > > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all tasks
> > > for client ffff88041c51de00 Mar 18 11:15:38 au kernel: [44641.906793]
> > > RPC: killing all tasks for client ffff88041c51de00 Mar 18 11:15:39
> > > au kernel: [44642.906795] RPC: killing all tasks for client
> > > ffff88041c51de00 Mar 18 11:15:40 au kernel: [44643.906793] RPC:
> > > killing all tasks for client ffff88041c51de00 Mar 18 11:15:41 au kernel:
> > > [44644.906795] RPC: killing all tasks for client ffff88041c51de00
> > > Mar 18 11:15:42 au kernel: [44645.906794] RPC: killing all tasks
> > > for client ffff88041c51de00
> > >
> > > and I get this messages:
> > >
> > > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5
> > > ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL
> > > a:rpc_exit_task q:none
> > >
> > > My theorie is this one:
> > >
> > > * this async task is runnable but does not progress (calling
> > > rpc_exit_task). * this is because the same rpciod which handles this task
> > > loops in rpc_shutdown_client waiting for this task to go away.
> > > * because rpc_shutdown_client is called from an async rpc, too
> >
> > Off hand I don't see any place where rpc_shutdown_client() is called
> > from rpciod; do you?
>
> I'm not familiar with the code.
>
> But could it be that this is in fs/nfsd/nfs4state.c ?
>
> Just a guess because 2.6.38 does not have this problem and in 2.6.38 it seems
> to have a workqueue of its own.

Well spotted; yes, it's true that 2.6.32 called put_nfs4_client() from
an rpc_call_done callback, that put_nfs4_client() can end up calling
rpc_shutdown_client(), and that that has since been fixed....
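
To spell the deadlock out, roughly (a from-memory sketch of the 2.6.32 code, so
names and details may be slightly off):

/* fs/nfsd/nfs4callback.c -- .rpc_call_done of the CB_NULL probe,
 * i.e. this runs inside an rpciod worker: */
static void nfsd4_cb_probe_done(struct rpc_task *task, void *calldata)
{
        struct nfs4_client *clp = calldata;

        if (task->tk_status)
                warn_no_callback_path(clp, task->tk_status);
        put_nfs4_client(clp);           /* may drop the last reference */
}

/* fs/nfsd/nfs4state.c: */
void put_nfs4_client(struct nfs4_client *clp)
{
        if (atomic_dec_and_test(&clp->cl_count))
                free_client(clp);       /* -> shutdown_callback_client()
                                         *  -> rpc_shutdown_client(), which then
                                         * waits, on rpciod, for a task that only
                                         * rpciod can run */
}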

If someone wants to backport the fix to 2.6.32.y....

Actually I think it might be sufficient just to apply
147efd0dd702ce2f1ab44449bd70369405ef68fd? But I haven't tried.

--b.

commit 147efd0dd702ce2f1ab44449bd70369405ef68fd
Author: J. Bruce Fields <[email protected]>
Date: Sun Feb 21 17:41:19 2010 -0800

nfsd4: shutdown callbacks on expiry

Once we've expired the client, there's no further purpose to the
callbacks; go ahead and shut down the callback client rather than
waiting for the last reference to go.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index efef7f2..9ce5831 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -697,9 +697,6 @@ shutdown_callback_client(struct nfs4_client *clp)
static inline void
free_client(struct nfs4_client *clp)
{
- shutdown_callback_client(clp);
- if (clp->cl_cb_xprt)
- svc_xprt_put(clp->cl_cb_xprt);
if (clp->cl_cred.cr_group_info)
put_group_info(clp->cl_cred.cr_group_info);
kfree(clp->cl_principal);
@@ -752,6 +749,9 @@ expire_client(struct nfs4_client *clp)
se_perclnt);
release_session(ses);
}
+ shutdown_callback_client(clp);
+ if (clp->cl_cb_xprt)
+ svc_xprt_put(clp->cl_cb_xprt);
put_nfs4_client(clp);
}

2011-03-23 07:54:42

by Wolfgang Walter

Subject: Re: problem with nfs4: rpciod seems to loop in rpc_shutdown_client forever

On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> On Tue, Mar 22, 2011 at 03:52:21PM +0100, Wolfgang Walter wrote:
> > On Tuesday, 22 March 2011, J. Bruce Fields wrote:
> > > On Fri, Mar 18, 2011 at 11:49:21PM +0100, Wolfgang Walter wrote:
> > > > Hello,
> > > >
> > > > I have a problem with our nfs-server (stable 2.6.32.33 but also with
> > > > .31 or .32 and probably older ones): sometimes
> > > > one or more rpciod get stuck. I used
> > > >
> > > > rpcdebug -m rpc -s all
> > > >
> > > > I get messages as the following one about every second:
> > > >
> > > > Mar 18 11:15:37 au kernel: [44640.906793] RPC: killing all
> > > > tasks for client ffff88041c51de00 Mar 18 11:15:38 au kernel:
> > > > [44641.906793] RPC: killing all tasks for client
> > > > ffff88041c51de00 Mar 18 11:15:39 au kernel: [44642.906795] RPC:
> > > > killing all tasks for client ffff88041c51de00 Mar 18 11:15:40 au
> > > > kernel: [44643.906793] RPC: killing all tasks for client
> > > > ffff88041c51de00 Mar 18 11:15:41 au kernel: [44644.906795] RPC:
> > > > killing all tasks for client ffff88041c51de00 Mar 18 11:15:42 au
> > > > kernel: [44645.906794] RPC: killing all tasks for client
> > > > ffff88041c51de00
> > > >
> > > > and I get this messages:
> > > >
> > > > Mar 18 22:45:57 au kernel: [86061.779008] 174 0381 -5
> > > > ffff88041c51de00 (null) 0 ffffffff817211a0 nfs4_cbv1 CB_NULL
> > > > a:rpc_exit_task q:none
> > > >
> > > > My theorie is this one:
> > > >
> > > > * this async task is runnable but does not progress (calling
> > > > rpc_exit_task). * this is because the same rpciod which handles this
> > > > task loops in rpc_shutdown_client waiting for this task to go away. *
> > > > because rpc_shutdown_client is called from an async rpc, too
> > >
> > > Off hand I don't see any place where rpc_shutdown_client() is called
> > > from rpciod; do you?
> >
> > I'm not familiar with the code.
> >
> > But could it be that this is in fs/nfsd/nfs4state.c ?
> >
> > Just a guess because 2.6.38 does not have this problem and in 2.6.38 it
> > seems to have a workqueue of its own.
>
> Well, spotted, yes it's true that 2.6.32 had called put_nfs4_client()
> from an rpc_call_done callback, that put_nfs4_client() can end up
> calling rpc_shutdown_client, and that that's since been fixed....
>
> If someone wants to backport the fix to 2.6.32.y....
>
> Actually I think it might be sufficient just to apply
> 147efd0dd702ce2f1ab44449bd70369405ef68fd ? But I haven't tried.
>
> --b.
>
> commit 147efd0dd702ce2f1ab44449bd70369405ef68fd
> Author: J. Bruce Fields <[email protected]>
> Date: Sun Feb 21 17:41:19 2010 -0800
>
> nfsd4: shutdown callbacks on expiry
>
> Once we've expired the client, there's no further purpose to the
> callbacks; go ahead and shut down the callback client rather than
> waiting for the last reference to go.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index efef7f2..9ce5831 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -697,9 +697,6 @@ shutdown_callback_client(struct nfs4_client *clp)
> static inline void
> free_client(struct nfs4_client *clp)
> {
> - shutdown_callback_client(clp);
> - if (clp->cl_cb_xprt)
> - svc_xprt_put(clp->cl_cb_xprt);
> if (clp->cl_cred.cr_group_info)
> put_group_info(clp->cl_cred.cr_group_info);
> kfree(clp->cl_principal);
> @@ -752,6 +749,9 @@ expire_client(struct nfs4_client *clp)
> se_perclnt);
> release_session(ses);
> }
> + shutdown_callback_client(clp);
> + if (clp->cl_cb_xprt)
> + svc_xprt_put(clp->cl_cb_xprt);
> put_nfs4_client(clp);
> }

I'll test it this weekend.

I use 2.6.38 on this server for now and will probably stay with it. But having
a working longterm kernel to fall back on is important to me :-).

Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts