2009-11-25 07:36:09

by Andrew Morton

[permalink] [raw]
Subject: Fw: Deadlock regression in v2.6.31.6



Begin forwarded message:

Date: Tue, 24 Nov 2009 01:24:53 +0100
From: "Stephen R. van den Berg" <[email protected]>
To: [email protected]
Subject: Deadlock regression in v2.6.31.6


During a Debian package update for the man-db package (regenerating the
man index), the kernel reproducibly entered into a deadlock which
resulted in the kernel getting stuck in rpciod (the kernel version).
This is happening on a system which has an NFS-mounted root-filesystem
(no other filesystems, except for tmpfs on /tmp).

The problem vanishes as soon as I run v2.6.31.5 (neither kernel contains
any significant modules).

Kernel trace at the point of deadlock:
---------------------------------------------------------------------------
Pid: 258, comm: rpciod/0 Not tainted (2.6.31.6-00004-gd334df2 #140)
EIP: 0060:[<c0216eae>] EFLAGS: 00000246 CPU: 0
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: cf890000 EBX: cfa66400 ECX: c03834c4 EDX: cfa66534
ESI: cfa90420 EDI: cf407280 EBP: cf891ef0 ESP: cf891ee8
DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
CR0: 8005003b CR2: b75a00a4 CR3: 0e8f3000 CR4: 00000090
Call Trace:
[<c03f3129>] ? 0xc03f3129
[<c03f4435>] 0xc03f4435
[<c039a23f>] 0xc039a23f
[<c0335bf2>] 0xc0335bf2
[<c03f3c04>] 0xc03f3c04
[<c03f2cd4>] 0xc03f2cd4
[<c03f0d49>] 0xc03f0d49
[<c03f5eeb>] 0xc03f5eeb
[<c03f60a7>] 0xc03f60a7
[<c021df90>] 0xc021df90
[<c03f609c>] ? 0xc03f609c
[<c0220669>] ? 0xc0220669
[<c021dee5>] ? 0xc021dee5
[<c02203d0>] 0xc02203d0
[<c0220368>] ? 0xc0220368
[<c0202d63>] 0xc0202d63
Pid: 258, comm: rpciod/0 Not tainted (2.6.31.6-00004-gd334df2 #140)
EIP: 0060:[<c0390d6c>] EFLAGS: 00000246 CPU: 0
EAX: ffffff9d EBX: cfa90420 ECX: 00000000 EDX: cfa15000
ESI: 97817dd4 EDI: 97817dd4 EBP: cf891f10 ESP: cf891e78
DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
CR0: 8005003b CR2: b75a00a4 CR3: 0e8f3000 CR4: 00000090
Call Trace:
[<c0336d75>] ? 0xc0336d75
[<c03f38f8>] ? 0xc03f38f8
[<c0216ead>] ? 0xc0216ead
[<c039a79e>] 0xc039a79e
[<c021de23>] ? 0xc021de23
[<c021e177>] ? 0xc021e177
[<c0335b96>] 0xc0335b96
[<c03f413c>] 0xc03f413c
[<c0216ead>] ? 0xc0216ead
[<c03f5f0b>] ? 0xc03f5f0b
[<c03f420b>] 0xc03f420b
[<c021df90>] 0xc021df90
[<c03f41f7>] ? 0xc03f41f7
[<c0220669>] ? 0xc0220669
[<c021dee5>] ? 0xc021dee5
[<c02203d0>] 0xc02203d0
[<c0220368>] ? 0xc0220368
[<c0202d63>] 0xc0202d63
Warning (Oops_read): Code line not seen, dumping what data is available


>>EIP; c0216eae <local_bh_enable+c/e> <=====

>>ECX; c03834c4 <tcp_shutdown+0/43>

Trace; c03f3129 <xprt_disconnect_done+39/3e>
Trace; c03f4435 <xs_tcp_state_change+1cd/1d4>
Trace; c039a23f <inet_shutdown+b3/c2>
Trace; c0335bf2 <kernel_sock_shutdown+9/b>
Trace; c03f3c04 <xs_tcp_send_request+103/117>
Trace; c03f2cd4 <xprt_transmit+142/214>
Trace; c03f0d49 <call_transmit+1b2/1ed>
Trace; c03f5eeb <__rpc_execute+6e/1d2>
Trace; c03f60a7 <rpc_async_schedule+b/d>
Trace; c021df90 <worker_thread+ab/109>
Trace; c03f609c <rpc_async_schedule+0/d>
Trace; c0220669 <autoremove_wake_function+0/2f>
Trace; c021dee5 <worker_thread+0/109>
Trace; c02203d0 <kthread+68/6d>
Trace; c0220368 <kthread+0/6d>
Trace; c0202d63 <kernel_thread_helper+7/10>

>>EIP; c0390d6c <tcp_v4_connect+2a0/3c1> <=====

>>ESI; 97817dd4 <phys_startup_32+97617dd4/c0000000>
>>EDI; 97817dd4 <phys_startup_32+97617dd4/c0000000>

Trace; c0336d75 <kernel_sendmsg+28/37>
Trace; c03f38f8 <xs_send_kvec+6d/77>
Trace; c0216ead <local_bh_enable+b/e>
Trace; c039a79e <inet_stream_connect+85/1e6>
Trace; c021de23 <queue_work+e/10>
Trace; c021e177 <queue_delayed_work+c/1e>
Trace; c0335b96 <kernel_connect+d/12>
Trace; c03f413c <xs_tcp_setup_socket+1f9/29e>
Trace; c0216ead <local_bh_enable+b/e>
Trace; c03f5f0b <__rpc_execute+8e/1d2>
Trace; c03f420b <xs_tcp_connect_worker4+14/16>
Trace; c021df90 <worker_thread+ab/109>
Trace; c03f41f7 <xs_tcp_connect_worker4+0/16>
Trace; c0220669 <autoremove_wake_function+0/2f>
Trace; c021dee5 <worker_thread+0/109>
Trace; c02203d0 <kthread+68/6d>
Trace; c0220368 <kthread+0/6d>
Trace; c0202d63 <kernel_thread_helper+7/10>
---------------------------------------------------------------------------

strace log of the userspace processes at the time of deadlock:
---------------------------------------------------------------------------
[pid 3911] brk(0x8d0d000) = 0x8d0d000
[pid 3911] close(7) = 0
[pid 3911] munmap(0xb75bc000, 524288) = 0
[pid 3911] brk(0x8d0b000) = 0x8d0b000
[pid 3911] brk(0x8d07000) = 0x8d07000
[pid 3911] write(1, "\\F[]\ncommand for more information\\&. This command corresponds to the GetPrinterData() MS Platform SDK function\\&.\n.RE\n.PP\ngetdataex\n.RS 4\nGet printer driver data with keyname\n.RE\n.PP\ngetdriver <printername>\n.RS 4\nRetrieve the printer driver information (such as driver file, config file, dependent fi"..., 4041) = 4041
[pid 3911] exit_group(0) = ?
Process 3911 detached
Process 3912 attached
[pid 3314] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7672728) = 3912
[pid 3912] rt_sigaction(SIGHUP, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3912] rt_sigaction(SIGHUP, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3912] rt_sigaction(SIGINT, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3912] rt_sigaction(SIGINT, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3912] rt_sigaction(SIGTERM, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3912] rt_sigaction(SIGTERM, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3912] dup2(8, 0) = 0
[pid 3912] close(8) = 0
[pid 3912] dup2(9, 1 <unfinished ...>
[pid 3314] close(8 <unfinished ...>
[pid 3912] <... dup2 resumed> ) = 1
[pid 3912] close(9) = 0
[pid 3912] close(7) = 0
[pid 3912] dup(0) = 7
[pid 3912] rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
[pid 3912] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
[pid 3912] pipe([8, 9]) = 0
[pid 3912] rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
[pid 3912] clone(Process 3913 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7672728) = 3913
[pid 3913] rt_sigaction(SIGHUP, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3913] rt_sigaction(SIGHUP, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3913] rt_sigaction(SIGINT, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3913] rt_sigaction(SIGINT, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3913] rt_sigaction(SIGTERM, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3913] rt_sigaction(SIGTERM, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3913] dup2(7, 0) = 0
[pid 3913] close(7) = 0
[pid 3913] dup2(9, 1) = 1
[pid 3913] close(9) = 0
[pid 3913] close(8) = 0
[pid 3913] close(7) = -1 EBADF (Bad file descriptor)
[pid 3913] dup(0) = 7
[pid 3913] fcntl64(7, F_GETFL) = 0 (flags O_RDONLY)
[pid 3913] fstat64(7, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
[pid 3913] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77da000
[pid 3913] _llseek(7, 0, 0xbff30a18, SEEK_CUR) = -1 ESPIPE (Illegal seek)
[pid 3913] read(7, ".\\\" Title: rpcclient\n.\\\" Author: [see the \"AUTHOR\" section]\n.\\\" Generator: DocBook XSL Stylesheets v1.74.0 <http://docbook.sf.net/>\n.\\\" Date: 09/30/2009\n.\\\" Manual: User Commands\n.\\\" Source: Samba 3.4\n.\\\" Language: English\n.\\\"\n.TH \"RPCCLIENT\" \"1\" \"09/30/2009\" \"Samba 3\\&.4\" \"User C"..., 16384) = 16384
[pid 3913] _llseek(7, 0, 0xbff30a0c, SEEK_CUR) = -1 ESPIPE (Illegal seek)
[pid 3913] fstat64(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
[pid 3913] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77d9000
[pid 3913] write(1, ".\\\" Title: rpcclient\n.\\\" Author: [see the \"AUTHOR\" section]\n.\\\" Generator: DocBook XSL Stylesheets v1.74.0 <http://docbook.sf.net/>\n.\\\" Date: 09/30/2009\n.\\\" Manual: User Commands\n.\\\" Source: Samba 3.4\n.\\\" Language: English\n.\\\"\n.TH \"RPCCLIENT\" \"1\" \"09/30/2009\" \"Samba 3\\&.4\" \"User C"..., 4096) = 4096
[pid 3913] write(1, "\n.PP\nThis tool is part of the\n\\fBsamba\\fR(7)\nsuite\\&.\n.PP\n\\FCrpcclient\\F[]\nis a utility initially developed to test MS\\-RPC functionality in Samba itself\\&. It has undergone several stages of development and stability\\&. Many system administrators have now written scripts around it to manage Windows"..., 4096) = 4096
[pid 3913] write(1, ".RS 4\nSets the SMB username or username and password\\&.\n.sp\nIf %password is not specified, the user will be prompted\\&. The client will first check the\n\\fBUSER\\fR\nenvironment variable, then the\n\\fBLOGNAME\\fR\nvariable and if either exists, the string is uppercased\\&. If these environmental variables "..., 4096) = 4096
[pid 3913] read(7, "\\F[]\ncommand for more information\\&. This command corresponds to the GetPrinterData() MS Platform SDK function\\&.\n.RE\n.PP\ngetdataex\n.RS 4\nGet printer driver data with keyname\n.RE\n.PP\ngetdriver <printername>\n.RS 4\nRetrieve the printer driver information (such as driver file, config file, dependent fi"..., 4096) = 4041
[pid 3913] read(7, "", 4096) = 0
[pid 3913] write(1, "RE\n.PP\nenumdomgroups\n.RS 4\nEnumerate domain groups\n.RE\n.PP\nenumalsgroups\n.RS 4\nEnumerate alias groups\n.RE\n.PP\ncreatedomuser\n.RS 4\nCreate domain user\n.RE\n.PP\nsamlookupnames\n.RS 4\nLook up names\n.RE\n.PP\nsamlookuprids\n.RS 4\nLook up names\n.RE\n.PP\ndeletedomuser\n.RS 4\nDelete domain user\n.RE\n.PP\nsamquerysec"..., 4096) = 4096
[pid 3913] close(7) = 0
[pid 3913] munmap(0xb77da000, 4096) = 0
[pid 3913] write(1, "\\F[]\ncommand for more information\\&. This command corresponds to the GetPrinterData() MS Platform SDK function\\&.\n.RE\n.PP\ngetdataex\n.RS 4\nGet printer driver data with keyname\n.RE\n.PP\ngetdriver <printername>\n.RS 4\nRetrieve the printer driver information (such as driver file, config file, dependent fi"..., 4041) = 4041
[pid 3913] exit_group(0) = ?
Process 3913 detached
[pid 3314] <... close resumed> ) = 0
[pid 3912] close(7 <unfinished ...>
[pid 3314] close(9) = 0
[pid 3314] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 3314] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 3314] waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 3911
[pid 3314] waitpid(-1, 0xbff31b08, WNOHANG) = 0
[pid 3314] sigreturn() = ? (mask now [])
[pid 3314] read(7, <unfinished ...>
[pid 3912] <... close resumed> ) = 0
[pid 3912] close(9) = 0
[pid 3912] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
[pid 3912] read(8, ".\\\" Title: rpcclient\n.\\\" Author: [see the \"AUTHOR\" section]\n.\\\" Generator: DocBook XSL Stylesheets v1.74.0 <http://docbook.sf.net/>\n.\\\" Date: 09/30/2009\n.\\\" Manual: User Commands\n.\\\" Source: Samba 3.4\n.\\\" Language: English\n.\\\"\n.TH \"RPCCLIENT\" \"1\" \"09/30/2009\" \"Samba 3\\&.4\" \"User C"..., 4096) = 4096
[pid 3912] open("/usr/lib/gconv/gconv-modules.cache", O_RDONLY) = 7
[pid 3912] fstat64(7, {st_mode=S_IFREG|0644, st_size=26048, ...}) = 0
[pid 3912] mmap2(NULL, 26048, PROT_READ, MAP_SHARED, 7, 0) = 0xb7635000
[pid 3912] close(7) = 0
[pid 3912] brk(0x8d10000) = 0x8d10000
[pid 3912] read(8, "\n.PP\nThis tool is part of the\n\\fBsamba\\fR(7)\nsuite\\&.\n.PP\n\\FCrpcclient\\F[]\nis a utility initially developed to test MS\\-RPC functionality in Samba itself\\&. It has undergone several stages of development and stability\\&. Many system administrators have now written scripts around it to manage Windows"..., 61440) = 16329
[pid 3912] read(8, "", 45111) = 0
[pid 3912] brk(0x8d40000) = 0x8d40000
[pid 3912] fstat64(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
[pid 3912] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77da000
[pid 3912] write(1, ".\\\" Title: rpcclient\n.\\\" Author: [see the \"AUTHOR\" section]\n.\\\" Generator: DocBook XSL Stylesheets v1.74.0 <http://docbook.sf.net/>\n.\\\" Date: 09/30/2009\n.\\\" Manual: User Commands\n.\\\" Source: Samba 3.4\n.\\\" Language: English\n.\\\"\n.TH \"RPCCLIENT\" \"1\" \"09/30/2009\" \"Samba 3\\&.4\" \"User C"..., 16384) = 16384
[pid 3314] <... read resumed> ".\\\" Title: rpcclient\n.\\\" Author: [see the \"AUTHOR\" section]\n.\\\" Generator: DocBook XSL Stylesheets v1.74.0 <http://docbook.sf.net/>\n.\\\" Date: 09/30/2009\n.\\\" Manual: User Commands\n.\\\" Source: Samba 3.4\n.\\\" Language: English\n.\\\"\n.TH \"RPCCLIENT\" \"1\" \"09/30/2009\" \"Samba 3\\&.4\" \"User C"..., 1024) = 1024
[pid 3314] read(7, "oupper \\\\$*\n.el \\{\\\n\\\\$*\n.\\}\n..\n.\\\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n.\\\" SH - level-one heading that works better for non-TTY output\n.\\\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n.de1 SH\n.if t \\{\\\n.sp 1\n.\\}\n.sp \\\\n[PD]u\n.nr an-level 1\n.set-an-"..., 1024) = 1024
[pid 3314] read(7, "dent \\\\n[IN]\n.fi\n.in \\\\n[IN]u\n.ti \\\\n[SN]u\n.it 1 an-trap\n.nr an-no-space-flag 1\n.nr an-break-flag 1\n.ps \\\\n[PS-SS]u\n.ps +2\n.ft B\n.ne (2v + 1u)\n.if \\\\n[.$] \\&\\\\$*\n..\n.\\\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n.\\\" BB/BE - put background/screen (filled box) around block of te"..., 1024) = 1024
[pid 3314] read(7, "~~~~~~~~~~~~~~~~~~~~~~~~\n.de BM\n.if t \\{\\\n.br\n.ll -2n\n.gcolor red\n.di BX\n.\\}\n..\n.de EM\n.if t \\{\\\n.br\n.di\n.ll\n.gcolor\n.nr BH \\\\n(dn\n.ne \\\\n(BHu\n\\M[\\\\$1]\\D'P -.75n 0 0 \\\\n(BHu -(\\\\n[.i]u - \\\\n(INu - .75n) 0 0 -\\\\n(BHu'\\M[]\n.in 0\n.nf\n.BX\n.in\n.fi\n.\\}\n..\n.\\\" ----------------------------------------------"..., 1024) = 1024
[pid 3314] read(7, "\n.PP\nThis tool is part of the\n\\fBsamba\\fR(7)\nsuite\\&.\n.PP\n\\FCrpcclient\\F[]\nis a utility initially developed to test MS\\-RPC functionality in Samba itself\\&. It has undergone several stages of development and stability\\&. Many system administrators have now written scripts around it to manage Windows"..., 1024) = 1024
[pid 3314] read(7, "is parameter will force the client to assume that the server is on the machine with the specified IP address and the NetBIOS name component of the resource being connected to will be ignored\\&.\n.sp\nThere is no default for this parameter\\&. If not supplied, it will be determined automatically by the "..., 1024) = 1024
[pid 3314] read(7, "xtremely cryptic\\&.\n.sp\nNote that specifying this parameter here will override the\n\\m[blue]\\fBlog level\\fR\\m[]\nparameter in the\n\\FCsmb\\&.conf\\F[]\nfile\\&.\n.RE\n.PP\n\\-V\n.RS 4\nPrints the program version number\\&.\n.RE\n.PP\n\\-s <configuration file>\n.RS 4\nThe file specified contains the configuration detail"..., 1024) = 1024
[pid 3314] read(7, "that does not require a password\\&.\n.sp\nUnless a password is specified on the command line or this parameter is specified, the client will request a password\\&.\n.sp\nIf a password is specified on the command line and this option is also defined the password on the command line will be silently ingnor"..., 1024) = 1024
[pid 3314] read(7, ".RS 4\nSets the SMB username or username and password\\&.\n.sp\nIf %password is not specified, the user will be prompted\\&. The client will first check the\n\\fBUSER\\fR\nenvironment variable, then the\n\\fBLOGNAME\\fR\nvariable and if either exists, the string is uppercased\\&. If these environmental variables "..., 1024) = 1024
[pid 3314] read(7, "ry NetBIOS name>\n.RS 4\nThis option allows you to override the NetBIOS name that Samba uses for itself\\&. This is identical to setting the\n\\m[blue]\\fBnetbios name\\fR\\m[]\nparameter in the\n\\FCsmb\\&.conf\\F[]\nfile\\&. However, a command line setting will take precedence over settings in\n\\FCsmb\\&.conf\\F[]\\"..., 1024) = 1024
[pid 3314] read(7, "tions\n.RS 4\nTCP socket options to set on the client socket\\&. See the socket options parameter in the\n\\FCsmb\\&.conf\\F[]\nmanual page for the list of valid options\\&.\n.RE\n.PP\n\\-h|\\-\\-help\n.RS 4\nPrint a summary of command line options\\&.\n.RE\n.SH \"COMMANDS\"\n.SS \"LSARPC\"\n.PP\nlsaquery\n.RS 4\nQuery info pol"..., 1024) = 1024
[pid 3314] read(7, "saquerysecobj\n.RS 4\nQuery LSA security object\n.RE\n.SS \"LSARPC\\-DS\"\n.PP\ndsroledominfo\n.RS 4\nGet Primary Domain Information\n.RE\n.PP\n.PP\n\\fIDFS\\fR\n.PP\ndfsexist\n.RS 4\nQuery DFS support\n.RE\n.PP\ndfsadd\n.RS 4\nAdd a DFS share\n.RE\n.PP\ndfsremove\n.RS 4\nRemove a DFS share\n.RE\n.PP\ndfsgetinfo\n.RS 4\nQuery DFS shar"..., 1024) = 1024
[pid 3314] read(7, "RE\n.PP\nenumdomgroups\n.RS 4\nEnumerate domain groups\n.RE\n.PP\nenumalsgroups\n.RS 4\nEnumerate alias groups\n.RE\n.PP\ncreatedomuser\n.RS 4\nCreate domain user\n.RE\n.PP\nsamlookupnames\n.RS 4\nLook up names\n.RE\n.PP\nsamlookuprids\n.RS 4\nLook up names\n.RE\n.PP\ndeletedomuser\n.RS 4\nDelete domain user\n.RE\n.PP\nsamquerysec"..., 1024) = 1024
[pid 3314] read(7, ":\\e\nHelp File Name:\\e\nLanguage Monitor Name:\\e\nDefault Data Type:\\e\nComma Separated list of Files\n.EB lightgray adjust-for-leading-newline\n.if t \\{\\\n.sp 1\n.\\}\n.fi\n.fam\n.ps +1\n.if n \\{\\\n.RE\n.\\}\n.sp\nAny empty fields should be enter as the string \"NULL\"\\&.\n.sp\nSamba does not need to support the concept"..., 1024) = 1024
[pid 3314] read(7, "the printer driver must already be installed on the server (see\n\\FCadddriver\\F[]) and the\n\\fIport\\fRmust be a valid port name (see\n\\FCenumports\\F[]\\&.\n.RE\n.PP\ndeldriver\n.RS 4\nDelete the specified printer driver for all architectures\\&. This does not delete the actual driver files from the server, on"..., 1024) = 1024
[pid 3314] read(7, "bs <printer>\n.RS 4\nList the jobs and status of a given printer\\&. This command corresponds to the MS Platform SDK EnumJobs() function\n.RE\n.PP\nenumkey\n.RS 4\nEnumerate printer keys\n.RE\n.PP\nenumports [level]\n.RS 4\nExecutes an EnumPorts() call using the specified info level\\&. Currently only info levels"..., 1024) = 1024
[pid 3314] read(7, <unfinished ...>
[pid 3912] read(8, "", 65536) = 0
[pid 3912] read(8, "", 65536) = 0
[pid 3912] close(8) = 0
[pid 3912] waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 3913
[pid 3912] write(1, "\\F[]\ncommand for more information\\&. This command corresponds to the GetPrinterData() MS Platform SDK function\\&.\n.RE\n.PP\ngetdataex\n.RS 4\nGet printer driver data with keyname\n.RE\n.PP\ngetdriver <printername>\n.RS 4\nRetrieve the printer driver information (such as driver file, config file, dependent fi"..., 4041) = 4041
[pid 3314] <... read resumed> "\\F[]\ncommand for more information\\&. This command corresponds to the GetPrinterData() MS Platform SDK function\\&.\n.RE\n.PP\ngetdataex\n.RS 4\nGet printer driver data with keyname\n.RE\n.PP\ngetdriver <printername>\n.RS 4\nRetrieve the printer driver information (such as driver file, config file, dependent fi"..., 1024) = 1024
[pid 3314] read(7, "irectory\n.RE\n.PP\nopenprinter <printername>\n.RS 4\nExecute an OpenPrinterEx() and ClosePrinter() RPC against a given printer\\&.\n.RE\n.PP\nsetdriver <printername> <drivername>\n.RS 4\nExecute a SetPrinter() command to update the printer driver associated with an installed printer\\&. The printer driver must"..., 1024) = 1024
[pid 3314] read(7, "4\nQuery Sam Deltas\n.RE\n.PP\nsamlogon\n.RS 4\nSam Logon\n.RE\n.SS \"GENERAL COMMANDS\"\n.PP\ndebuglevel\n.RS 4\nSet the current debug level used to log information\\&.\n.RE\n.PP\nhelp (?)\n.RS 4\nPrint a listing of all known commands or extended help on a particular command\\&.\n.RE\n.PP\nquit (exit)\n.RS 4\nExit\n\\FCrpccli"..., 1024) = 1024
[pid 3314] read(7, "implementation is also a bit rough, and as more of the services are understood, it can even result in versions of\n\\fBsmbd\\fR(8)\nand\n\\fBrpcclient\\fR(1)\nthat are incompatible for some commands or services\\&. Additionally, the developers are sending reports to Microsoft, and problems found or reported "..., 1024) = 969
[pid 3314] read(7, <unfinished ...>
[pid 3912] exit_group(0) = ?
Process 3912 detached
[pid 3314] <... read resumed> "", 1024) = 0
[pid 3314] close(7) = 0
[pid 3314] waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 3912
[pid 3314] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 3314] sigreturn() = ? (mask now [])
[pid 3314] waitpid(-1, 0xbff31fe8, WNOHANG) = -1 ECHILD (No child processes)
[pid 3314] lseek(5, 129260, SEEK_SET) = 129260
[pid 3314] write(5, "rpcclient\0", 10) = 10
[pid 3314] write(5, "-\t1\t1\t1254554345\tA\t-\t-\tgz\ttool for executing client side MS-RPC functions\0", 74) = 74
[pid 3314] lseek(5, 65536, SEEK_SET) = 65536
[pid 3314] write(5, "\4\0\0\0\20\0\0\0\325\313\1\0.\0\0\0\5\t\2\0009\0\0\0&\244\1\0\337s\0\0!\f\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\275\1\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0]\34L-squi\0\200\1\0\21\0\0\0+\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&@\332/git-<\200\1\0\20\0\0\0J\0\0\0kA\16yedit\226\200\1\0\t\0\0\0001\0\0\0c\34\2221sane\320\200\1\0\v\0\0\0P\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
[pid 3314] lstat64("/usr/share/man/man1/netkit-rsh.1.gz", {st_mode=S_IFREG|0644, st_size=2227, ...}) = 0
[pid 3314] lseek(5, 118788, SEEK_SET) = 118788
[pid 3314] read(5, "netkit-rsh\0-\t1\t1\t1187257038\tA\t-\t-\tgz\tremote shell\0", 50) = 50
[pid 3314] lstat64("/usr/share/man/man1/wbinfo.1.gz", {st_mode=S_IFREG|0644, st_size=3872, ...}) = 0
[pid 3314] stat64("/usr/share/man/man1/wbinfo.1.gz", {st_mode=S_IFREG|0644, st_size=3872, ...}) = 0
[pid 3314] stat64("/usr/share/man/man1/wbinfo.1.gz", {st_mode=S_IFREG|0644, st_size=3872, ...}) = 0
[pid 3314] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 3314] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 3314] open("/usr/share/man/man1/wbinfo.1.gz", O_RDONLY|O_LARGEFILE) = 7
[pid 3314] pipe([8, 9]) = 0
[pid 3314] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 3314] clone(Process 3914 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7672728) = 3914
[pid 3914] rt_sigaction(SIGHUP, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3914] rt_sigaction(SIGHUP, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3914] rt_sigaction(SIGINT, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3914] rt_sigaction(SIGINT, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3914] rt_sigaction(SIGTERM, NULL, {0x8058790, [], 0}, 8) = 0
[pid 3914] rt_sigaction(SIGTERM, {0x8058790, [], 0}, NULL, 8) = 0
[pid 3914] dup2(7, 0) = 0
[pid 3914] close(7) = 0
[pid 3914] dup2(9, 1
---------------------------------------------------------------------------

It reproducibly ends at the dup2(9,1), exact same spot (even after a reboot).

The problem is gone now though, after backing down to 2.6.31.5.
--
Sincerely,
Stephen R. van den Berg.


2009-11-25 21:58:13

by Stephen R. van den Berg

[permalink] [raw]
Subject: Re: Fw: Deadlock regression in v2.6.31.6

On Wed, Nov 25, 2009 at 15:31, Trond Myklebust
<[email protected]> wrote:
> On Wed, 2009-11-25 at 09:56 +0100, Stephen R. van den Berg wrote:
>> The kernel hangs at this point, the only way to get out of there is
>> using SysBreak.

> This just means that the RPC client is waiting for a reply from the NFS
> server.

Even if it is, I'd say that the kernel has no business "busywaiting"
for that response. It should yield to other processes. So even if
the response does get lost, the kernel shouldn't become locked.

> Does 'netstat -t' show that there is an active TCP connection to the
> server's nfs port?
> Does wireshark show that the client should have received a reply?

I'll do a packettrace and let you know.
--
Sincerely,
Stephen R. van den Berg.

2009-11-25 23:11:50

by Stephen R. van den Berg

[permalink] [raw]
Subject: Re: Fw: Deadlock regression in v2.6.31.6

On Wed, Nov 25, 2009 at 15:31, Trond Myklebust
<[email protected]> wrote:
> On Wed, 2009-11-25 at 09:56 +0100, Stephen R. van den Berg wrote:
>> RPC: =A09697 setting alarm for 60000 ms
>> RPC: =A09697 __rpc_wake_up_task (now 7827)
>> RPC: =A09697 disabling timer
>> RPC: =A09697 removed from queue cfa72d88 "xprt_pending"
>> RPC: =A0 =A0 =A0 __rpc_wake_up_task done
>> RPC: =A09697 __rpc_execute flags=3D0x1 cf849c44

>> The kernel hangs at this point, the only way to get out of there is
>> using SysBreak.

> This just means that the RPC client is waiting for a reply from the N=
=46S
> server.

> Does 'netstat -t' show that there is an active TCP connection to the
> server's nfs port?
> Does wireshark show that the client should have received a reply?

Got a pcap file captured by tcpdump on a Linux router in between the tw=
o.
The NFS server is a unfs3 0.9.22+dfsg-2 (Debian) running in 32-bit
userspace on a server using a 2.6.31.5 Linux kernel in 64-bit
kernelspace on HP hardware using a hardware accellerated network card
which splits up TCP packets by itself before putting them on the wire.
The pcap is the result that is found on the wire by capturing it on an
intermediate machine.
The nfs daemon doesn't show any log messages.

The pcap file can (temporarily) be found at:
http://master.icecube.cuci.nl/linuxkerntcpnfs-linuxclient.pcap.gz

A sneak preview of the last remaining packets is:
----------------------------------------------
23:53:47.028266 IP (tos 0x0, ttl 64, id 14117, offset 0, flags [DF],
proto TCP (6), length 168)
1.2.3.167.1912647778 > 1.2.3.151.2049: 124 lookup fh
Unknown/00090000824B680000000000824B68000000000007E00F0CAF49AAA10000000=
=46
"netkit-rsh.1.gz"
23:53:47.028642 IP (tos 0x0, ttl 63, id 29190, offset 0, flags [DF],
proto TCP (6), length 284)
1.2.3.151.2049 > 1.2.3.167.1912647778: reply ok 240 lookup fh
Unknown/00090000514A680000000000514A68000000000008E00F0CAF49AAA17300000=
0
REG 1 ids 1/0 sz 0 nlink 33188 rdev 4096/0 fsid 0 fileid 90000000000
a/m/ctime 6834769.1254731170 0.1187257038 0.1254731185 post dattr:
23:53:47.029104 IP (tos 0x0, ttl 64, id 14118, offset 0, flags [DF],
proto TCP (6), length 164)
1.2.3.167.1929424994 > 1.2.3.151.2049: 120 lookup fh
Unknown/00090000824B680000000000824B68000000000007E00F0CAF49AAA10000000=
B
"wbinfo.1.gz"
23:53:47.029502 IP (tos 0x0, ttl 63, id 29191, offset 0, flags [DF],
proto TCP (6), length 284)
1.2.3.151.2049 > 1.2.3.167.1929424994: reply ok 240 lookup fh
Unknown/00090000C2D9650000000000C2D965000000000008E00F0CAF49AAA17E00000=
0
REG 1 ids 1/0 sz 0 nlink 33188 rdev 4096/0 fsid 0 fileid 90000000000
a/m/ctime 6674882.1259187953 0.1254554345 0.1257462034 post dattr:
23:53:47.029946 IP (tos 0x0, ttl 64, id 14119, offset 0, flags [DF],
proto TCP (6), length 152)
1.2.3.167.1946202210 > 1.2.3.151.2049: 108 getattr fh
Unknown/00090000C2D9650000000000C2D965000000000008E00F0CAF49AAA17E00000=
0
23:53:47.030258 IP (tos 0x0, ttl 63, id 29192, offset 0, flags [DF],
proto TCP (6), length 156)
1.2.3.151.2049 > 1.2.3.167.1946202210: reply ok 112 getattr REG
100644 ids 0/0 sz 3872
23:53:47.030537 IP (tos 0x0, ttl 64, id 14120, offset 0, flags [DF],
proto TCP (6), length 156)
1.2.3.167.1962979426 > 1.2.3.151.2049: 112 access fh
Unknown/00090000C2D9650000000000C2D965000000000008E00F0CAF49AAA17E00000=
0
7e000000
23:53:47.030853 IP (tos 0x0, ttl 63, id 29193, offset 0, flags [DF],
proto TCP (6), length 164)
1.2.3.151.2049 > 1.2.3.167.1962979426: reply ok 120 access attr:
REG 100644 ids 0/0 sz 3872 nlink 1 rdev 0/0 fsid 900 fileid 65d9c2
a/m/ctime 1259187953.000000 1254554345.000000 1257462034.000000 c 0001
23:53:47.032528 IP (tos 0x0, ttl 64, id 14121, offset 0, flags [DF],
proto TCP (6), length 164)
1.2.3.167.1979756642 > 1.2.3.151.2049: 120 read fh
Unknown/00090000C2D9650000000000C2D965000000000008E00F0CAF49AAA17E00000=
0
0 bytes @ 9079256848778919936
23:53:47.033322 IP (tos 0x0, ttl 63, id 29194, offset 0, flags [DF],
proto TCP (6), length 1500)
1.2.3.151.2049 > 1.2.3.167.1979756642: reply ok 1456 read REG
100644 ids 0/0 sz 3872 nlink 1 rdev 0/0 fsid 900 fileid 65d9c2
a/m/ctime 1259187953.000000 1254554345.000000 1257462034.000000 3872
bytes EOF
23:53:47.033447 IP (tos 0x0, ttl 63, id 29195, offset 0, flags [DF],
proto TCP (6), length 1500)
1.2.3.151.2049 > 1.2.3.167.739: Flags [.], cksum 0x3c5c (correct),
seq 40440:41900, ack 305937, win 65535, length 1460
23:53:47.033540 IP (tos 0x0, ttl 63, id 29196, offset 0, flags [DF],
proto TCP (6), length 1120)
1.2.3.151.2049 > 1.2.3.167.739: Flags [P.], cksum 0x75cc
(correct), seq 41900:42980, ack 305937, win 65535, length 1080
23:53:47.034112 IP (tos 0x0, ttl 64, id 14122, offset 0, flags [DF],
proto TCP (6), length 40)
1.2.3.167.739 > 1.2.3.151.2049: Flags [.], cksum 0x0c81 (correct),
seq 305937, ack 41900, win 65535, length 0
23:53:47.068991 IP (tos 0x0, ttl 64, id 14123, offset 0, flags [DF],
proto TCP (6), length 40)
1.2.3.167.739 > 1.2.3.151.2049: Flags [.], cksum 0x0849 (correct),
seq 305937, ack 42980, win 65535, length 0
23:53:47.069180 IP (tos 0x0, ttl 63, id 29197, offset 0, flags [DF],
proto TCP (6), length 48)
1.2.3.151.2049 > 1.2.3.167.739: Flags [P.], cksum 0xd905
(correct), seq 42980:42988, ack 305937, win 65535, length 8
23:53:47.069335 IP (tos 0x0, ttl 64, id 14124, offset 0, flags [DF],
proto TCP (6), length 40)
1.2.3.167.739 > 1.2.3.151.2049: Flags [.], cksum 0x0841 (correct),
seq 305937, ack 42988, win 65535, length 0
23:53:47.069393 IP (tos 0x0, ttl 64, id 14125, offset 0, flags [DF],
proto TCP (6), length 40)
1.2.3.167.739 > 1.2.3.151.2049: Flags [F.], cksum 0x0840
(correct), seq 305937, ack 42988, win 65535, length 0
23:53:47.069589 IP (tos 0x0, ttl 63, id 29198, offset 0, flags [DF],
proto TCP (6), length 40)
1.2.3.151.2049 > 1.2.3.167.739: Flags [F.], cksum 0x083f
(correct), seq 42988, ack 305938, win 65535, length 0
23:53:47.069747 IP (tos 0x0, ttl 64, id 14126, offset 0, flags [DF],
proto TCP (6), length 40)
1.2.3.167.739 > 1.2.3.151.2049: Flags [.], cksum 0x083f (correct),
seq 305938, ack 42989, win 65535, length 0
---------------------------------------------------

1.2.3.167 is the Linux client kernel which locks up, 1.2.3.151 is the
unfs server.
It looks like the client terminates the TCP connection. The server
confirms it, the client then sends a final acknowledge. At that point
the client kernel locks up in the infinite loop.
--=20
Sincerely,
Stephen R. van den Berg.

2009-11-25 08:56:54

by Stephen R. van den Berg

[permalink] [raw]
Subject: Re: Fw: Deadlock regression in v2.6.31.6

> The problem vanishes as soon as I run v2.6.31.5 (neither kernel contains
> any significant modules).

I did a bisect, and it turns out that the problem is there in 2.6.31.5 as well.
The traces are still valid. This is on an NFS mounted root partition
(NFSv3 over TCP), no other filesystems mounted (except a tmpfs here or
there). I turned on some debugging in net/sunrpc/sched.c, and the
following happens when I execute "apt-get --reinstall install man-db"
(it happens everytime, so it is very reproducible):

RPC: 9697 setting alarm for 60000 ms
RPC: 9697 __rpc_wake_up_task (now 7827)
RPC: 9697 disabling timer
RPC: 9697 removed from queue cfa72d88 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 9697 __rpc_execute flags=0x1 cf849c44
RPC: 9697 sleep_on(queue "xprt_pending" time 7828)
RPC: 9697 added to queue cfa72d88 "xprt_pending"
RPC: 9697 setting alarm for 60000 ms
RPC: 9697 __rpc_wake_up_task (now 7830)
RPC: 9697 disabling timer
RPC: 9697 removed from queue cfa72d88 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 9697 __rpc_execute flags=0x1 cf849c44
RPC: 9697 sleep_on(queue "xprt_pending" time 7831)
RPC: 9697 added to queue cfa72d88 "xprt_pending"
RPC: 9697 setting alarm for 60000 ms
RPC: 9697 __rpc_wake_up_task (now 7833)
RPC: 9697 disabling timer
RPC: 9697 removed from queue cfa72d88 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 9697 __rpc_execute flags=0x1 cf849c44
RPC: 9697 sleep_on(queue "xprt_pending" time 7835)
RPC: 9697 added to queue cfa72d88 "xprt_pending"
RPC: 9697 setting alarm for 60000 ms
RPC: 9697 __rpc_wake_up_task (now 7836)
RPC: 9697 disabling timer
RPC: 9697 removed from queue cfa72d88 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 9697 __rpc_execute flags=0x1 cf849c44
RPC: 9697 sleep_on(queue "xprt_pending" time 7838)
RPC: 9697 added to queue cfa72d88 "xprt_pending"
RPC: 9697 setting alarm for 60000 ms
RPC: 9697 __rpc_wake_up_task (now 7839)
RPC: 9697 disabling timer
RPC: 9697 removed from queue cfa72d88 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 9697 __rpc_execute flags=0x1 cf849c44
RPC: 9697 sleep_on(queue "xprt_pending" time 7841)
RPC: 9697 added to queue cfa72d88 "xprt_pending"
RPC: 9697 setting alarm for 60000 ms
RPC: 9697 __rpc_wake_up_task (now 7842)
RPC: 9697 disabling timer
RPC: 9697 removed from queue cfa72d88 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 9697 __rpc_execute flags=0x1 cf849c44
RPC: 9697 sleep_on(queue "xprt_pending" time 7844)
RPC: 9697 added to queue cfa72d88 "xprt_pending"
RPC: 9697 setting alarm for 60000 ms
RPC: 9697 __rpc_wake_up_task (now 7845)
RPC: 9697 disabling timer
RPC: 9697 removed from queue cfa72d88 "xprt_pending"
RPC: __rpc_wake_up_task done

Ad infinitum.
The cf849c44 is the task parameter which I printed as well.
It looks like an endless loop in the statemachine.
The kernel hangs at this point, the only way to get out of there is
using SysBreak.
I tried debugging it further, but I got lost in the statemachine (I think).
--
Sincerely,
Stephen R. van den Berg.

2009-11-25 09:00:18

by Stephen R. van den Berg

[permalink] [raw]
Subject: Re: Fw: Deadlock regression in v2.6.31.6

On Wed, Nov 25, 2009 at 09:56, Stephen R. van den Berg <[email protected]> wrote:
> (it happens everytime, so it is very reproducible):

If anyone wants more debugging info or wants me to try a fix, just let
me know, the problem can be reproduced at will and I have the kernel
source ready for modifications. The last trace was relative to
2.6.31.6. It did not trigger any DO_BUG() checks (I enabled
RPC_DEBUG).
--
Sincerely,
Stephen R. van den Berg.

2009-11-25 14:31:54

by Trond Myklebust

[permalink] [raw]
Subject: Re: Fw: Deadlock regression in v2.6.31.6

On Wed, 2009-11-25 at 09:56 +0100, Stephen R. van den Berg wrote:
> > The problem vanishes as soon as I run v2.6.31.5 (neither kernel contains
> > any significant modules).
>
> I did a bisect, and it turns out that the problem is there in 2.6.31.5 as well.

This makes sense. There have been no RPC level changes between 2.6.31.5
and 2.6.31.6.

> The traces are still valid. This is on an NFS mounted root partition
> (NFSv3 over TCP), no other filesystems mounted (except a tmpfs here or
> there). I turned on some debugging in net/sunrpc/sched.c, and the
> following happens when I execute "apt-get --reinstall install man-db"
> (it happens everytime, so it is very reproducible):
>
> RPC: 9697 setting alarm for 60000 ms
> RPC: 9697 __rpc_wake_up_task (now 7827)
> RPC: 9697 disabling timer
> RPC: 9697 removed from queue cfa72d88 "xprt_pending"
> RPC: __rpc_wake_up_task done
> RPC: 9697 __rpc_execute flags=0x1 cf849c44
> RPC: 9697 sleep_on(queue "xprt_pending" time 7828)
> RPC: 9697 added to queue cfa72d88 "xprt_pending"
> RPC: 9697 setting alarm for 60000 ms
> RPC: 9697 __rpc_wake_up_task (now 7830)
> RPC: 9697 disabling timer
> RPC: 9697 removed from queue cfa72d88 "xprt_pending"
> RPC: __rpc_wake_up_task done
> RPC: 9697 __rpc_execute flags=0x1 cf849c44
> RPC: 9697 sleep_on(queue "xprt_pending" time 7831)
> RPC: 9697 added to queue cfa72d88 "xprt_pending"
> RPC: 9697 setting alarm for 60000 ms
> RPC: 9697 __rpc_wake_up_task (now 7833)
> RPC: 9697 disabling timer
> RPC: 9697 removed from queue cfa72d88 "xprt_pending"
> RPC: __rpc_wake_up_task done
> RPC: 9697 __rpc_execute flags=0x1 cf849c44
> RPC: 9697 sleep_on(queue "xprt_pending" time 7835)
> RPC: 9697 added to queue cfa72d88 "xprt_pending"
> RPC: 9697 setting alarm for 60000 ms
> RPC: 9697 __rpc_wake_up_task (now 7836)
> RPC: 9697 disabling timer
> RPC: 9697 removed from queue cfa72d88 "xprt_pending"
> RPC: __rpc_wake_up_task done
> RPC: 9697 __rpc_execute flags=0x1 cf849c44
> RPC: 9697 sleep_on(queue "xprt_pending" time 7838)
> RPC: 9697 added to queue cfa72d88 "xprt_pending"
> RPC: 9697 setting alarm for 60000 ms
> RPC: 9697 __rpc_wake_up_task (now 7839)
> RPC: 9697 disabling timer
> RPC: 9697 removed from queue cfa72d88 "xprt_pending"
> RPC: __rpc_wake_up_task done
> RPC: 9697 __rpc_execute flags=0x1 cf849c44
> RPC: 9697 sleep_on(queue "xprt_pending" time 7841)
> RPC: 9697 added to queue cfa72d88 "xprt_pending"
> RPC: 9697 setting alarm for 60000 ms
> RPC: 9697 __rpc_wake_up_task (now 7842)
> RPC: 9697 disabling timer
> RPC: 9697 removed from queue cfa72d88 "xprt_pending"
> RPC: __rpc_wake_up_task done
> RPC: 9697 __rpc_execute flags=0x1 cf849c44
> RPC: 9697 sleep_on(queue "xprt_pending" time 7844)
> RPC: 9697 added to queue cfa72d88 "xprt_pending"
> RPC: 9697 setting alarm for 60000 ms
> RPC: 9697 __rpc_wake_up_task (now 7845)
> RPC: 9697 disabling timer
> RPC: 9697 removed from queue cfa72d88 "xprt_pending"
> RPC: __rpc_wake_up_task done
>
> Ad infinitum.
> The cf849c44 is the task parameter which I printed as well.
> It looks like an endless loop in the statemachine.
> The kernel hangs at this point, the only way to get out of there is
> using SysBreak.
> I tried debugging it further, but I got lost in the statemachine (I think).

This just means that the RPC client is waiting for a reply from the NFS
server.

Does 'netstat -t' show that there is an active TCP connection to the
server's nfs port?
Does wireshark show that the client should have received a reply?

Trond