2005-05-08 11:33:53

by Haakon Riiser

Subject: read(2) hangs on the client side

I have noticed that there is a specific case that can lock up the
client while doing read(2), and it seems to be a race condition that
only occurs in a very specific situation. I will try to describe
this in as much detail as possible, since it is unlikely that you'll
be able to reproduce the bug for yourselves:

A big file (~ 250 MB) is shared on the NFS server. The NFS server
also acts as a Samba server, so that Windows machines can use it.
One of the Windows machines is running the eMule P2P client,
and it makes some of the Samba-hosted files available on the
eMule network.

The hang has _only_ happened when I try to access a shared file
via NFS while it is _simultaneously_ being accessed (via Samba)
by the eMule machine. To reproduce the hang from the NFS client's
side, I use this C program:

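#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Repeatedly open the file named in argv[1], read its first 4 KiB, close it. */
int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    for (;;) {
        char buf[4096];
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        read(fd, buf, sizeof buf);  /* the hang occurs in this call */
        close(fd);
    }
    return 0;
}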

It always hangs in the read() call, usually in the first iteration,
and after it does, _nothing_ can kill it, not even SIGKILL,
and until the next reboot, _any_ NFS operation -- even stat() --
on the accessed file will now hang.

I have no idea what Samba + eMule's access patterns look like,
but I know with 100 % certainty that it is the cause. If I move
the file in question out of eMule's shared directory, I can never
hang the NFS client no matter how long I run the above program.
If I move it back in, it hangs almost instantly. :-(

Note that only the Linux NFS client machine is seemingly affected by
this -- both the NFS/Samba server and the eMule-running Samba client
are doing just fine while the hang happens on the Linux client.

Some system info:

NFS client:
Slackware 10.1
Linux 2.6.11
nfs-utils 1.0.7
glibc 2.3.4
util-linux 2.12p

NFS server:
Fedora Core 3 (fully updated)
Linux 2.6.11-1.14_FC3
nfs-utils-1.0.6-52
glibc-2.3.5-0.fc3.1
Samba 3.0.15pre2-1
util-linux-2.12a-24.2

Any help in further analysis would be greatly appreciated!

--
Haakon


-------------------------------------------------------
This SF.Net email is sponsored by: NEC IT Guy Games.
Get your fingers limbered up and give it your best shot. 4 great events, 4
opportunities to win big! Highest score wins.NEC IT Guy Games. Play to
win an NEC 61 plasma display. Visit http://www.necitguy.com/?r=20
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-05-14 22:39:19

by Haakon Riiser

Subject: Re: read(2) hangs on the client side

Trond,

>> Should I run tcpdump on the server or the client, or perhaps it
>> doesn't matter?
>
> Please try on the client first.

Done. I tcpdump'd all data (-s0) to file, and opened it up in
Ethereal to make some sense of it. I think these packets are the
ones of interest to you: (10.0.0.2 is the NFS server and 10.0.0.4
is the NFS client)

-------------------------------------------------------------------------------
No. Time Source Destination Protocol Info
51 8.020185 10.0.0.4 10.0.0.2 NFS V3 READ Call (Reply In 52), FH:0x2322d19a Offset:0 Len:4096

Frame 51 (190 bytes on wire, 190 bytes captured)
Ethernet II, Src: 00:01:02:fa:af:5f, Dst: 00:50:04:55:8c:44
Internet Protocol, Src Addr: 10.0.0.4 (10.0.0.4), Dst Addr: 10.0.0.2 (10.0.0.2)
Transmission Control Protocol, Src Port: 799 (799), Dst Port: nfsd (2049), Seq: 2048, Ack: 6256, Len: 124
Remote Procedure Call, Type:Call XID:0x70cf1c9d
Network File System, READ Call FH:0x2322d19a Offset:0 Len:4096
Program Version: 3
V3 Procedure: READ (6)
file
offset: 0
count: 4096

No. Time Source Destination Protocol Info
52 8.020731 10.0.0.2 10.0.0.4 NFS V3 READ Reply (Call In 51) Error:Unknown error:10008

Frame 52 (186 bytes on wire, 186 bytes captured)
Ethernet II, Src: 00:50:04:55:8c:44, Dst: 00:01:02:fa:af:5f
Internet Protocol, Src Addr: 10.0.0.2 (10.0.0.2), Dst Addr: 10.0.0.4 (10.0.0.4)
Transmission Control Protocol, Src Port: nfsd (2049), Dst Port: 799 (799), Seq: 6256, Ack: 2172, Len: 120
Remote Procedure Call, Type:Reply XID:0x70cf1c9d
Network File System, READ Reply Error:Unknown error:10008
Program Version: 3
V3 Procedure: READ (6)
Status: ERR_JUKEBOX (10008)
file_attributes

No. Time Source Destination Protocol Info
53 8.059968 10.0.0.4 10.0.0.2 TCP 799 > nfsd [ACK] Seq=2172 Ack=6376 Win=17376 Len=0 TSV=4294774881 TSER=1928447303

Frame 53 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: 00:01:02:fa:af:5f, Dst: 00:50:04:55:8c:44
Internet Protocol, Src Addr: 10.0.0.4 (10.0.0.4), Dst Addr: 10.0.0.2 (10.0.0.2)
Transmission Control Protocol, Src Port: 799 (799), Dst Port: nfsd (2049), Seq: 2172, Ack: 6376, Len: 0
-------------------------------------------------------------------------------

As you can see in the above, I'm running NFS over TCP, but I have
also tried UDP, and it still hangs. Anyway, have you seen this
before? ERR_JUKEBOX (10008) has only three hits on Google, and all
of them point to the patches to Ethereal that made it recognize this
error. :-)

--
Haakon


-------------------------------------------------------
This SF.Net email is sponsored by Oracle Space Sweepstakes
Want to be the first software developer in space?
Enter now for the Oracle Space Sweepstakes!
http://ads.osdn.com/?ad_id=7393&alloc_id=16281&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2005-05-14 23:38:32

by Trond Myklebust

Subject: Re: read(2) hangs on the client side

On Sun, 15.05.2005 at 00:39 (+0200), Haakon Riiser wrote:
> Frame 52 (186 bytes on wire, 186 bytes captured)
> Ethernet II, Src: 00:50:04:55:8c:44, Dst: 00:01:02:fa:af:5f
> Internet Protocol, Src Addr: 10.0.0.2 (10.0.0.2), Dst Addr: 10.0.0.4 (10.0.0.4)
> Transmission Control Protocol, Src Port: nfsd (2049), Dst Port: 799 (799), Seq: 6256, Ack: 2172, Len: 120
> Remote Procedure Call, Type:Reply XID:0x70cf1c9d
> Network File System, READ Reply Error:Unknown error:10008
> Program Version: 3
> V3 Procedure: READ (6)
> Status: ERR_JUKEBOX (10008)
> file_attributes
>
> No. Time Source Destination Protocol Info
> 53 8.059968 10.0.0.4 10.0.0.2 TCP 799 > nfsd [ACK] Seq=2172 Ack=6376 Win=17376 Len=0 TSV=4294774881 TSER=1928447303
>
> Frame 53 (66 bytes on wire, 66 bytes captured)
> Ethernet II, Src: 00:01:02:fa:af:5f, Dst: 00:50:04:55:8c:44
> Internet Protocol, Src Addr: 10.0.0.4 (10.0.0.4), Dst Addr: 10.0.0.2 (10.0.0.2)
> Transmission Control Protocol, Src Port: 799 (799), Dst Port: nfsd (2049), Seq: 2172, Ack: 6376, Len: 0
> -------------------------------------------------------------------------------
>
> As you can see in the above, I'm running NFS over TCP, but I have
> also tried UDP, and it still hangs. Anyway, have you seen this
> before? ERR_JUKEBOX (10008) has only three hits on Google, and all
> of them point to the patches to Ethereal that made it recognize this
> error. :-)

Why didn't you try the NFSv3 RFC? 8-)

NFS3ERR_JUKEBOX
The server initiated the request, but was not able to
complete it in a timely fashion. The client should wait
and then try the request with a new RPC transaction ID.
For example, this error should be returned from a server
that supports hierarchical storage and receives a request
to process a file that has been migrated. In this case,
the server should start the immigration process and
respond to client with this error.

In other words, this is an error used by the server to say "I'm busy,
and cannot retrieve the data you want. Please try again later".
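
The retry behaviour the RFC text describes can be sketched in C roughly as follows. This is only an illustration of "wait, then retry with a new transaction ID", not the Linux NFS client's actual code: the names read_with_jukebox_retry, send_read_fn, wait_fn and fake_server are hypothetical, and only the status value 10008 is taken from the capture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFS3_OK          0
#define NFS3ERR_JUKEBOX  10008   /* status value seen in frame 52 */

/* Hypothetical transport hook: sends one READ with the given XID and
 * returns the NFSv3 status found in the reply. */
typedef int (*send_read_fn)(uint32_t xid, void *ctx);

/* Retry a READ for as long as the server answers NFS3ERR_JUKEBOX, using a
 * fresh XID for every attempt. wait_fn (may be NULL) is called between
 * attempts so the caller decides how long to back off. Returns the last
 * status and stores the number of attempts made in *attempts. */
static int read_with_jukebox_retry(send_read_fn send_read, void *ctx,
                                   uint32_t first_xid, int max_attempts,
                                   void (*wait_fn)(int attempt),
                                   int *attempts)
{
    int status = NFS3ERR_JUKEBOX;
    int n = 0;

    assert(send_read != NULL && max_attempts > 0);

    while (n < max_attempts) {
        status = send_read(first_xid + (uint32_t)n, ctx);  /* new XID each time */
        n++;
        if (status != NFS3ERR_JUKEBOX)
            break;                      /* success or a different error */
        if (wait_fn != NULL && n < max_attempts)
            wait_fn(n);                 /* back off before the next try */
    }
    if (attempts != NULL)
        *attempts = n;
    return status;
}

/* Stand-in for a server that is busy for the first *ctx calls. */
static int fake_server(uint32_t xid, void *ctx)
{
    int *busy = ctx;
    (void)xid;
    if (*busy > 0) {
        (*busy)--;
        return NFS3ERR_JUKEBOX;
    }
    return NFS3_OK;
}
```

With fake_server set to be busy twice, the helper returns NFS3_OK on the third attempt. If the server never stops answering JUKEBOX, the helper gives up after max_attempts and hands the error back to the caller rather than blocking forever.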

Are you absolutely sure that you managed to disable oplocks on the samba
server? 'cos this is the error an NFS server will usually return when it
is waiting for the samba server (or an NFSv4 server) to break the lease
on a file so that it can be opened.

Cheers,
Trond




2005-05-15 06:21:45

by Haakon Riiser

Subject: Re: read(2) hangs on the client side

Trond,

> Why didn't you try the NFSv3 RFC? 8-)
>
> NFS3ERR_JUKEBOX
> The server initiated the request, but was not able to
> complete it in a timely fashion. The client should wait
> and then try the request with a new RPC transaction ID.
> For example, this error should be returned from a server
> that supports hierarchical storage and receives a request
> to process a file that has been migrated. In this case,
> the server should start the immigration process and
> respond to client with this error.
>
> In other words, this is an error used by the server to say "I'm busy,
> and cannot retrieve the data you want. Please try again later".
>
> Are you absolutely sure that you managed to disable oplocks on the samba
> server? 'cos this is the error an NFS server will usually return when it
> is waiting for the samba server (or an NFSv4 server) to break the lease
> on a file so that it can be opened.

I did set 'kernel oplocks = no' globally and restarted Samba, but I
didn't look for locked files with smbstatus. I'll try again and
investigate a little more this time. In the meantime, here's what
smbstatus shows:

Locked files:
Pid    DenyMode   Access   R/W     Oplock           Name
--------------------------------------------------------
32475  DENY_NONE  0x20089  RDONLY  EXCLUSIVE+BATCH  foo   Sun May 15 08:13:23 2005

This lock goes on and off all the time, which probably explains why
this error doesn't happen every single time I try to open the file.
But why does ERR_JUKEBOX cause the client to hang? This lock _is_
soon released, so shouldn't NFS wake up once that happens?

--
Haakon



2005-05-15 06:40:29

by Trond Myklebust

Subject: Re: read(2) hangs on the client side

On Sun, 15.05.2005 at 08:21 (+0200), Haakon Riiser wrote:
> I did set 'kernel oplocks = no' globally and restarted Samba, but I
> didn't look for locked files with smbstatus. I'll try again and
> investigate a little more this time. In the meantime, here's what
> smbstatus shows:
>
> Locked files:
> Pid    DenyMode   Access   R/W     Oplock           Name
> --------------------------------------------------------
> 32475  DENY_NONE  0x20089  RDONLY  EXCLUSIVE+BATCH  foo   Sun May 15 08:13:23 2005
>
> This lock goes on and off all the time, which probably explains why
> this error doesn't happen every single time I try to open the file.
> But why does ERR_JUKEBOX cause the client to hang? This lock _is_
> soon released, so shouldn't NFS wake up once that happens?

I suspect there were hanging locks. Some of those lease code changes
look pretty suspicious, so I think I want to have a chat with Andy & co
as soon as I get back to Michigan.

In the mean time, please try rebooting the server, and just turn off the
lease code using 'sysctl -w fs.leases-enable=0'.
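
To confirm that the sysctl took effect, the same setting can be read back through /proc/sys/fs/leases-enable, which is the file behind fs.leases-enable. The helpers below are a minimal sketch; the names parse_leases_enable and leases_enabled are made up for illustration.

```c
#include <stdio.h>

/* Interpret the text of the leases-enable file.
 * Returns 1 if leases are enabled, 0 if disabled, -1 if unparsable. */
static int parse_leases_enable(const char *text)
{
    int value;
    if (sscanf(text, "%d", &value) != 1)
        return -1;
    return value != 0;
}

/* Read the setting from `path` (normally /proc/sys/fs/leases-enable).
 * Returns 1, 0, or -1 if the file cannot be read or parsed. */
static int leases_enabled(const char *path)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_leases_enable(buf);
}
```

Calling leases_enabled("/proc/sys/fs/leases-enable") on the server after running the sysctl command should then return 0.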

Cheers,
Trond





2005-05-15 06:55:04

by Haakon Riiser

Subject: Re: read(2) hangs on the client side

Trond,

I actually managed to fix the hang with a third oplock option in
smb.conf that I missed before: 'oplocks = no'. The last time, I
tried toggling fake oplocks and kernel oplocks, but neither of those
made any difference. The plain 'oplocks' parameter did though!
I've been running my test program for almost 25 minutes, and still
no hang.
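
For reference, the relevant part of smb.conf then looks roughly like this. It is a sketch of the settings named in this thread only; the placement under [global] and the comments are an assumption about the layout, not a copy of the actual file.

```ini
[global]
    # The setting that stopped the hangs
    oplocks = no
    # Set earlier; on its own it made no difference
    kernel oplocks = no
```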

Thanks for your persistence in tracking this down. I'll have
oplocks disabled for now.

> I suspect there were hanging locks. Some of those lease code changes
> look pretty suspicious, so I think I want to have a chat with Andy & co
> as soon as I get back to Michigan.

So this is, after all, a bug in NFS? Strange that it hasn't hit
more people. Searching for similar problems on Google Groups turned
up nothing.

--
Haakon



2005-05-15 07:00:31

by Trond Myklebust

Subject: Re: read(2) hangs on the client side

On Sun, 15.05.2005 at 08:54 (+0200), Haakon Riiser wrote:

> > I suspect there were hanging locks. Some of those lease code changes
> > look pretty suspicious, so I think I want to have a chat with Andy & co
> > as soon as I get back to Michigan.
>
> So this is, after all, a bug in NFS? Strange that it hasn't hit
> more people. Searching for similar problems on Google Groups turned
> up nothing.

I suspect rather that it is a bug in the VFS' lease code. That code was
jiggled around a bit during the NFSv4 delegation code development.

Cheers,
Trond




2005-05-08 12:54:42

by Trond Myklebust

Subject: Re: read(2) hangs on the client side

On Sun, 08.05.2005 at 13:33 (+0200), Haakon Riiser wrote:

> I have no idea what Samba + eMule's access patterns look like,
> but I know with 100 % certainty that it is the cause. If I move
> the file in question out of eMule's shared directory, I can never
> hang the NFS client no matter how long I run the above program.
> If I move it back in, it hangs almost instantly. :-(

Are oplocks enabled on the Samba server?

Cheers,
Trond




2005-05-08 13:21:43

by Haakon Riiser

Subject: Re: read(2) hangs on the client side

Trond,

> On Sun, 08.05.2005 at 13:33 (+0200), Haakon Riiser wrote:
>
> > I have no idea what Samba + eMule's access patterns look like,
> > but I know with 100 % certainty that it is the cause. If I move
> > the file in question out of eMule's shared directory, I can never
> > hang the NFS client no matter how long I run the above program.
> > If I move it back in, it hangs almost instantly. :-(
>
> Are oplocks enabled on the Samba server?

Probably. I haven't touched the oplock setting, but according
to smb.conf(5), kernel oplocks are enabled by default, and they
translate to no-ops when the kernel doesn't support them. How do
I determine if I have kernel support?

--
Haakon



2005-05-08 13:59:42

by Trond Myklebust

Subject: Re: read(2) hangs on the client side

On Sun, 08.05.2005 at 15:21 (+0200), Haakon Riiser wrote:

> > Are oplocks enabled on the Samba server?
>
> Probably. I haven't touched the oplock setting, but according
> to smb.conf(5), kernel oplocks are enabled by default, and they
> translate to no-ops when the kernel doesn't support them. How do
> I determine if I have kernel support?

Kernel support for oplocks has been in Linux since version 2.4 - see the
"lease" locks in the fcntl manpage. They went unmaintained for quite a
while, though, until Andy started working on the NFSv4 server
delegation code (in 2.6.11 or so).
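
For anyone who has not used them, taking a lease from user space looks roughly like the sketch below. F_SETLEASE, F_GETLEASE, F_RDLCK and F_UNLCK are the real fcntl(2) interface; the helper names take_read_lease, lease_type and drop_lease are made up for illustration. While the lease is held, a conflicting open by another process makes the kernel signal the holder (SIGIO by default) so it can release the lease.

```c
#define _GNU_SOURCE   /* F_SETLEASE and F_GETLEASE are Linux-specific */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open `path` read-only and take a read lease on it.
 * Returns the descriptor holding the lease, or -1 with errno set. */
static int take_read_lease(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {
        int saved = errno;   /* keep the fcntl error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Report the lease currently held via fd: F_RDLCK, F_WRLCK or F_UNLCK. */
static int lease_type(int fd)
{
    return fcntl(fd, F_GETLEASE);
}

/* Give the lease up again; returns 0 on success, -1 on error. */
static int drop_lease(int fd)
{
    return fcntl(fd, F_SETLEASE, F_UNLCK);
}
```

Whether F_SETLEASE succeeds depends on file ownership, on whether anyone has the file open for writing, and on the filesystem, so callers should expect failures and check errno.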

Anyhow, try turning oplocks off in smb.conf and see if the hangs
disappear.

Cheers,
Trond




2005-05-08 14:38:04

by Haakon Riiser

Subject: Re: read(2) hangs on the client side

Trond,

> Anyhow, try turning oplocks off in smb.conf and see if the hangs
> disappear.

Still hangs. I also tried enabling fake oplocks, but that didn't
make any difference either. Just so you know: I made sure to stop
all Samba processes (service smb stop) so I /know/ that the new
oplock settings were in effect.

--
Haakon



2005-05-09 10:38:48

by Trond Myklebust

Subject: Re: read(2) hangs on the client side

On Sun, 08.05.2005 at 16:37 (+0200), Haakon Riiser wrote:
> Trond,
>
> > Anyhow, try turning oplocks off in smb.conf and see if the hangs
> > disappear.
>
> Still hangs. I also tried enabling fake oplocks, but that didn't
> make any difference either. Just so you know: I made sure to stop
> all Samba processes (service smb stop) so I /know/ that the new
> oplock settings were in effect.

Do you have a tcpdump of the traffic?

Cheers,
Trond




2005-05-09 14:13:58

by Haakon Riiser

Subject: Re: read(2) hangs on the client side

Trond,

> Do you have a tcpdump of the traffic?

Should I run tcpdump on the server or the client, or perhaps it
doesn't matter?

--
Haakon



2005-05-11 14:24:03

by Trond Myklebust

Subject: Re: read(2) hangs on the client side

On Mon, 09.05.2005 at 16:13 (+0200), Haakon Riiser wrote:
> Trond,
>
> > Do you have a tcpdump of the traffic?
>
> Should I run tcpdump on the server or the client, or perhaps it
> doesn't matter?

Please try on the client first.

Cheers,
Trond


