2002-04-07 16:58:21

by Steven N. Hirsch

Subject: 2.4.19-pre5-ac3 NFS problems

All,

I'm not sure exactly what has been integrated into Alan's pre5-ac3 kernel,
but there are serious problems with NFS over TCP. Twice in a row I've had
locked processes on the client when attempting to lock a mail spool on the
server. Required reboot on both ends to clear :-(.

FWIW, I had been running for almost a month prior with 2.4.19-pre2 +
Trond's 2.4.18_NFS_ALL using NFS over TCP and saw no problems. I moved to
the new kernel ONLY on the server. After reverting back, all seems stable
again.

What is the current status of the various 2.4.x patches floating around?

Steve



_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2002-04-08 02:34:53

by NeilBrown

Subject: Re: 2.4.19-pre5-ac3 NFS problems

On Sunday April 7, [email protected] wrote:
> All,
>
> I'm not sure exactly what has been integrated into Alan's pre5-ac3 kernel,
> but there are serious problems with NFS over TCP. Twice in a row I've had
> locked processes on the client when attempting to lock a mail spool on the
> server. Required reboot on both ends to clear :-(.
>
> FWIW, I had been running for almost a month prior with 2.4.19-pre2 +
> Trond's 2.4.18_NFS_ALL using NFS over TCP and saw no problems. I moved to
> the new kernel ONLY on the server. After reverting back, all seems stable
> again.
>
> What is the current status of the various 2.4.x patches floating around?

2.4.19-pre5-ac3 has my TCP (and SMP) patches that are in 2.5, but
aren't ready for 2.4.real yet as they haven't had enough
testing... thanks for doing some testing.

How repeatable is the problem? Simple locking seems to work for me,
so presumably it is some particular combination or load..

Are you in a position to get it to fail again, or would that be
inconvenient?

I am interested to know if
"netstat -t"
shows anything on the input queue for the lockd connection: quite
possibly the connection to port 32768.

The only change that I can imagine might cause the client to hang is
the flow control that I added to the RPC layer: It won't accept a
request unless it is sure there will be room on the output queue for
the response.
For lockd, it makes extremely large estimates for the response size (I
was a bit lazy) which shouldn't be a problem except that it might slow
down lock requests if there are lots and lots of them, but maybe it
is.
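
As a rough illustration of the reservation scheme described above, here is
a toy model in Python. It is not the kernel code: the class OutputQueue,
its fields, and the byte counts are all invented for this example, and the
real RPC layer may account for space differently.

```python
class OutputQueue:
    """Toy model of a send queue that reserves space per accepted request.

    Hypothetical names throughout; this only mirrors the idea of
    "don't accept a request unless the reply is sure to fit".
    """

    def __init__(self, capacity):
        self.capacity = capacity  # total bytes the output queue can hold
        self.queued = 0           # bytes of replies waiting to be sent
        self.reserved = 0         # bytes promised to requests in progress

    def free_space(self):
        return self.capacity - self.queued - self.reserved

    def try_accept(self, estimated_reply_size):
        """Accept a request only if its estimated reply would still fit."""
        if estimated_reply_size > self.free_space():
            return False
        self.reserved += estimated_reply_size
        return True

    def send_reply(self, estimate, actual_size):
        """Release the reservation and queue the reply that was produced."""
        self.reserved -= estimate
        self.queued += actual_size

    def drain(self, nbytes):
        """Simulate the network taking up to nbytes off the queue."""
        self.queued -= min(nbytes, self.queued)


if __name__ == "__main__":
    q = OutputQueue(capacity=8192)
    # Over-large estimates: only two requests fit, however small the replies.
    print([q.try_accept(4096) for _ in range(3)])
```

With a 4096-byte estimate against an 8192-byte queue, only two requests can
be in flight at once even if the real replies are tiny, and an estimate
larger than the whole queue would never be accepted at all.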

Would you be able to try a patch that makes more realistic estimates
of lockd response sizes?

Are you using NFSv2 or NFSv3?

NeilBrown



2002-04-08 11:04:45

by Steven N. Hirsch

Subject: Re: 2.4.19-pre5-ac3 NFS problems

On Mon, 8 Apr 2002, Neil Brown wrote:

> On Sunday April 7, [email protected] wrote:
> > All,
> >
> > I'm not sure exactly what has been integrated into Alan's pre5-ac3 kernel,
> > but there are serious problems with NFS over TCP. Twice in a row I've had
> > locked processes on the client when attempting to lock a mail spool on the
> > server. Required reboot on both ends to clear :-(.
> >
> > FWIW, I had been running for almost a month prior with 2.4.19-pre2 +
> > Trond's 2.4.18_NFS_ALL using NFS over TCP and saw no problems. I moved to
> > the new kernel ONLY on the server. After reverting back, all seems stable
> > again.
> >
> > What is the current status of the various 2.4.x patches floating around?
>
> 2.4.19-pre5-ac3 has my TCP (and SMP) patches that are in 2.5, but
> aren't ready for 2.4.real yet as they haven't had enough
> testing... thanks for doing some testing.
>
> How repeatable is the problem? Simple locking seems to work for me,
> so presumably it is some particular combination or load..

It seems fairly easy to trip. I was able to hang it two or three times in
a row by simply attempting to open a non-default mail folder with pine.
Pine relies (I think) on trickery with lock files, rather than flock().

> Are you in a position to get it to fail again, or would that be
> inconvenient?

No problem. I'll try to make some time this evening for testing.

> I am interested to know if
> "netstat -t"
> shows anything on the input queue for the lockd connection: quite
> possibly the connection to port 32768.
>
> The only change that I can imagine might cause the client to hang is
> the flow control that I added to the RPC layer: It won't accept a
> request unless it is sure there will be room on the output queue for
> the response.
> For lockd, it makes extremely large estimates for the response size (I
> was a bit lazy) which shouldn't be a problem except that it might slow
> down lock requests if there are lots and lots of them, but maybe it
> is.
>
> Would you be able to try a patch that makes more realistic estimates
> of lockd response sizes?
>
> Are you using NFSv2 or NFSv3?

This was with v3 mounts. Also, client was 2.4.19-pre2 + Trond's 2.4.18
NFS_ALL patch. The _server_ was using ac3.

Steve



2002-04-11 11:30:54

by Steven N. Hirsch

Subject: Re: 2.4.19-pre5-ac3 NFS problems

On Mon, 8 Apr 2002, Neil Brown wrote:

> On Sunday April 7, [email protected] wrote:

> > I'm not sure exactly what has been integrated into Alan's pre5-ac3 kernel,
> > but there are serious problems with NFS over TCP. Twice in a row I've had
> > locked processes on the client when attempting to lock a mail spool on the
> > server. Required reboot on both ends to clear :-(.
> >

> I am interested to know if
> "netstat -t"
> shows anything on the input queue for the lockd connection: quite
> possibly the connection to port 32768.

Neil,

I finally got the chance to bang on this. I'm not seeing anything
queueing on port 32768, but during the period when 'pine' hangs there
seems to be a large number of requests on port 32771 server-side.
At no point does the backlog fall below about 160. What should I be
seeing?

The problem is triggered 100% of the time when I try to open a secondary
mail folder on the server. It looks like a zero-length lock file is being
created, but the process simply hangs at that point. My incoming folder
is on the client and there are never any problems with that.

Another point: There is very little (to zero) activity on the network at
the time of the hangs.

> The only change that I can imagine might cause the client to hang is
> the flow control that I added to the RPC layer: It won't accept a
> request unless it is sure there will be room on the output queue for
> the response.
> For lockd, it makes extremely large estimates for the response size (I
> was a bit lazy) which shouldn't be a problem except that it might slow
> down lock requests if there are lots and lots of them, but maybe it
> is.
>
> Would you be able to try a patch that makes more realistic estimates
> of lockd response sizes?

Sure, fire away. This is all with NFSv3 over tcp.

Steve




2002-04-11 11:47:00

by Steven N. Hirsch

Subject: Re: 2.4.19-pre5-ac3 NFS problems

On Thu, 11 Apr 2002, Steven N. Hirsch wrote:

> On Mon, 8 Apr 2002, Neil Brown wrote:
>
> > On Sunday April 7, [email protected] wrote:
>
> > > I'm not sure exactly what has been integrated into Alan's pre5-ac3 kernel,
> > > but there are serious problems with NFS over TCP. Twice in a row I've had
> > > locked processes on the client when attempting to lock a mail spool on the
> > > server. Required reboot on both ends to clear :-(.
> > >
>
> > I am interested to know if
> > "netstat -t"
> > shows anything on the input queue for the lockd connection: quite
> > possibly the connection to port 32768.
>
> Neil,
>
> I finally got the chance to bang on this. I'm not seeing anything
> queueing on port 32768, but during the period when 'pine' hangs there
> seems to be a large number of requests on port 32771 server-side.
> At no point does the backlog fall below about 160. What should I be
> seeing?

Following up on my own followup.. When I restart the server with
2.4.19-pre2 + Trond's older TCP server patch (which I had been using for
some weeks incident-free), I'm never seeing anything backed up on high
port numbers. With your TCP server patch, it's permanently backlogged at
numbers which are never < 160.

Hopefully this provides a clue to the underlying problem.

Steve




2002-04-12 02:53:02

by NeilBrown

Subject: Re: 2.4.19-pre5-ac3 NFS problems

On Thursday April 11, [email protected] wrote:
>
> Neil,
>
> I finally got the chance to bang on this. I'm not seeing anything
> queueing on port 32768, but during the period when 'pine' hangs there
> seems to be a large number of requests on port 32771 server-side.
> At no point does the backlog fall below about 160. What should I be
> seeing?

I'm perplexed...
Is 32771 a local port or a remote port on the server?

What I would expect you to see on the server if you do
   netstat -t
is something like:
   Active Internet connections (w/o servers)
   Proto Recv-Q Send-Q Local Address           Foreign Address         State
   tcp        0      0 elfman.orchestra.c:2049 dulcimer.orchestra.:800 ESTABLISHED

The local address is "2049".
The remote address could be anything. My 2.4 Linux clients seem to
choose 800, but anything is reasonable.

I would expect to see the Recv-Q and Send-Q jump around a lot when IO
is happening, but settle to zero whenever it is quiet.
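
To make that check repeatable, a small Python helper along these lines
could pick out connections whose queues are not empty. This is only a
sketch: it assumes the six-column layout shown above, the function names
are made up, and the second row of the sample (port 32771, Recv-Q 160) is
an invented illustration rather than captured output.

```python
def parse_netstat_tcp(text):
    """Parse `netstat -t`-style text into a list of dicts, one per TCP row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with the protocol name and have six columns;
        # the banner and header lines are skipped by this check.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue
        rows.append({
            "recv_q": recv_q,
            "send_q": send_q,
            "local": fields[3],
            "foreign": fields[4],
            "state": fields[5],
        })
    return rows


def backlogged(rows, threshold=0):
    """Return rows whose Recv-Q or Send-Q is above the threshold."""
    return [r for r in rows
            if r["recv_q"] > threshold or r["send_q"] > threshold]


# First data row is the healthy example above; the second is hypothetical.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 elfman.orchestra.c:2049 dulcimer.orchestra.:800 ESTABLISHED
tcp      160      0 server.example:32771    client.example:800      ESTABLISHED
"""

if __name__ == "__main__":
    for row in backlogged(parse_netstat_tcp(SAMPLE)):
        print(row["local"], row["foreign"], row["recv_q"], row["send_q"])
```

Running it a few times while the link is quiet would show whether any
connection stays stuck above zero instead of settling.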

If the connection that you are looking at is the NFS connection
(i.e. the other end is 2049) then the fact that it never gets below
160 suggests that I've got my accounting wrong somewhere.
The simplest option would be that sk_reserved goes negative, but I
cannot see what would cause that...

NeilBrown


2002-04-12 11:42:51

by Steven N. Hirsch

Subject: Re: 2.4.19-pre5-ac3 NFS problems

On Fri, 12 Apr 2002, Neil Brown wrote:

> On Thursday April 11, [email protected] wrote:
> >
> > Neil,
> >
> > I finally got the chance to bang on this. I'm not seeing anything
> > queueing on port 32768, but during the period when 'pine' hangs there
> > seems to be a large number of requests on port 32771 server-side.
> > At no point does the backlog fall below about 160. What should I be
> > seeing?
>
> I'm perplexed...
> Is 32771 a local port or a remote port on the server?

It's the local port on the server. The foreign end point is the client
with the hung 'pine' instance, port 799, 800, whatever. All I can say is
to reiterate: whatever is going on between client:800 and server:32771
looks different on a correctly-operating NFSv3 TCP connection - it always
shows up as zero. In the broken case, it never falls below 160 and
typically is up at 240.

> What I would expect you to see on the server if you do
>    netstat -t
> is something like:
>    Active Internet connections (w/o servers)
>    Proto Recv-Q Send-Q Local Address           Foreign Address         State
>    tcp        0      0 elfman.orchestra.c:2049 dulcimer.orchestra.:800 ESTABLISHED
>
> The local address is "2049".
> The remote address could be anything. My 2.4 Linux clients seem to
> choose 800, but anything is reasonable.
>
> I would expect to see the Recv-Q and Send-Q jump around a lot when IO
> is happening, but settle to zero whenever it is quiet.

That's why I flagged the '160' minimum value as being suspicious.

> If the connection that you are looking at is the NFS connection
> (i.e. the other end is 2049) then the fact that it never gets below
> 160 suggests that I've got my accounting wrong somewhere.
> The simplest option would be that sk_reserved goes negative, but I
> cannot see what would cause that...

It's definitely NOT the NFS connection. I can see that as well, and it
shows as zero.

Let me know what other information I can get for you.

Steve


