2003-11-27 12:01:24

by Douglas Furlong

[permalink] [raw]
Subject: NFS server not responding

Good day all.

I am running into excessive amounts of NFS errors, as below.

kernel: nfs: server neon not responding, still trying
kernel: nfs: server neon OK

I was hoping that some of you may be able to provide me with some
assistance.

First The Hardware
------------------
Neon: FileServer
Disks: 4x SATA connected to a HighPoint RAID controller. I am using their
drivers, but with Linux software RAID (md0); this stores the bulk of the
data.
1x ATA connected to the on-board IDE; this has the rest of the OS on it.
Network Card: 3c905 (more details can be obtained if needed).
OS: Red Hat 9 + all current updates + statd version 1.0.6 (from sf.net)
Authentication/User Details: Via an OpenLDAP server
Memory: 512MB
CPU: XP2800

Wibbit: Workstation
Disks: Normal ATA disk.
Network Card: 3c905 I believe.
OS: Fedora Core 1 (was previously Red Hat 9, suffering the same problems)
Authentication/User Details: Via an OpenLDAP server
Memory: 512MB
CPU: XP2200

Network: Switched 10/100. The file server is connected to an HP switch;
workstations connect to the HP switch via smaller 5-port switches.


The Software
------------

Server
------
A bit more about the software.

The server uses an LDAP server (on the same physical network, but a
separate IP network) to authenticate users' credentials. nscd is running
and working on this machine.
I have exported several directory structures, including home directories,
from this machine.

/etc/exports
/mnt/raid/ISO/ 192.168.0.1/255.255.255.0(ro,sync)
/mnt/raid/home 192.168.0.1/255.255.255.0(rw,sync)
/mnt/raid/Operations 192.168.0.1/255.255.255.0(rw,sync)
/mnt/raid/Systems 192.168.0.1/255.255.255.0(rw,sync)
/mnt/raid/CustomerServices 192.168.0.1/255.255.255.0(rw,sync)
/mnt/raid/cvs 192.168.0.1/255.255.255.0(rw,sync)
/opt 192.168.0.1/255.255.255.0(rw,sync)
# For testing using iozone
/mnt/raid/test 192.168.0.150(rw,sync,no_root_squash)
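As an aside, when this exports list changes, the running server can be told to re-read it without a full restart; a sketch using the standard exportfs tool from nfs-utils (the service name is a Red Hat-style assumption):

```shell
# Re-export everything in /etc/exports on the server (no client remount needed):
#   exportfs -ra
# Verify what is currently exported, with the effective options:
#   exportfs -v
# A full 'service nfs restart' is only needed if nfsd itself must be restarted.
```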

I have upgraded the version of statd due to a problem reported on a
newsgroup referring to a problem with Red Hat's patches. I am not sure if
it was causing the problem, but I was (am) running out of ideas. The
patch was with regard to statd dropping root privileges.

Clients
-------
All of my testing is being done from my client; however, I have about 16
Linux desktops with their home directories mounted off of Neon, and
numerous applications that are mounted off of Neon (oh, plus the data).

/etc/fstab
# NFS Mounts
neon:/mnt/raid/home /home nfs wsize=8192,rsize=8192,intr,hard 0 0
neon:/mnt/raid/ISO/ /mnt/neon/iso nfs wsize=8192,rsize=8192,intr,hard 0 0
neon:/opt /opt nfs wsize=8192,rsize=8192,intr,hard 0 0

# NFS Mount for testing
neon:/mnt/raid/test /mnt/neon/test nfs rw,hard,intr,rsize=8192,wsize=8192 0 0

I have started nfslock on both the clients and server, as well as nfs.

Usability
---------
When my users are working on their Linux machines, they notice
intermittent "freezing" from time to time: applications stop responding,
they are unable to switch desktops, or they get error messages from
Evolution saying it can't store data.
All of these freezes coincide with error messages like the below
appearing in /var/log/messages:
kernel: nfs: server neon not responding, still trying
kernel: nfs: server neon OK
The above can be repeated hundreds of times over the course of several
hours.

I had attempted to set up a network install of OpenOffice, but this
caused the machines to become 100% unusable due to OpenOffice tying up
the system. Setting the mount option to soft prevented this; however,
OpenOffice was then not usable (it would not start).

However, I am able to run Phoenix and aMSN off the NFS server, though I
do find at times that there is a delay opening/closing the browser. I
believe this is once again down to NFS timeouts.

Below is a cat of the nfsd file in /proc/net/rpc. I am not sure what the
th values should be, but I think those numbers are quite high.

[root@neon rpc]# cat nfsd
rc 70031 9018069 27954571
fh 10717 36541222 0 278580 494554
io 3860485896 4234117935
th 32 73218 6754.760 3694.770 2485.590 1861.300 1778.710 906.570 689.360
588.490 494.790 5316.810
ra 64 4680995 22399 14758 7499 4804 4549 2906 2844 2000 2174 306976
net 37042672 37042672 0 0
rpc 37042671 1 1 0 0
proc2 18 2 330 0 0 244 0 1306091 0 0 0 0 0 0 0 0 0 17 25
proc3 22 2 16164612 257385 4123444 1202703 5040 3745880 7412118 526581
2427 5126 108 398040 2342 350136 133820 68430 20129 37392 11528 0
1268719

Does anyone have any hints or suggestions that I could take away and
work with?

Cheers

doug



-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
Does SourceForge.net help you be more productive? Does it
help you create better code? SHARE THE LOVE, and help us help
YOU! Click Here: http://sourceforge.net/donate/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2003-11-27 16:30:53

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS server not responding

>>>>> " " == Douglas Furlong <[email protected]> writes:

> Good day all. I am running into excessive amounts of NFS
> errors, as below.

> kernel: nfs: server neon not responding, still trying
> kernel: nfs: server neon OK

Two suggestions:

1) Bump the number of threads on the server. 32 is probably a bit
low.

2) The value retrans=3 used as the default by the Linux 'mount'
program is rather low compared to that used on other OSes. I
suggest you bump it to at least 5 on all your clients.
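For anyone following along, the two changes Trond suggests would look roughly like this on a Red Hat 9-era setup (the file locations and the RPCNFSDCOUNT variable name are assumptions about the distro's init scripts, not something stated in this thread):

```shell
# Server: raise the number of knfsd threads from the default.
#   echo 'RPCNFSDCOUNT=64' >> /etc/sysconfig/nfs
#   service nfs restart
# Clients: raise retrans in the fstab mount options, e.g.:
#   neon:/mnt/raid/home  /home  nfs  rsize=8192,wsize=8192,hard,intr,retrans=5  0 0
```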

Cheers,
Trond



2003-11-27 19:07:23

by Douglas Furlong

[permalink] [raw]
Subject: Re: NFS server not responding

On Thu, 2003-11-27 at 16:30, Trond Myklebust wrote:
> >>>>> " " == Douglas Furlong <[email protected]> writes:
>
> > Good day all. I am running into excessive amounts of NFS
> > errors, as below.
>
> > kernel: nfs: server neon not responding, still trying
> > kernel: nfs: server neon OK
>
> Two suggestions:
>
> 1) Bump the number of threads on the server. 32 is probably a bit
> low.

I have upped this to 64 now. Is there a rule of thumb with regard to the
number of people connecting or the amount of system resources?

> 2) The value retrans=3 used as the default by the Linux 'mount'
> program is rather low compared to that used on other OSes. I
> suggest you bump it to at least 5 on all your clients.

I have increased the retrans value to 10 now. This appears to have
resolved the problems to a greater extent.

What sort of things would be causing the client to have to re-transmit
so often?

Client rpc stats:
calls retrans authrefrsh
3431978 15968 0

Do those retrans numbers, as a percentage of the calls, seem
appropriate?

Thanks for the tips.

Doug




2003-11-27 20:02:51

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS server not responding

>>>>> " " == Douglas Furlong <[email protected]> writes:

> What sort of things would be causing the client to have to
> re-transmit so often?

UDP is not a reliable transport protocol. Packets can get dropped by
switches, and by the server itself, in which case the client's only
option is to time out and retransmit.
TCP offers reliability, but at the price of a slight protocol
overhead.

> Client rpc stats:
> calls        retrans      authrefrsh
> 3431978      15968        0

> Those retrans numbers, as a percentage of the calls, does it
> seem appropriate?

Yep. 0.5% seems reasonable enough.
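The arithmetic behind that figure can be checked directly from the nfsstat numbers quoted above; a small awk sketch:

```shell
# Retransmissions as a percentage of total RPC calls,
# using the client rpc stats posted earlier in the thread.
calls=3431978
retrans=15968
awk -v c="$calls" -v r="$retrans" 'BEGIN { printf "%.2f%%\n", 100 * r / c }'
```

which prints 0.47%, i.e. roughly the 0.5% Trond calls reasonable.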

Cheers,
Trond



2003-11-28 08:54:27

by Juergen Sauer

[permalink] [raw]
Subject: Re: NFS server not responding


On Thursday, 27 November 2003 at 13:00, Douglas Furlong wrote:
> Good day all.
>
> I am running into excessive amounts of NFS errors, as below.
>
> kernel: nfs: server neon not responding, still trying
> kernel: nfs: server neon OK
>
> I was hoping that some of you may be able to provide me with some
> assistance.

Hi Doug, hi Trond,

Is there any NVIDIA hardware in the client/server?
The NVIDIA drivers for the graphics and the NVIDIA net drivers
are broken for 2.4.[20|21|22]!
The nv.o module for XFree86 breaks NFS, IMHO.

I sent a bug report to the NVIDIA maintainers.

Refer to the "NFS server not responding" threads from the last month.

Regards,
Jürgen
automatiX Linux Support Crew
--
Jürgen Sauer - AutomatiX GmbH, +49-4209-4699, [email protected] **
** Das Linux Systemhaus - Service - Support - Server - Lösungen **
** http://www.automatix.de ICQ: #344389676 **




2003-11-28 09:37:36

by Douglas Furlong

[permalink] [raw]
Subject: Re: NFS server not responding

On Fri, 2003-11-28 at 08:46, Juergen Sauer wrote:
> On Thursday, 27 November 2003 at 13:00, Douglas Furlong wrote:
> > Good day all.
> >
> > I am running into excessive amounts of NFS errors, as below.
> >
> > kernel: nfs: server neon not responding, still trying
> > kernel: nfs: server neon OK
> >
> > I was hoping that some of you may be able to provide me with some
> > assistance.
>
> Hi Doug, hi Trond,
>
> Is there any NVIDIA hardware in the client/server?
> The NVIDIA drivers for the graphics and the NVIDIA net drivers
> are broken for 2.4.[20|21|22]!
> The nv.o module for XFree86 breaks NFS, IMHO.
>
> I sent a bug report to the NVIDIA maintainers.
>
> Refer to the "NFS server not responding" threads from the last month.

Morning Jürgen

I do indeed have an nvidia card in the machine, which is a nVidia
Corporation NV11 [GeForce2 MX/MX 400] (rev b2).

However, lsmod shows the nv.o driver is not loaded (the machine
goes to runlevel 3, not 5).

Am I right in saying this should mean the problem does not exist?

I will go and have a look at those archives (only been on the list for a
week).

Doug




2003-11-28 10:24:15

by Juergen Sauer

[permalink] [raw]
Subject: Re: NFS server not responding

On Friday, 28 November 2003 at 10:37, Douglas Furlong wrote:
Morning Doug,

> I do indeed have an nvidia card in the machine, which is a nVidia
> Corporation NV11 [GeForce2 MX/MX 400] (rev b2).

> However, lsmod shows the nv.o driver is not loaded (the machine
> goes to runlevel 3, not 5).
Is the kernel module "nvidia" loaded?
Mathew McNally reported the same thing: problems with NVIDIA graphics
disturbing NFS/networking.

Perhaps I should be more exact: my client here has an ASUS A7N8X
board. It has an NVIDIA nForce2 chipset, an NVIDIA network chip,
and an AGP NVIDIA GeForce 4X card.
Using 2.4.18-XFS all is fine, except the speed of the IDE system.
Using 2.4.22-XFS the IDE speed is mostly fine and the system runs fine,
except for "NFS server not responding"; those errors come and go
quickly. nfsstat shows a lot of retrans.

> Am I right in saying this should mean the problem does not exist?
The problem exists - definitely.
But it is possible to configure things so that it does not hurt too
much: by lowering to rsize=4096,wsize=4096 I got a compromise between
speed and "NFS server ...".

I think the only solution is to send bug reports to NVIDIA.
(Closed source... in open source we would have fixed this junk already.)
=20
> I will go and have a look at those archives (only been on the list for a
> week).

Regards,
Jürgen
automatiX Linux Support Crew
--
Jürgen Sauer - AutomatiX GmbH, +49-4209-4699, [email protected] **
** Das Linux Systemhaus - Service - Support - Server - Lösungen **
** http://www.automatix.de ICQ: #344389676 **




2003-11-28 10:49:01

by Douglas Furlong

[permalink] [raw]
Subject: Re: NFS server not responding

On Fri, 2003-11-28 at 10:11, Juergen Sauer wrote:
> On Friday, 28 November 2003 at 10:37, Douglas Furlong wrote:
> Morning Doug,
>
> > I do indeed have an nvidia card in the machine, which is a nVidia
> > Corporation NV11 [GeForce2 MX/MX 400] (rev b2).
>
> > However, lsmod shows the nv.o driver is not loaded (the machine
> > goes to runlevel 3, not 5).
> Is the kernel module "nvidia" loaded?
> Mathew McNally reported the same thing: problems with NVIDIA graphics
> disturbing NFS/networking.
>
> Perhaps I should be more exact: my client here has an ASUS A7N8X
> board. It has an NVIDIA nForce2 chipset, an NVIDIA network chip,
> and an AGP NVIDIA GeForce 4X card.
> Using 2.4.18-XFS all is fine, except the speed of the IDE system.
> Using 2.4.22-XFS the IDE speed is mostly fine and the system runs fine,
> except for "NFS server not responding"; those errors come and go
> quickly. nfsstat shows a lot of retrans.

What, in your opinion, is a lot of retransmissions? Today I am seeing
around 0.7%.
Client rpc stats:
calls retrans authrefrsh
23695 166 0

I am using the standard kernel provided by Red Hat for this machine.

>
> > Am I right in saying this should mean the problem does not exist?
> The Problem exists - definitely.
> But it's possible to configure that it does not hurt too much, by lowering
> rsize=4096,wsize=4096 I got a compromise between speed and "NFS server ...".
>
> I think the only solution is to send bugreports to NVIDIA.
> (Shit closed source, in OSS we had already fixed this junk).
>
> > I will go and have a look at those archives (only been on the list for a
> > week).
Neither the open-source nor the closed-source driver appears to be
loaded. Are these drivers only loaded when going to runlevel 5 (or
starting X manually)?

I have also started mounting the NFS volumes over TCP, and I am no
longer getting the "NFS server not responding" error messages.

However, if this is just hiding a problem, I would like to find out for
future reference.
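For reference, switching a mount from UDP to TCP is just a mount option on clients that support it; an illustrative fstab line (values mirror the earlier entries, the tcp option name is the standard one from nfs(5)):

```shell
# /etc/fstab entry mounting over TCP instead of the default UDP:
# neon:/mnt/raid/home  /home  nfs  tcp,rsize=8192,wsize=8192,hard,intr  0 0
```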

doug




2003-11-28 12:29:02

by Bogdan Costescu

[permalink] [raw]
Subject: Re: NFS server not responding

On Fri, 28 Nov 2003, Douglas Furlong wrote:

> On Fri, 2003-11-28 at 10:11, Juergen Sauer wrote:
> > Using 2.4.18-XFS all is fine, except the speed of the IDE System,
> > Using 2.4.22-XFS mostly IDE Speed is fine, System runs fine,

There's a big time and code difference between 2.4.18 and 2.4.22.

> What in your opinion is a lot of retransmissions? Today I am seeing
> around 0.7%.

I also see something like 0.8-1% retransmissions and these messages on
newly installed Fedora Core 1 on some cluster nodes, using default r/wsize
(8192). As I'm using root-NFS, the node is quite useless when this
situation happens. I'm sure that the network is not the problem in my
case. The nodes used to run various kernels between 2.4.9 and 2.4.18; they
are now running the FC1 kernel, recompiled with the config changed to add
root FS on NFS and IP autoconfig, and to include the 3c59x driver in the
kernel.
The NFS server was recently upgraded to a faster CPU and disk system. It
used to run whatever kernel updates Red Hat released and now it's also
running FC1 with its default kernel (2.4.22-based).

So far, I haven't had time to look at the conditions under which this
happens. One sure way to trigger it, however, is to leave the default Red
Hat cron jobs enabled on several tens of time-synchronized nodes all
having their root FS exported from a single server: the "slocate" daily
cron job will create serious NFS activity. This did not happen with the
older setup (RH kernels on the server and 2.4.9-2.4.18 kernels on the
clients).

The load on the server when simultaneously rebooting several tens of nodes
goes up to 10-12, while previously it was 3-5. IMHO, this points more to a
slower/less-efficient NFS daemon or to a more aggressive client (but one
which gives up more easily afterwards, as seen from the logged messages).

> I am using the standard kernel provided by redhat for this machine.

Might the Red Hat kernel be the problem? I can't test other kernels for
the moment...

> > But it's possible to configure that it does not hurt too much, by lowering
> > rsize=4096,wsize=4096 I got a compromise between speed and "NFS server ...".

Or, as Trond suggested, increase "retrans"; I'm actually booting these
nodes with "intr,v3,timeo=15,retrans=7" on the kernel command line, and
the messages don't appear as often as with the default values. I haven't
got any clue as to how to choose the values, only that the documentation
said "increase".

> Neither the open-source nor closed source drivers appear to be loaded,
> are these drivers only loaded when going to runlevel 5 (or starting x
> manually?)?

Also in my case there's no NVIDIA at all (AMD chipset, 3C905C NIC, cheap
ATI graphics which is used only in text mode).

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]




2003-11-28 12:36:22

by Bogdan Costescu

[permalink] [raw]
Subject: Re: NFS server not responding

On Fri, 28 Nov 2003, Juergen Sauer wrote:

> Mathew McNally reported the same thing, problems with Nvidia Graphic disturbs
> NFS/Network.

I haven't seen this report, but you probably mean "disturbs Nvidia-based
networking", presumably in a combo of Nvidia chipset, NIC and graphics
chip. I can tell you that various computers I take care of with Nvidia
video cards (but no other Nvidia components) have no network problems when
the _video_ Nvidia driver is used; the NICs used in these computers are
3C905C, E1000 and SiS900-something.

AFAIK, on the netdev mailing list there were some messages about a new
open-source driver (forcedeth) for the Nvidia NICs. Try to look it up...

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]




2003-11-28 16:56:35

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS server not responding

>>>>> " " == Bogdan Costescu <[email protected]> writes:


> I also see something like 0.8-1% retransmissions and these
> messages on newly installed Fedora Core 1 on some cluster
> nodes, using default r/wsize (8192). As I'm using root-NFS, the
> node is quite useless when this situation happens. I'm sure

Huh? Why should a 1% retransmission rate make a noticeable difference? Be
realistic: we're talking about a delay of 100ms on 1/100 requests...
I get a ~2% retransmission rate when I do UDP loopback mounts without
seeing any problems at all: it still compares well to the same mount
using TCP.

Now it may be that the Fedora kernel has some other crap in it that is
screwing up interrupts & other such things (NAPI perhaps?). Has
anybody that is seeing these problems made a comparison with an
equivalent stock Marcelo kernel?

Cheers,
Trond



2003-11-28 18:43:29

by Bogdan Costescu

[permalink] [raw]
Subject: Re: NFS server not responding

On 28 Nov 2003, Trond Myklebust wrote:

> Huh? Why should a 1% retransmission make a noticable difference?

I think that I wasn't too clear in my previous message. I did not mean to
suggest that the two things (retransmission rate and "server not
responding") are strongly correlated; rather, I was providing another data
point and comparing with another setup using older kernels but the same
hardware. For example, one node has:

Client rpc stats:
calls retrans authrefrsh
4211065 38746 0

> uptime
19:07:14 up 9 days, 5:59, 1 user, load average: 1.00, 1.00, 1.00

and

> dmesg | grep -i "not responding" | wc -l
45

Probably about half of the "not responding" messages were generated by the
previously mentioned "slocate" cron job before I disabled it, and another
4-5 by another NFS server with user data that was unavailable at some
point. But the rest were generated at various times when the NFS server
was not so busy. It's clear from what I've seen so far that if only one
client is generating massive NFS traffic, the server copes with it well
and the client does not display the "not responding" messages; I've tried
manually running the "slocate" cron job and other stress tests and did not
get any such message. But I do get them when several tens of nodes do it
and, again, this did not happen with older kernels.

As I mentioned, I cannot get more details at the moment, as I'm in the
middle of a big software and hardware update. With the current settings
things seem to work, so people can continue their work and I'll debug
these problems later... hopefully ;-)

> I get ~2% retransmission rate when I do UDP loopback mounts without
> seeing any problems at all: it still compares well to the same mount
> using TCP.

I don't think that we disagree here :-)
I don't see anything wrong with having some retransmissions, unless they
amount to several tens of percent of the total number of calls. The small
percentage of retransmissions doesn't bother me; the large number of "not
responding" messages does... I know I can always increase the "retrans"
and "timeo" parameters to something very big, but I didn't need to do
that before...

> Now it may be that the Fedora kernel has some other crap in it that is
> screwing up interrupts & other such things (NAPI perhaps?).

NAPI for the 3c59x driver used in this node doesn't exist. You can take my
word for it :-)
But I cannot say anything about the rest... OTOH, testing with a vanilla
kernel on Fedora might break some things, especially threaded
applications, as glibc expects NPTL support in the kernel; the answer to
this question on the Red Hat lists doesn't get any clearer than that.

> Has anybody that is seeing these problems made a comparison with an
> equivalent stock Marcelo kernel?

I know that I can't claim anything until I do that comparison. But at the
moment it's not possible. That's why I said it is just another data
point...

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]




2003-11-30 20:05:42

by seth vidal

[permalink] [raw]
Subject: Re: NFS server not responding


> The load on the server when simultaneously rebooting several tens of nodes
> goes up to 10-12, while previously it was 3-5. IMHO, this points more to a
> slower/less-efficient NFS daemon or to a more aggressive client (but one
> which gives up more easily afterwards, as seen from the logged messages).


It's the Red Hat kernel:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=100680

I bet it's related to that problem.

-sv





2003-12-04 17:16:21

by Steve Dickson

[permalink] [raw]
Subject: Re: NFS server not responding

Trond Myklebust wrote:

>Nobody on this list is directly responsible for the Fedora Core 1
>kernel, ......
>
>
This is not completely accurate... I am responsible for FC1...
And it is my goal to keep FC1 (and beyond) as stable as possible...

>I have no problems on *any* of the machines in the test-rigs I have at
>my disposition when using a standard 2.4.23 kernel. For the record,
>those few that I have used with the Fedora kernel have been fine too
>(though I haven't made any detailed tests of that)
>
>
I have not seen this either... FC1 does have the latest retrans improvements:
# 03/10/11 [email protected] 1.1148.17.3
# UDP round trip timer fix. ...

# 03/10/11 [email protected] 1.1148.17.2
# A request cannot be used as part of the RTO estimation ...

# 03/07/08 [email protected] 1.1003.1.58
# Back out some congestion control changes that were causing trouble...

With the only difference being that FC1 has a longer RTO_MIN:
-#define RPC_RTO_MIN (HZ/30)
+#define RPC_RTO_MIN (HZ/10)

And these patches did show a noticeable improvement in bringing down the
number of retrans, at least in my testing (without being overly verbose).

>So now, what have you tried in order to diagnose this problem?
>
>
ifconfig will show if the driver is dropping frames; netstat -s will show
if there are UDP fragmentation issues; and ethereal is good at showing
whether the IP stack or the network is dropping the packets...
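A toy version of the netstat -s check Steve mentions, run here against canned sample text so the parsing is visible (the sample numbers are invented, and the exact field wording varies between net-tools versions):

```shell
# Pull the UDP 'packet receive errors' counter out of netstat -s style output.
sample='Udp:
    37042672 packets received
    1325 packet receive errors
    37042000 packets sent'
echo "$sample" | awk '/packet receive errors/ { print $1 }'
```

A steadily climbing receive-error counter while NFS is under load would point at the server's UDP stack dropping datagrams.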

SteveD.





2003-12-04 17:36:45

by Steve Dickson

[permalink] [raw]
Subject: Re: NFS server not responding

Trond Myklebust wrote:

>Now it may be that the Fedora kernel has some other crap in it that is
>screwing up interrupts & other such things (NAPI perhaps?). Has
>anybody that is seeing these problems made a comparison with an
>equivalent stock Marcelo kernel?
>
>
For the record, here is the "crap" that's in FC1 and not in the stock
kernel:

From Trond's Tree:
linux-2.4.x-rdplus.dif
linux-2.4.x-cto.dif
linux-2.4.x-pathconf.dif

From -ac1 tree:
kmap() calls changed to kmap_atomic() calls

Patches posted to this list:
linux-2.4.21-nfs-accesscache.patch - reduces the number of otw
ACCESS calls
linux-2.4.20-nfs-ia64-EIO.patch - increase RPC_RTO_MIN to HZ/30

And here are the patches that are in the stock kernel
and not in FC1 (yet)...

# 03/10/11 [email protected] 1.1148.17.6
# Make the client act correctly if the RPC server's asserts
# that it does not support a given program, version or
# procedure call.

# 03/10/11 [email protected] 1.1148.17.1
# Fix a deadlock in the NFS asynchronous write code.

SteveD.





2003-12-04 18:40:04

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS server not responding

>>>>> " " == Steve Dickson <[email protected]> writes:

> linux-2.4.20-nfs-ia64-EIO.patch - increase RPC_RTO_MIN to
> HZ/30

Err.. That's a decrease...

You are setting the minimum timeout value at 1/30th of a second
instead of 1/10th of a second.
(FYI: HZ is the frequency of the timer interrupt. It tells you how many
jiffies make up 1 second.)


This might indeed explain why people are seeing an increase in resends
and 'server not responding' messages.

Cheers,
Trond



2003-12-04 19:10:48

by Steve Dickson

[permalink] [raw]
Subject: Re: NFS server not responding

--- linux-2.4.22/net/sunrpc/timer.c.org 2003-12-04 10:47:01.000000000 -0500
+++ linux-2.4.22/net/sunrpc/timer.c 2003-12-04 13:46:01.000000000 -0500
@@ -8,7 +8,7 @@

#define RPC_RTO_MAX (60*HZ)
#define RPC_RTO_INIT (HZ/5)
-#define RPC_RTO_MIN (HZ/30)
+#define RPC_RTO_MIN (HZ/10)

void
rpc_init_rtt(struct rpc_rtt *rt, long timeo)


Attachments:
linux-2.4.33-nfs-rtomin.patch (336.00 B)

2003-12-04 20:55:28

by seth vidal

[permalink] [raw]
Subject: Re: NFS server not responding

> Well... it was an increase at the time I posted the patch.
> If I remember correctly... there was actually some discussion
> that 1/30th of a second was a bit too long...
>
> >You are setting the minimum timeout value at 1/30th of a second
> >instead of 1/10th of a second.
> >
> >
> Right... I did miss this "minor" detail when I did the port....
>
> >This might indeed explain why people are seeing an increase in resends
> >and 'server not responding' messages.
> >
> >
> Most definitely... Here is the patch that should take care of the problem...
>

Steve,
Do you know if this change was in place when the 7.X kernels went from
2.4.18 to 2.4.20?

We started noticing a lot of nfs pain on udp connections when we went
from 2.4.18 to 2.4.20 from rh's kernels.
(in addition to the kscand nightmare)

-sv





2003-12-04 21:23:34

by Steve Dickson

[permalink] [raw]
Subject: Re: NFS server not responding

seth vidal wrote:

> Do you know if this change was in place when the 7.X kernels went from
>2.4.18 to 2.4.20?
>
>
No, it was not...

In early kernels (i.e. pre-2.4.20) RPC_RTO_MIN is not relative to HZ:

#define RPC_RTO_MIN (2)

which causes the min timeout to be too small (especially on ia64 archs).
So my patch (to the 2.4.20 kernels) made RPC_RTO_MIN relative to HZ
and increased the timeout a bit:

#define RPC_RTO_MIN (HZ/30)

Trond's patch increases the min even more (which is a good thing, imho):

#define RPC_RTO_MIN (HZ/10)


SteveD.




2003-12-05 15:51:02

by Bogdan Costescu

[permalink] [raw]
Subject: Re: NFS server not responding

On 4 Dec 2003, Trond Myklebust wrote:

> This might indeed explain why people are seeing an increase in resends
> and 'server not responding' messages.

It does not explain my case though... Going from <=2.4.18 to the FC1
kernel means an increase in timeout value.

But still no time to go deeper...

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]




2003-12-09 19:47:06

by Steve Dickson

[permalink] [raw]
Subject: Re: NFS server not responding


This is happening on a Fedora Core kernel, right?

If so, could you send me the exact steps you take to cause this
to happen....

SteveD.

Kyle Rose wrote:

>Got this oops using the SFS (http://www.fs.net/) userspace NFS client.
>(Basically, sfscd acts as an NFSv3 server so the SFS guys don't have
>to maintain separate kernel modules for every OS they want to support:
>instead, they use the kernel's native NFSv3 client support to populate
>the required mount points.)
>
>Dec 2 21:50:33 nausicaa kernel: Unable to handle kernel paging request at virtual address fffe4000
>Dec 2 21:50:33 nausicaa kernel: printing eip:
>Dec 2 21:50:33 nausicaa kernel: f8cf8896
>Dec 2 21:50:33 nausicaa kernel: *pde = 00003067
>Dec 2 21:50:33 nausicaa kernel: *pte = 00000000
>Dec 2 21:50:33 nausicaa kernel: Oops: 0000 [#1]
>Dec 2 21:50:33 nausicaa kernel: CPU: 1
>Dec 2 21:50:33 nausicaa kernel: EIP: 0060:[__crc_xfrm_state_register_afinfo+3825223/3984503] Tainted: PF
>Dec 2 21:50:33 nausicaa kernel: EFLAGS: 00210246
>Dec 2 21:50:33 nausicaa kernel: EIP is at nfs3_xdr_readdirres+0xf6/0x210 [nfs]
>Dec 2 21:50:33 nausicaa kernel: eax: fffe3ff8 ebx: fffe3fdc ecx: 00000002 edx: fffe4000
>Dec 2 21:50:33 nausicaa kernel: esi: fffe4000 edi: 00000017 ebp: fffe3000 esp: c3d9dae4
>Dec 2 21:50:33 nausicaa kernel: ds: 007b es: 007b ss: 0068
>Dec 2 21:50:33 nausicaa kernel: Process ls (pid: 3445, threadinfo=c3d9c000 task=f6d5b900)
>Dec 2 21:50:33 nausicaa kernel: Stack: c19eeee0 00000003 00000000 c3d9db88 dacff0d4 dacff110 dacff078 f8988131
>Dec 2 21:50:33 nausicaa kernel: dacff078 e06b847c c3d9dc78 c3d9dbe4 f8cf87a0 c3d9c000 c3d9db88 ffffe000
>Dec 2 21:50:33 nausicaa kernel: c3d9dc04 f898bc9e c3d9db88 00000090 00000090 c3d9c000 00000000 f6d5b900
>Dec 2 21:50:33 nausicaa kernel: Call Trace:
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+218850/3984503] call_decode+0xf1/0x210 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3824977/3984503] nfs3_xdr_readdirres+0x0/0x210 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+234063/3984503] __rpc_execute+0x21e/0x310 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [default_wake_function+0/32] default_wake_function+0x0/0x20
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+215983/3984503] rpc_call_sync+0x7e/0xc0 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+230769/3984503] rpc_run_timer+0x0/0x80 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3811947/3984503] nfs3_rpc_wrapper+0x3a/0x90 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3817856/3984503] nfs3_proc_readdir+0x14f/0x1c0 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [kmem_flagcheck+6/48] kmem_flagcheck+0x6/0x30
>Dec 2 21:50:33 nausicaa kernel: [invalidate_mapping_pages+93/256] invalidate_mapping_pages+0x5d/0x100
>Dec 2 21:50:33 nausicaa kernel: [radix_tree_insert+161/192] radix_tree_insert+0xa1/0xc0
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3761862/3984503] nfs_readdir_filler+0xa5/0x160 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [read_cache_page+114/560] read_cache_page+0x72/0x230
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3762807/3984503] nfs_readdir+0x186/0x730 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3761697/3984503] nfs_readdir_filler+0x0/0x160 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3813565/3984503] nfs3_proc_access+0x11c/0x150 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [buffered_rmqueue+195/336] buffered_rmqueue+0xc3/0x150
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3825505/3984503] nfs3_decode_dirent+0x0/0x250 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [vfs_readdir+126/128] vfs_readdir+0x7e/0x80
>Dec 2 21:50:33 nausicaa kernel: [filldir64+0/272] filldir64+0x0/0x110
>Dec 2 21:50:33 nausicaa kernel: [sys_getdents64+111/169] sys_getdents64+0x6f/0xa9
>Dec 2 21:50:33 nausicaa kernel: [filldir64+0/272] filldir64+0x0/0x110
>Dec 2 21:50:33 nausicaa kernel: [syscall_call+7/11] syscall_call+0x7/0xb
>Dec 2 21:50:33 nausicaa kernel:
>Dec 2 21:50:33 nausicaa kernel: Code: 8b 48 08 8d 50 0c 85 c9 74 07 8d 50 60 39 f2 77 3e 8b 02 83
>Dec 2 21:50:33 nausicaa kernel: <6>note: ls[3445] exited with preempt_count 1
>Dec 2 21:50:33 nausicaa kernel: bad: scheduling while atomic!
>Dec 2 21:50:33 nausicaa kernel: Call Trace:
>Dec 2 21:50:33 nausicaa kernel: [schedule+1554/1568] schedule+0x612/0x620
>Dec 2 21:50:33 nausicaa kernel: [reiserfs_commit_write+355/480] reiserfs_commit_write+0x163/0x1e0
>Dec 2 21:50:33 nausicaa kernel: [block_prepare_write+52/80] block_prepare_write+0x34/0x50
>Dec 2 21:50:33 nausicaa kernel: [generic_file_aio_write_nolock+1564/2976] generic_file_aio_write_nolock+0x61c/0xba0
>Dec 2 21:50:33 nausicaa kernel: [sock_def_readable+125/128] sock_def_readable+0x7d/0x80
>Dec 2 21:50:33 nausicaa kernel: [udp_queue_rcv_skb+449/704] udp_queue_rcv_skb+0x1c1/0x2c0
>Dec 2 21:50:33 nausicaa kernel: [ip_local_deliver+169/480] ip_local_deliver+0xa9/0x1e0
>Dec 2 21:50:33 nausicaa kernel: [ip_rcv+806/1110] ip_rcv+0x326/0x456
>Dec 2 21:50:33 nausicaa kernel: [generic_file_write_nolock+126/160] generic_file_write_nolock+0x7e/0xa0
>Dec 2 21:50:33 nausicaa kernel: [vt_console_print+97/752] vt_console_print+0x61/0x2f0
>Dec 2 21:50:33 nausicaa last message repeated 3 times
>Dec 2 21:50:33 nausicaa kernel: [generic_file_write+92/128] generic_file_write+0x5c/0x80
>Dec 2 21:50:33 nausicaa kernel: [reiserfs_file_write+1898/1905] reiserfs_file_write+0x76a/0x771
>Dec 2 21:50:33 nausicaa kernel: [printk+350/400] printk+0x15e/0x190
>Dec 2 21:50:33 nausicaa kernel: [__print_symbol+300/368] __print_symbol+0x12c/0x170
>Dec 2 21:50:33 nausicaa kernel: [__print_symbol+63/368] __print_symbol+0x3f/0x170
>Dec 2 21:50:33 nausicaa kernel: [syscall_call+7/11] syscall_call+0x7/0xb
>Dec 2 21:50:33 nausicaa kernel: [recalc_task_prio+142/432] recalc_task_prio+0x8e/0x1b0
>Dec 2 21:50:33 nausicaa kernel: [vt_console_print+97/752] vt_console_print+0x61/0x2f0
>Dec 2 21:50:33 nausicaa kernel: [process_timeout+0/16] process_timeout+0x0/0x10
>Dec 2 21:50:33 nausicaa kernel: [do_acct_process+639/656] do_acct_process+0x27f/0x290
>Dec 2 21:50:33 nausicaa kernel: [acct_process+67/96] acct_process+0x43/0x60
>Dec 2 21:50:33 nausicaa kernel: [do_exit+117/944] do_exit+0x75/0x3b0
>Dec 2 21:50:33 nausicaa kernel: [do_page_fault+0/1268] do_page_fault+0x0/0x4f4
>Dec 2 21:50:33 nausicaa kernel: [die+225/240] die+0xe1/0xf0
>Dec 2 21:50:33 nausicaa kernel: [do_page_fault+611/1268] do_page_fault+0x263/0x4f4
>Dec 2 21:50:33 nausicaa kernel: [udp_sendmsg+429/2160] udp_sendmsg+0x1ad/0x870
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+267265/3984503] xdr_sendpages+0xe0/0x2b0 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [do_page_fault+0/1268] do_page_fault+0x0/0x4f4
>Dec 2 21:50:33 nausicaa kernel: [error_code+45/56] error_code+0x2d/0x38
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3825223/3984503] nfs3_xdr_readdirres+0xf6/0x210 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+218850/3984503] call_decode+0xf1/0x210 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3824977/3984503] nfs3_xdr_readdirres+0x0/0x210 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+234063/3984503] __rpc_execute+0x21e/0x310 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [default_wake_function+0/32] default_wake_function+0x0/0x20
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+215983/3984503] rpc_call_sync+0x7e/0xc0 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+230769/3984503] rpc_run_timer+0x0/0x80 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3811947/3984503] nfs3_rpc_wrapper+0x3a/0x90 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3817856/3984503] nfs3_proc_readdir+0x14f/0x1c0 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [kmem_flagcheck+6/48] kmem_flagcheck+0x6/0x30
>Dec 2 21:50:33 nausicaa kernel: [invalidate_mapping_pages+93/256] invalidate_mapping_pages+0x5d/0x100
>Dec 2 21:50:33 nausicaa kernel: [radix_tree_insert+161/192] radix_tree_insert+0xa1/0xc0
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3761862/3984503] nfs_readdir_filler+0xa5/0x160 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [read_cache_page+114/560] read_cache_page+0x72/0x230
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3762807/3984503] nfs_readdir+0x186/0x730 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3761697/3984503] nfs_readdir_filler+0x0/0x160 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3813565/3984503] nfs3_proc_access+0x11c/0x150 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [buffered_rmqueue+195/336] buffered_rmqueue+0xc3/0x150
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3825505/3984503] nfs3_decode_dirent+0x0/0x250 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [vfs_readdir+126/128] vfs_readdir+0x7e/0x80
>Dec 2 21:50:33 nausicaa kernel: [filldir64+0/272] filldir64+0x0/0x110
>Dec 2 21:50:33 nausicaa kernel: [sys_getdents64+111/169] sys_getdents64+0x6f/0xa9
>Dec 2 21:50:33 nausicaa kernel: [filldir64+0/272] filldir64+0x0/0x110
>Dec 2 21:50:33 nausicaa kernel: [syscall_call+7/11] syscall_call+0x7/0xb
>Dec 2 21:50:33 nausicaa kernel:
>Dec 2 21:50:33 nausicaa kernel: bad: scheduling while atomic!
>Dec 2 21:50:33 nausicaa kernel: Call Trace:
>Dec 2 21:50:33 nausicaa kernel: [schedule+1554/1568] schedule+0x612/0x620
>Dec 2 21:50:33 nausicaa kernel: [zap_pmd_range+75/112] zap_pmd_range+0x4b/0x70
>Dec 2 21:50:33 nausicaa kernel: [free_pages_and_swap_cache+86/144] free_pages_and_swap_cache+0x56/0x90
>Dec 2 21:50:33 nausicaa kernel: [unmap_vmas+527/688] unmap_vmas+0x20f/0x2b0
>Dec 2 21:50:33 nausicaa kernel: [exit_mmap+222/528] exit_mmap+0xde/0x210
>Dec 2 21:50:33 nausicaa kernel: [mmput+98/176] mmput+0x62/0xb0
>Dec 2 21:50:33 nausicaa kernel: [do_exit+299/944] do_exit+0x12b/0x3b0
>Dec 2 21:50:33 nausicaa kernel: [do_page_fault+0/1268] do_page_fault+0x0/0x4f4
>Dec 2 21:50:33 nausicaa kernel: [die+225/240] die+0xe1/0xf0
>Dec 2 21:50:33 nausicaa kernel: [do_page_fault+611/1268] do_page_fault+0x263/0x4f4
>Dec 2 21:50:33 nausicaa kernel: [udp_sendmsg+429/2160] udp_sendmsg+0x1ad/0x870
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+267265/3984503] xdr_sendpages+0xe0/0x2b0 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [do_page_fault+0/1268] do_page_fault+0x0/0x4f4
>Dec 2 21:50:33 nausicaa kernel: [error_code+45/56] error_code+0x2d/0x38
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3825223/3984503] nfs3_xdr_readdirres+0xf6/0x210 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+218850/3984503] call_decode+0xf1/0x210 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3824977/3984503] nfs3_xdr_readdirres+0x0/0x210 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+234063/3984503] __rpc_execute+0x21e/0x310 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [default_wake_function+0/32] default_wake_function+0x0/0x20
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+215983/3984503] rpc_call_sync+0x7e/0xc0 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+230769/3984503] rpc_run_timer+0x0/0x80 [sunrpc]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3811947/3984503] nfs3_rpc_wrapper+0x3a/0x90 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3817856/3984503] nfs3_proc_readdir+0x14f/0x1c0 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [kmem_flagcheck+6/48] kmem_flagcheck+0x6/0x30
>Dec 2 21:50:33 nausicaa kernel: [invalidate_mapping_pages+93/256] invalidate_mapping_pages+0x5d/0x100
>Dec 2 21:50:33 nausicaa kernel: [radix_tree_insert+161/192] radix_tree_insert+0xa1/0xc0
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3761862/3984503] nfs_readdir_filler+0xa5/0x160 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [read_cache_page+114/560] read_cache_page+0x72/0x230
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3762807/3984503] nfs_readdir+0x186/0x730 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3761697/3984503] nfs_readdir_filler+0x0/0x160 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3813565/3984503] nfs3_proc_access+0x11c/0x150 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [buffered_rmqueue+195/336] buffered_rmqueue+0xc3/0x150
>Dec 2 21:50:33 nausicaa kernel: [__crc_xfrm_state_register_afinfo+3825505/3984503] nfs3_decode_dirent+0x0/0x250 [nfs]
>Dec 2 21:50:33 nausicaa kernel: [vfs_readdir+126/128] vfs_readdir+0x7e/0x80
>Dec 2 21:50:33 nausicaa kernel: [filldir64+0/272] filldir64+0x0/0x110
>Dec 2 21:50:33 nausicaa kernel: [sys_getdents64+111/169] sys_getdents64+0x6f/0xa9
>Dec 2 21:50:33 nausicaa kernel: [filldir64+0/272] filldir64+0x0/0x110
>Dec 2 21:50:33 nausicaa kernel: [syscall_call+7/11] syscall_call+0x7/0xb
>
>I don't really have any other interesting information to share at the
>moment. I can reproduce this reliably by accessing an SFS share,
>waiting (say) 15 minutes, and then trying to access it again,
>presumably after it has timed out.
>
>I cannot reproduce this with vanilla NFS, but this is essentially
>irrelevant to the kernel's correctness: a userspace program should
>never be able to cause the kernel to panic, no matter how ill-behaved
>it is (short of mucking directly with /proc/k{core,mem}).
>
>Suggestions? SFS is basically unusable for me until this is fixed,
>which is unfortunate since I use it as my main file server. It
>probably has nothing to do with the server: 2.4 clients can access a
>2.6 server just fine. It may also have something to do with my
>particular setup, so I'm attaching my kernel config. My hardware
>platform is:
>
>AMD Dual Opteron 244
>Tyan Thunder K8W
>1GB 333MHz SDRAM
>
>Kernel is compiled with -march=athlon.
>
>Cheers,
>Kyle
>
>




2003-12-09 20:09:31

by Kyle Rose

[permalink] [raw]
Subject: Re: NFS server not responding

Steve Dickson <[email protected]> writes:

> This is happening on a Fedora Core kernel, right?

No, this is a vanilla 2.6.0-test11. Upon review of my email, I can't
believe I didn't mention the kernel version. :)

> If so, Could you send me the exact steps you do to cause this
> to happen....

Certainly. Compile and reboot. NFS comes up, after which SFS comes
up:

/opt/sfs/bin/sfscd

Then, I log in as krose and

cd music

where music is a symlink to
sfs/kushana.valley-of-wind.krose.org/music, the first two parts of
which are a symlink to
/sfs/@kushana.valley-of-wind.krose.org,jc72upywax7dsvd7rwpbvrfwpq4j2w7e.
So, in effect, I end up in
/sfs/@kushana.valley-of-wind.krose.org,jc72upywax7dsvd7rwpbvrfwpq4j2w7e/music.

Then I type ls, and get a segfault and an oops in dmesg. (Sometimes
it succeeds the first time, but it always segfaults when I perform the
same steps a few minutes later.) After this, NFS and/or SFS appear to
be wedged in a bad state, because future requests to SFS don't work.
Stopping SFS isn't possible either.

I'm not really sure what kind of detail you're looking for, so please
feel free to be more specific if you want/need more information.

Cheers,
Kyle



2003-12-01 10:58:45

by Bogdan Costescu

[permalink] [raw]
Subject: Re: NFS server not responding

On Sun, 30 Nov 2003, seth vidal wrote:

> I bet it's related to that problem.

Nope... The NFS server for this cluster is a single AMD Athlon with 256MiB
RAM. I did not see any kscand or equivalent taking so much CPU as
described in the bug reports. When the load is high, all top users are
nfsd threads.
One of the reports however reminded me of the readahead discussion. I did
some tests some time ago with 2.4.20-based kernel and did not see much
difference, however I will try it now too and write back if I see some
advantage.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]




2003-12-02 14:37:57

by Douglas Furlong

[permalink] [raw]
Subject: Re: NFS server not responding

On Fri, 2003-11-28 at 16:56, Trond Myklebust wrote:
> >>>>> " " == Bogdan Costescu <[email protected]> writes:
>
>
> > I also see something like 0.8-1% retransmissions and these
> > messages on newly installed Fedora Core 1 on some cluster
> > nodes, using default r/wsize (8192). As I'm using root-NFS, the
> > node is quite useless when this situation happens. I'm sure
>
> Huh? Why should a 1% retransmission make a noticable difference? Be
> realistic: we're talking about a delay of 100ms on 1/100 requests...

If this were the case then I would agree that there is no problem at all,
but I am noticing delays of three or four seconds when opening a new
mail in Evolution, or when downloading new mail off the IMAP server (which
gets stored in the user's home directory on the NFS server). When typing
a mail I find the text freezes for several seconds, which is fine for me
(I touch type with accuracy), but people who are less confident working
on a PC (read: most people I deal with) find this sort of behaviour
unacceptable, and I agree with them.

I have found that all of these errors coincide with the NFS server not
responding messages. Before making the changes to the retrans
values I was finding messages appearing as "blank" in Evolution: the
initial download from the IMAP server would fail due to not being able
to write to disk, but Evolution would think that it had succeeded and
would just show empty emails (exceedingly annoying).
Now I am not receiving any error messages, just moments when applications
"freeze"; the rest of the system is fine, and I just have to give it a
few seconds and all is back to normal.

> I get ~2% retransmission rate when I do UDP loopback mounts without
> seeing any problems at all: it still compares well to the same mount
> using TCP.

I thought I had enabled this, but it turns out I have not, as I first
need to enable NFS over TCP on the server (I think). I have not had a
chance to do that yet.
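[Editor's note: not part of the original mail — once the server-side kernel
exports NFS over TCP, selecting it from the client is a mount option. A
hypothetical /etc/fstab entry for the home export listed earlier (the option
values here are illustrative, not from the thread):]

```
neon:/mnt/raid/home  /home  nfs  tcp,rsize=8192,wsize=8192,hard,intr  0 0
```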

Douglas




2003-12-02 15:37:22

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS server not responding

>>>>> " " == Douglas Furlong <[email protected]> writes:

> If this was the case then i would agree that there is no
> problem at all, but I am noticing delays of three or four
> seconds when opening up a new mail in Evolution, or downloading
> new mail off of the IMAP server (which get stored in the users
> home directory on the NFS server). When typing in to a mail I
> will find the text freezes for several seconds, which is fine
> for me (touch type with accuracy) but other people that are
> less secure working on PC (read most people I deal with), they
> find this sort of behaviour unacceptable (which i agree with).

Nobody on this list is directly responsible for the Fedora Core 1
kernel, so whining about what is or isn't acceptable in it won't help.

I have no problems on *any* of the machines in the test rigs I have at
my disposal when using a standard 2.4.23 kernel. For the record,
those few that I have used with the Fedora kernel have been fine too
(though I haven't made any detailed tests of that).

> I have found that all of these error's coincide with the NFS
> server not responding error messages. Before making the changes

That's no surprise, but a <1-2% retransmission frequency
_DOES_NOT_SUFFICE_ to explain "NFS server not responding" messages. If
those retransmissions are randomly distributed (as they normally should
be) then we're talking unnoticeable delays.

If, OTOH, the retransmissions are all occurring at once, then that
might explain it ('cos retransmissions follow an exponential rule
w.r.t. timeouts). Such behaviour would indicate a serious bug, but
you still need to identify where: it could be a NIC driver bug, a
problem with the scheduler, a hang somewhere, somebody disabling
interrupts for long periods of time...
...or it could be an external problem.

So now, what have you tried in order to diagnose this problem? Have
you looked at changing NICs, switches etc? Have you tried alternative
kernel builds w/o all the Fedora NPTL+scheduling stuff (e.g. stock
2.4.23)?

Cheers,
Trond

