2003-07-10 05:26:44

by Jamie Lokier

[permalink] [raw]
Subject: NFS client errors with 2.5.74?

I'm seeing quite a lot of NFS client errors with 2.5.74, connected to
a server running 2.4.20-18.9 (Red Hat 9's current kernel).

All of the errors that I've observed the form of a write() or close()
returning EIO. rsync seeems to have a particularly tough time -
could the unusual size of blocks which rsync writes be relevant?

There are some read errors too, as Mozilla failed to find my profile
claiming it couldn't read the file (when I restarted Mozilla, it found
it the second time), and Gnome Terminal was unable to read its
preferences file, but I didn't catch any specific read() errors.

I tried this command to see if the is a protocol error while running
Ethereal:

[jamie@dual jamie]$ cp .mirMail.bjl1/INBOX .mirMail.bjl1/JBOX
cp: closing `.mirMail.bjl1/JBOX': Input/output error

Ethereal shows a long series of NFS WRITEs followed by a single COMMIT
(as expected), and nothing after that. All of those replies had
status 0, OK - including the reply to the final COMMIT.

Using "md5sum" I checked on both the client and server, and the
contents of "JBOX" have been written correctly despite the
"Input/output error" from "cp".

When I looked at the packets from rsync commands that I saw getting
EIO from write(), I saw that all requests except LOOKUP returned with
status 0 as well, and the LOOKUPs that failed were simply checking
whether a ".nfs..." name exists before renaming another file to that
name.

So as far as I can tell, there is no problem with the packets being
sent and received. There is, however, a big problem with error
reporting on the client side.

A few extra notes:

- Running NFS v3.
- Kernel is vanilla 2.5.74, dual Athlon 1800MP, 768MB RAM
- Chipset: AMD-760MP/AMD-768. Board: Asus A7M266-D.

- I _think_ I noticed this problem once when running 2.4.20-18.9 on
the client, but generally it is not a problem. I have been
mounting my home directory over NFS for weeks with 2.4.20-18.9 as
client, so I would have noticed networking or server problems.

- Every so often, the client's kernel log gets:
kernel: nfs: server 192.168.1.1 not responding, timed out
I haven't caught one of those with Ethereal to find out what's
going on the wire. There are no other regular kernel messages.

Note! These are all logged at different times to the EIOs,
differet as in minutes away. The EIOs, btw, are reported
immediately; there is no pause waiting for a response from the
server.

Any idea about this? Is it a known problem?

Cheers,
-- Jamie


2003-07-10 05:53:08

by Jamie Lokier

[permalink] [raw]
Subject: Re: NFS client errors with 2.5.74?

Some more information:

The kernel messages "kernel: nfs: server 192.168.1.1 not responding,
timed out" do have some relationship with the EIO errors after all.

When I "ls" a directory for the first time (i.e. it's not in cache), I
get an EIO _every time_. It's the getdents64() call which returns EIO.

The second and subsequent times I list that directory, the listing is
fine. However if I pick another directory which isn't in cache yet,
getdents64() returns EIO.

A packet trace shows something interesting: duplicate requests.

In this case, I see four (4) READDIRPLUS requests with identical XIDs.
Ethereal says that all four are sent in 0.04 seconds.

Then I see four replies, of course with identical XIDs too. The
replies all have status OK. But four duplicate requests is mighty
suspicious.

-- Jamie

2003-07-10 09:10:58

by Felipe Alfaro Solana

[permalink] [raw]
Subject: Re: NFS client errors with 2.5.74?

On Thu, 2003-07-10 at 07:41, Jamie Lokier wrote:
> I'm seeing quite a lot of NFS client errors with 2.5.74, connected to
> a server running 2.4.20-18.9 (Red Hat 9's current kernel).
>
> All of the errors that I've observed the form of a write() or close()
> returning EIO. rsync seeems to have a particularly tough time -
> could the unusual size of blocks which rsync writes be relevant?
>
> There are some read errors too, as Mozilla failed to find my profile
> claiming it couldn't read the file (when I restarted Mozilla, it found
> it the second time), and Gnome Terminal was unable to read its
> preferences file, but I didn't catch any specific read() errors.
>
> I tried this command to see if the is a protocol error while running
> Ethereal:
>
> [jamie@dual jamie]$ cp .mirMail.bjl1/INBOX .mirMail.bjl1/JBOX
> cp: closing `.mirMail.bjl1/JBOX': Input/output error

Any chance you are using "hard" NFS mounts?

2003-07-10 10:00:51

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client errors with 2.5.74?

>>>>> " " == Jamie Lokier <[email protected]> writes:
> - Every so often, the client's kernel log gets:
> kernel: nfs: server 192.168.1.1 not responding, timed out

Sigh... I hate soft mounts... Have I said that before? 8-)

> Any idea about this? Is it a known problem?

I can never guarantee you perfect service with soft mounts (a 5 second
network patition/server congestion is all it takes) but I do have a
patch that just went into 2.4.22 that backs out some of the
Van Jacobson exponential backoff changes. This helps stabilize things
a lot.

I haven't yet had time to port that patch to 2.5.x, but the code
should be pretty much identical, so if you want to give it a go, then
here it is...

Cheers,
Trond

diff -u --recursive --new-file linux-2.4.21-20-rdplus/net/sunrpc/clnt.c linux-2.4.21-21-soft/net/sunrpc/clnt.c
--- linux-2.4.21-20-rdplus/net/sunrpc/clnt.c 2003-06-15 14:49:35.000000000 -0700
+++ linux-2.4.21-21-soft/net/sunrpc/clnt.c 2003-06-17 10:38:56.000000000 -0700
@@ -709,14 +709,14 @@

dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
if (clnt->cl_softrtry) {
- if (clnt->cl_chatty && !task->tk_exit)
+ if (clnt->cl_chatty)
printk(KERN_NOTICE "%s: server %s not responding, timed out\n",
clnt->cl_protname, clnt->cl_server);
rpc_exit(task, -EIO);
return;
}

- if (clnt->cl_chatty && !(task->tk_flags & RPC_CALL_MAJORSEEN) && rpc_ntimeo(&clnt->cl_rtt) > 7) {
+ if (clnt->cl_chatty && !(task->tk_flags & RPC_CALL_MAJORSEEN)) {
task->tk_flags |= RPC_CALL_MAJORSEEN;
printk(KERN_NOTICE "%s: server %s not responding, still trying\n",
clnt->cl_protname, clnt->cl_server);
diff -u --recursive --new-file linux-2.4.21-20-rdplus/net/sunrpc/xprt.c linux-2.4.21-21-soft/net/sunrpc/xprt.c
--- linux-2.4.21-20-rdplus/net/sunrpc/xprt.c 2003-06-15 15:14:14.000000000 -0700
+++ linux-2.4.21-21-soft/net/sunrpc/xprt.c 2003-06-17 09:48:18.000000000 -0700
@@ -1046,21 +1046,6 @@
}

/*
- * Exponential backoff for UDP retries
- */
-static inline int
-xprt_expbackoff(struct rpc_task *task, struct rpc_rqst *req)
-{
- int backoff;
-
- req->rq_ntimeo++;
- backoff = min(rpc_ntimeo(&task->tk_client->cl_rtt), XPRT_MAX_BACKOFF);
- if (req->rq_ntimeo < (1 << backoff))
- return 1;
- return 0;
-}
-
-/*
* RPC receive timeout handler.
*/
static void
@@ -1073,14 +1058,7 @@
if (req->rq_received)
goto out;

- if (!xprt->nocong) {
- if (xprt_expbackoff(task, req)) {
- rpc_add_timer(task, xprt_timer);
- goto out_unlock;
- }
- rpc_inc_timeo(&task->tk_client->cl_rtt);
- xprt_adjust_cwnd(req->rq_xprt, -ETIMEDOUT);
- }
+ xprt_adjust_cwnd(req->rq_xprt, -ETIMEDOUT);
req->rq_nresend++;

dprintk("RPC: %4d xprt_timer (%s request)\n",
@@ -1090,7 +1068,6 @@
out:
task->tk_timeout = 0;
rpc_wake_up_task(task);
-out_unlock:
spin_unlock(&xprt->sock_lock);
}

@@ -1228,16 +1205,17 @@
return;
out_receive:
dprintk("RPC: %4d xmit complete\n", task->tk_pid);
+ spin_lock_bh(&xprt->sock_lock);
/* Set the task's receive timeout value */
if (!xprt->nocong) {
task->tk_timeout = rpc_calc_rto(&clnt->cl_rtt,
rpcproc_timer(clnt, task->tk_msg.rpc_proc));
- req->rq_ntimeo = 0;
+ task->tk_timeout <<= clnt->cl_timeout.to_retries
+ - req->rq_timeout.to_retries;
if (task->tk_timeout > req->rq_timeout.to_maxval)
task->tk_timeout = req->rq_timeout.to_maxval;
} else
task->tk_timeout = req->rq_timeout.to_current;
- spin_lock_bh(&xprt->sock_lock);
/* Don't race with disconnect */
if (!xprt_connected(xprt))
task->tk_status = -ENOTCONN;

2003-07-10 15:59:48

by Frank Cusack

[permalink] [raw]
Subject: Re: NFS client errors with 2.5.74?

On Thu, Jul 10, 2003 at 12:15:24PM +0200, Trond Myklebust wrote:
> >>>>> " " == Jamie Lokier <[email protected]> writes:
> > - Every so often, the client's kernel log gets:
> > kernel: nfs: server 192.168.1.1 not responding, timed out
>
> Sigh... I hate soft mounts... Have I said that before? 8-)

Is "timed out" the key that this is a soft mount? Trond, I would suggest
that whenever you log this, you also log a message "... do not report
a bug, use a hard mount" or some such. :-)

/fc