2003-07-10 10:19:06

by Michael Frank

Subject: NFS client errors with 2.5.74?

I chewed on that a while ago and was advised not to use "soft" mounts.

My 2.4<>2.4 setup worked with "soft" for ages, but it broke when
running 2.5.6x/7x on the server.

If you are using soft mounts, use "hard,intr" instead.
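For anyone following along, the mount options being discussed look like this. A hedged sketch: server, export, and mountpoint are placeholders, not taken from the thread; the timeo/retrans values shown are the nfs(5) defaults mentioned later in this thread.

```shell
# Placeholder server:/export and /mnt/nfs; only the options matter here.
# "soft": RPC failures surface to applications as EIO once the
#         retransmit budget (retrans) is exhausted. timeo is in
#         tenths of a second, so timeo=7 is the 0.7s default.
mount -t nfs -o soft,timeo=7,retrans=3 server:/export /mnt/nfs

# "hard,intr": retry indefinitely, but allow signals (e.g. Ctrl-C)
#              to interrupt processes blocked on the mount.
mount -t nfs -o hard,intr server:/export /mnt/nfs
```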

Regards
Michael

--
Powered by linux-2.5.74-mm3. Compiled with gcc-2.95-3 - mature and rock solid

My current linux related activities:
- 2.5 yenta_socket testing
- Test development and testing of swsusp for 2.4/2.5 and ACPI S3 of 2.5 kernel
- Everyday usage of 2.5 kernel

More info on 2.5 kernel: http://www.codemonkey.org.uk/post-halloween-2.5.txt
More info on swsusp: http://sourceforge.net/projects/swsusp/


2003-07-10 14:50:06

by Jamie Lokier

Subject: Re: NFS client errors with 2.5.74?

> > kernel: nfs: server 192.168.1.1 not responding, timed out

Trond Myklebust wrote:
> I can never guarantee you perfect service with soft mounts (a 5 second
> network partition/server congestion is all it takes)

There is _no_ congestion to speak of: I type "ls directory" and it
returns EIO apparently immediately, from an otherwise idle network and
unloaded server.

The "server 192.168.1.1 not responding, timed out" message also
appears immediately in this case - there is no 5 second delay.

There is no 0.7 second delay either (the default value of "timeo"
according to nfs(5)). So the retransmission logic is buggered.

> Sigh... I hate soft mounts... Have I said that before? 8-)

If I switch to a hard mount ("hard,intr") the EIO errors go away.

However, the protocol problem remains: multiple READDIRPLUS calls with
the same xid in a fraction of a second. Note: there is no 0.7 second
delay between these packets. According to Ethereal, the gap between
duplicate requests is 0.01 to 0.1 seconds.
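The duplicate-xid observation can be reproduced from a packet capture. An illustrative sketch: the interface and filenames are placeholders, and tshark is the modern CLI descendant of the Ethereal tooling used in this thread.

```shell
# Capture NFS traffic (interface and output file are placeholders).
tcpdump -i eth0 -s 0 -w nfs.pcap port 2049

# List the RPC XIDs of captured calls and print only those that
# appear more than once, i.e. retransmitted requests.
tshark -r nfs.pcap -Y rpc.xid -T fields -e rpc.xid | sort | uniq -d
```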

There seems to be a transition from a state where calls with duplicate
xids are rare (but they do occur), to one where they occur on nearly
every request.

I have 768MB of RAM on the client, so I checked whether RAM being
filled makes a difference. Not really.

After mounting, if I do "ls -lR" then I see that duplicate xids are
rare for a while, then they become common. In this state, I still
have 400MB free (i.e. not even filled with clean pagecache pages), so
it is not an absolute shortage of RAM that triggers this, but something else.

I suspect the request timeout logic is buggered, and sending retries
too quickly - 0.01 to 0.1 seconds rather than 0.7 upwards. It would
also explain why "soft" is failing quickly: if the timeout logic
thinks it has already sent the maximum number of retries in a very
short time, it will count it as a timeout even though the server is
quite fast in responding.

It's interesting that this state can be reached even when the network,
client and server are idle and I try "ls directory" for some uncached
directory. This shows it's not purely a question of congestion, but
that even a fast response is not good enough.

-- Jamie

2003-07-10 15:21:25

by Jamie Lokier

Subject: Re: NFS client errors with 2.5.74?

Trond Myklebust wrote:
> >>>>> " " == Jamie Lokier <[email protected]> writes:
>
>
> > There is no 0.7 second delay either (the default value of
> > "timeo" according to nfs(5)). So the retransmission logic is
> > buggered.
>
> No. manpage is buggered. We use round trip time estimation now...
>
>
> > However, the protocol problem remains: multiple READDIRPLUS
> > calls with the same xid in a fraction of a second. Note: there
> > is no 0.7 second delay between these packets. According to
> > Ethereal, it is between
> > 0.01 and 0.1 seconds between duplicate requests.
>
> Yup. That's what the RTO does...

So the problem state arises when the round trip time converges on some
value which is too small? And in this state, "soft" returns EIO
within some ridiculously small timeout?

It is definitely too small because I see that the bulk of readdir
requests are fine for the first few thousand, but nearly all of them
have one or more duplicates after that. There is no justification for
those duplicates - all the original requests are getting replies.

It's possible that the server is taking longer to respond to
READDIRPLUS than to GETATTR. That would make sense: READDIRPLUS reads
inodes from disk, while GETATTR just reads them from memory. Let's
take a look...

Yup! In the traces, GETATTR responds consistently in 0.0002 seconds
(close to ping time, clearly no disk access), whereas READDIRPLUS
takes some 0.2 seconds to respond.

I see that the RTT estimation _is_ adapting quickly to this: the first
READDIRPLUS in a sequence has several duplicates; subsequent ones don't.

Suggestion:

- With RTT estimation, "not responding" should require a certain
amount of absolute time to pass, e.g. 5 seconds, not just a fixed
number of packets. Without this, "soft" is broken because the
time difference between a response for cached data and a response
from the disk is a factor of several hundred.

Note that I'm looking at "ls -R" now because it's easy to repeat, but
the EIO problem occurs with file reading and writing too.

-- Jamie


P.S. Another quirk is that I see the server sending more than one
duplicate reply to each duplicate READDIRPLUS request... Well, that's
a separate problem, I think :)

2003-07-10 15:31:08

by Trond Myklebust

Subject: Re: NFS client errors with 2.5.74?

>>>>> " " == Jamie Lokier <[email protected]> writes:

> all the original requests are getting replies.

OK. That *is* definitely a bug... Let me ponder a bit. I'm working
through that logic right now...

Cheers,
Trond

2003-07-10 23:22:38

by Trond Myklebust

Subject: Re: NFS client errors with 2.5.74?

diff -u --recursive --new-file linux-2.5.74-14-soft/include/linux/sunrpc/clnt.h linux-2.5.74-15-mem/include/linux/sunrpc/clnt.h
--- linux-2.5.74-14-soft/include/linux/sunrpc/clnt.h 2003-02-13 13:57:46.000000000 +0100
+++ linux-2.5.74-15-mem/include/linux/sunrpc/clnt.h 2003-07-11 00:48:39.000000000 +0200
@@ -35,6 +35,7 @@
*/
struct rpc_clnt {
atomic_t cl_users; /* number of references */
+ atomic_t cl_active; /* number of active calls */
struct rpc_xprt * cl_xprt; /* transport */
struct rpc_procinfo * cl_procinfo; /* procedure info */
u32 cl_maxproc; /* max procedure number */
@@ -57,6 +58,7 @@

struct rpc_portmap cl_pmap; /* port mapping */
struct rpc_wait_queue cl_bindwait; /* waiting on getport() */
+ wait_queue_head_t cl_waitq; /* wait queue */

int cl_nodelen; /* nodename length */
char cl_nodename[UNX_MAXNODENAME];
@@ -124,6 +126,15 @@
void rpc_clnt_sigmask(struct rpc_clnt *clnt, sigset_t *oldset);
void rpc_clnt_sigunmask(struct rpc_clnt *clnt, sigset_t *oldset);
void rpc_setbufsize(struct rpc_clnt *, unsigned int, unsigned int);
+int rpc_congestion_wait(struct rpc_clnt *);
+
+static inline void rpc_mark_active(struct rpc_task *task)
+{
+ struct rpc_clnt *clnt = task->tk_client;
+ task->tk_active = 1;
+ if (clnt)
+ atomic_inc(&clnt->cl_active);
+}

static __inline__
int rpc_call(struct rpc_clnt *clnt, u32 proc, void *argp, void *resp, int flags)
diff -u --recursive --new-file linux-2.5.74-14-soft/net/sunrpc/clnt.c linux-2.5.74-15-mem/net/sunrpc/clnt.c
--- linux-2.5.74-14-soft/net/sunrpc/clnt.c 2003-07-10 21:46:14.000000000 +0200
+++ linux-2.5.74-15-mem/net/sunrpc/clnt.c 2003-07-11 00:49:31.000000000 +0200
@@ -127,6 +127,7 @@
clnt->cl_prot = xprt->prot;
clnt->cl_stats = program->stats;
INIT_RPC_WAITQ(&clnt->cl_bindwait, "bindwait");
+ init_waitqueue_head(&clnt->cl_waitq);

if (!clnt->cl_port)
clnt->cl_autobind = 1;
@@ -389,6 +390,37 @@
}

/*
+ * Throttle the number of active RPC requests
+ */
+int
+rpc_congestion_wait(struct rpc_clnt *clnt)
+{
+ int ret = 0;
+ DECLARE_WAITQUEUE(wait, current);
+
+ if (atomic_read(&clnt->cl_active) < RPC_MAXCONG)
+ goto out;
+ add_wait_queue(&clnt->cl_waitq, &wait);
+ for (;;) {
+ if (clnt->cl_intr)
+ set_current_state(TASK_INTERRUPTIBLE);
+ else
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (atomic_read(&clnt->cl_active) < RPC_MAXCONG)
+ break;
+ if (clnt->cl_intr && signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ io_schedule();
+ }
+ current->state = TASK_RUNNING;
+ remove_wait_queue(&clnt->cl_waitq, &wait);
+out:
+ return ret;
+}
+
+/*
* Restart an (async) RPC call. Usually called from within the
* exit handler.
*/
diff -u --recursive --new-file linux-2.5.74-14-soft/net/sunrpc/sched.c linux-2.5.74-15-mem/net/sunrpc/sched.c
--- linux-2.5.74-14-soft/net/sunrpc/sched.c 2003-06-20 22:16:26.000000000 +0200
+++ linux-2.5.74-15-mem/net/sunrpc/sched.c 2003-07-11 00:47:23.000000000 +0200
@@ -257,13 +257,11 @@
return;
}
rpc_clear_sleeping(task);
- if (waitqueue_active(&rpciod_idle))
- wake_up(&rpciod_idle);
+ wake_up(&rpciod_idle);
}
} else {
rpc_clear_sleeping(task);
- if (waitqueue_active(&task->tk_wait))
- wake_up(&task->tk_wait);
+ wake_up(&task->tk_wait);
}
}

@@ -276,7 +274,7 @@
/* Don't run a child twice! */
if (RPC_IS_ACTIVATED(task))
return;
- task->tk_active = 1;
+ rpc_mark_active(task);
rpc_set_sleeping(task);
rpc_make_runnable(task);
}
@@ -289,8 +287,7 @@
{
if(rpciod_pid==0)
printk(KERN_ERR "rpciod: wot no daemon?\n");
- if (waitqueue_active(&rpciod_idle))
- wake_up(&rpciod_idle);
+ wake_up(&rpciod_idle);
}

/*
@@ -315,7 +312,7 @@

/* Mark the task as being activated if so needed */
if (!RPC_IS_ACTIVATED(task)) {
- task->tk_active = 1;
+ rpc_mark_active(task);
rpc_set_sleeping(task);
}

@@ -488,7 +485,8 @@
static int
__rpc_execute(struct rpc_task *task)
{
- int status = 0;
+ int interruptible = task->tk_client->cl_intr;
+ int status = 0;

dprintk("RPC: %4d rpc_execute flgs %x\n",
task->tk_pid, task->tk_flags);
@@ -547,14 +545,24 @@
}
spin_unlock_bh(&rpc_queue_lock);

- while (RPC_IS_SLEEPING(task)) {
+ if (RPC_IS_SLEEPING(task)) {
+ DEFINE_WAIT(wait);
+
/* sync task: sleep here */
dprintk("RPC: %4d sync task going to sleep\n",
task->tk_pid);
if (current->pid == rpciod_pid)
printk(KERN_ERR "RPC: rpciod waiting on sync task!\n");

- __wait_event(task->tk_wait, !RPC_IS_SLEEPING(task));
+ prepare_to_wait(&task->tk_wait, &wait,
+ interruptible ? TASK_INTERRUPTIBLE :
+ TASK_UNINTERRUPTIBLE);
+ if (likely(RPC_IS_SLEEPING(task))) {
+ if (likely(!(signalled() && interruptible)))
+ io_schedule();
+ }
+ finish_wait(&task->tk_wait, &wait);
+
dprintk("RPC: %4d sync task resuming\n", task->tk_pid);

/*
@@ -563,7 +571,7 @@
* clean up after sleeping on some queue, we don't
* break the loop here, but go around once more.
*/
- if (task->tk_client->cl_intr && signalled()) {
+ if (unlikely(signalled() && interruptible)) {
dprintk("RPC: %4d got signal\n", task->tk_pid);
task->tk_flags |= RPC_TASK_KILLED;
rpc_exit(task, -ERESTARTSYS);
@@ -620,7 +628,12 @@
goto out_err;
}

- task->tk_active = 1;
+ if (task->tk_client) {
+ status = rpc_congestion_wait(task->tk_client);
+ if (status < 0)
+ goto out_release;
+ }
+ rpc_mark_active(task);
rpc_set_running(task);
return __rpc_execute(task);
out_release:
@@ -818,8 +831,6 @@
/* Remove from any wait queue we're still on */
__rpc_remove_wait_queue(task);

- task->tk_active = 0;
-
spin_unlock_bh(&rpc_queue_lock);

/* Synchronously delete any running timer */
@@ -832,6 +843,10 @@
rpcauth_unbindcred(task);
rpc_free(task);
if (task->tk_client) {
+ if (task->tk_active) {
+ atomic_dec(&task->tk_client->cl_active);
+ wake_up(&task->tk_client->cl_waitq);
+ }
rpc_release_client(task->tk_client);
task->tk_client = NULL;
}
@@ -979,8 +994,20 @@
}

if (!rpciod_task_pending()) {
+ DEFINE_WAIT(wait);
+
dprintk("RPC: rpciod back to sleep\n");
- wait_event_interruptible(rpciod_idle, rpciod_task_pending());
+
+ prepare_to_wait(&rpciod_idle, &wait, TASK_INTERRUPTIBLE);
+ spin_lock_bh(&rpc_queue_lock);
+ if (likely(!rpciod_task_pending())) {
+ spin_unlock_bh(&rpc_queue_lock);
+ if (likely(!signalled()))
+ schedule();
+ } else
+ spin_unlock_bh(&rpc_queue_lock);
+ finish_wait(&rpciod_idle, &wait);
+
dprintk("RPC: switch to rpciod\n");
rounds = 0;
}


Attachments:
linux-2.5.74-13-resends.dif (6.19 kB)
patch 1/2
linux-2.5.74-15-mem.dif (6.44 kB)
patch 2/2

2003-07-12 14:59:02

by Jamie Lokier

Subject: Re: NFS client errors with 2.5.74?

Trond Myklebust wrote:
> The first one should fix the problem of the kernel missing replies
> while we are busy trying to resend a request.

This by itself doesn't fix the problem of too-fast timeout errors on
soft mounts (e.g. returning EIO within <0.1s).

I am still seeing the fs get into a state where each time a large file
is written, it reports EIO (but writes successfully anyway). And "ls
-R" still shows EIO errors also.

> The second, solves a problem of resource starvation. The fact that we
> can currently just submit arbitrary numbers of asynchronous requests
> means that we can exhaust resources to the point where the socket
> starts dropping replies.
> This patch limits the number of outstanding asynchronous requests to
> 16 per socket (the maximum number of xprt/transport slots).

I haven't tried this yet. It doesn't apply to 2.5.74 due to the calls
to io_schedule().

- Jamie

2003-07-13 13:08:40

by Jamie Lokier

Subject: Re: NFS client errors with 2.5.74?

Jamie Lokier wrote:
> I haven't tried this yet. It doesn't apply to 2.5.74 due to the calls
> to io_schedule().

I've tried both patches together now, on 2.5.75. Still seeing fast
(<0.1s) timeouts with "soft". Using "hard" seems fairly reliable, but
that was the same without the patches.

-- Jamie