2008-10-22 18:12:09

by Andy Adamson

Subject: [PATCH 0/3] NFSD EOS deferral

Here's a resend of the deferral patch set for review, responding to comments
on the last patch set.

A deferral occurs when NFSD needs information from an RPC cache and an upcall
is required. Instead of NFSD waiting for the upcall to fill the cache, the
RPC request is inserted back into the receive stream for processing at a
later time.
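
As a rough illustration of that cycle (a toy model only - the real entry
points are svc_defer(), svc_revisit() and svc_deferred_recv() in
net/sunrpc/svc_xprt.c, shown in patch 1):

#include <stdio.h>

enum cache_state { CACHE_EMPTY, CACHE_FILLED };

static int handle_request(enum cache_state cache)
{
	if (cache == CACHE_EMPTY) {
		/* svc_defer(): stash the request while the upcall runs */
		printf("deferred: waiting for upcall\n");
		return -1;	/* the nfsd thread moves on to other work */
	}
	/* svc_deferred_recv(): the saved request is replayed */
	printf("processed\n");
	return 0;
}

int main(void)
{
	/* svc_revisit() requeues the request once the cache is filled */
	if (handle_request(CACHE_EMPTY) < 0)
		handle_request(CACHE_FILLED);
	return 0;
}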

Exactly once semantics require that NFSD compound RPC deferral processing
restart at the operation that caused the deferral, instead of reprocessing the
full compound RPC from the start and possibly repeating operations that have
already been executed. These patches add three callbacks, a data pointer, and
dynamic page pointer storage to the sunrpc svc deferral architecture that NFSD
uses to accomplish this goal.
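
Concretely, the hooks have these shapes (signatures as added in patch 1;
rq_save_state lives in svc_rqst, the other two in svc_deferred_req):

	int  (*rq_save_state)(struct svc_rqst *, struct svc_deferred_req *);
	void (*restore_state)(struct svc_rqst *, struct svc_deferred_req *);
	void (*release_state)(struct svc_deferred_req *);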

Deferrals that do not define the callbacks behave as before. Care has been
taken to ensure that combinations of deferrals - those from the NFSv4 server
with the callbacks defined, and those from the RPC layer without them - work
together correctly.

NEW:

I've limited the number of pages held by all deferrals to the number of pages
in the rpc maximum payload - the same as adding another nfsd thread.
Most deferrals (most requests) fit in a page or two. We still want to service
the rare case where a large readdir or read reply is followed by a deferral;
this limit allows one such request to be serviced at a time.
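
Back-of-the-envelope, assuming 4K pages and a 1MB maximum rpc payload
(illustrative values only - the real limit is computed from
svc_max_payload() in patch 2):

#include <stdio.h>

#define PAGE_SIZE_BYTES	4096		/* assumed page size */
#define RPC_MAX_PAYLOAD	(1024 * 1024)	/* assumed maximum rpc payload */

int main(void)
{
	/* All outstanding deferrals together may hold this many pages:
	 * enough for one maximum-size reply, or hundreds of the typical
	 * one- or two-page deferrals. */
	printf("deferral page limit: %d pages\n",
	       RPC_MAX_PAYLOAD / PAGE_SIZE_BYTES);
	return 0;
}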

I've changed the deferral page pointer storage from static to dynamic.
The new svc_rqst fields are initialized at the beginning of svc_process
along with the rest of the fields.

As always, thoughts, comments and suggestions welcome.

-->Andy



2008-10-22 18:12:17

by Andy Adamson

Subject: [PATCH 1/3] SUNRPC add deferral processing callbacks

From: Andy Adamson <[email protected]>

For EOS, NFSD compound RPC deferred processing should restart at the
operation which caused the deferral.

Add a callback and a defer_data pointer in svc_rqst to enable svc_defer to
save partial result state in the deferral request.

Dynamically allocate page pointers in the save state callback.
A failure to save state frees the deferred request and returns NULL,
which signals the cache to return -ETIMEDOUT (NFSERR_DELAY).

Add page pointer storage to svc_deferred_req to cache the pages holding the
partially processed request.

Add callbacks and a defer_data pointer in svc_deferred_request to enable
svc_deferred_recv to restore and release the partial result state.

Signed-off-by: Andy Adamson <[email protected]>
---
 include/linux/sunrpc/svc.h |   10 ++++++++++
 net/sunrpc/svc.c           |    4 ++++
 net/sunrpc/svc_xprt.c      |   10 +++++++++-
 3 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 3afe7fb..8cc8a74 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -216,6 +216,9 @@ struct svc_rqst {
 	struct svc_cred		rq_cred;	/* auth info */
 	void *			rq_xprt_ctxt;	/* transport specific context ptr */
 	struct svc_deferred_req*rq_deferred;	/* deferred request we are replaying */
+	/* callback to save deferred request state */
+	int (*rq_save_state)(struct svc_rqst *, struct svc_deferred_req *);
+	void			*rq_defer_data;	/* defer state data to save */
 
 	size_t			rq_xprt_hlen;	/* xprt header len */
 	struct xdr_buf		rq_arg;
@@ -324,6 +327,13 @@ struct svc_deferred_req {
 	union svc_addr_u	daddr;		/* where reply must come from */
 	struct cache_deferred_req handle;
 	size_t			xprt_hlen;
+	/* callbacks to restore and release deferred request state
+	 * set in rq_save_state */
+	void (*restore_state)(struct svc_rqst *, struct svc_deferred_req *);
+	void (*release_state)(struct svc_deferred_req *);
+	void			*defer_data;	/* defer state data */
+	struct page		**respages;
+	int			respages_used;
 	int			argslen;
 	__be32			args[0];
 };
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 54c98d8..8a6c69c 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1024,6 +1024,10 @@ svc_process(struct svc_rqst *rqstp)
 	/* Will be turned off only in gss privacy case: */
 	rqstp->rq_splice_ok = 1;
 
+	/* Reset deferred processing */
+	rqstp->rq_defer_data = NULL;
+	rqstp->rq_save_state = NULL;
+
 	/* Setup reply header */
 	rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp);

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index bf5b5cd..0a8d6ab 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -903,6 +903,8 @@ static void svc_revisit(struct cache_deferred_req *dreq, int too_many)
 	struct svc_xprt *xprt = dr->xprt;
 
 	if (too_many) {
+		if (dr->release_state)
+			dr->release_state(dr);
 		svc_xprt_put(xprt);
 		kfree(dr);
 		return;
@@ -941,7 +943,7 @@ static struct cache_deferred_req *svc_defer(struct cache_req *req)
 	size_t size;
 	/* FIXME maybe discard if size too large */
 	size = sizeof(struct svc_deferred_req) + rqstp->rq_arg.len;
-	dr = kmalloc(size, GFP_KERNEL);
+	dr = kzalloc(size, GFP_KERNEL);
 	if (dr == NULL)
 		return NULL;
 
@@ -958,6 +960,10 @@ static struct cache_deferred_req *svc_defer(struct cache_req *req)
 		memcpy(dr->args, rqstp->rq_arg.head[0].iov_base - skip,
 		       dr->argslen << 2);
 	}
+	if (rqstp->rq_save_state && !rqstp->rq_save_state(rqstp, dr)) {
+		kfree(dr);
+		return NULL;
+	}
 	svc_xprt_get(rqstp->rq_xprt);
 	dr->xprt = rqstp->rq_xprt;
 
@@ -986,6 +992,8 @@ static int svc_deferred_recv(struct svc_rqst *rqstp)
 	rqstp->rq_xprt_hlen = dr->xprt_hlen;
 	rqstp->rq_daddr = dr->daddr;
 	rqstp->rq_respages = rqstp->rq_pages;
+	if (dr->restore_state)
+		dr->restore_state(rqstp, dr);
 	return (dr->argslen << 2) - dr->xprt_hlen;
 }

--
1.5.4.3


2008-10-22 18:12:21

by Andy Adamson

Subject: [PATCH 2/3] NFSD save, restore, and release deferred result pages

From: Andy Adamson <[email protected]>

Dynamically allocate the deferred request page pointer array only if
there are enough deferral pages available.

Limit the number of available deferral pages to the number of pages in the
maximum rpc payload. This allows for one deferral at a time of a request
that requires the maximum payload. Most deferrals require a single page.

Implement the rq_save_state, restore_state, and release_state RPC deferral
callbacks. Save the reply pages in struct svc_deferred_req.

Clear the svc_deferred_req respages in the save_state callback to set up
for another NFSD operation deferral.

Signed-off-by: Andy Adamson <[email protected]>
---
 fs/nfsd/nfs4proc.c         |  129 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/sunrpc/svc.h |    1 +
 2 files changed, 130 insertions(+), 0 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 669461e..97f2d25 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -836,6 +836,135 @@ static struct nfsd4_compound_state *cstate_alloc(void)
 	return cstate;
 }
 
+/*
+ * RPC deferral callbacks
+ */
+
+void
+nfsd4_move_pages(struct page **topages, struct page **frompages, int count)
+{
+	int i;
+
+	for (i = 0; i < count; i++) {
+		topages[i] = frompages[i];
+		if (!topages[i])
+			continue;
+		get_page(topages[i]);
+	}
+}
+
+void
+nfsd4_cache_rqst_pages(struct svc_rqst *rqstp, struct page **respages,
+		       int *resused)
+{
+	*resused = rqstp->rq_resused;
+	nfsd4_move_pages(respages, rqstp->rq_respages, rqstp->rq_resused);
+}
+
+void
+nfsd4_restore_rqst_pages(struct svc_rqst *rqstp, struct page **respages,
+			 int resused)
+{
+	/* release allocated result pages to be replaced from the cache */
+	svc_free_res_pages(rqstp);
+
+	rqstp->rq_resused = resused;
+	nfsd4_move_pages(rqstp->rq_respages, respages, resused);
+}
+
+static void
+nfsd4_clear_respages(struct page **respages, int resused)
+{
+	int i;
+
+	for (i = 0; i < resused; i++) {
+		if (!respages[i])
+			continue;
+		put_page(respages[i]);
+		respages[i] = NULL;
+	}
+}
+
+/*
+ * Limit the number of pages held by any deferral to the
+ * number of pages in the maximum rpc_payload.
+ */
+static struct page **
+nfsd4_alloc_deferred_respages(struct svc_rqst *rqstp)
+{
+	struct page **new = NULL;
+	u32 maxpages = svc_max_payload(rqstp) >> PAGE_SHIFT;
+
+	new = kcalloc(rqstp->rq_resused, sizeof(struct page *), GFP_KERNEL);
+	if (!new)
+		return new;
+	spin_lock(&nfsd_serv->sv_lock);
+	if (nfsd_serv->sv_defer_pages_used + rqstp->rq_resused <= maxpages) {
+		nfsd_serv->sv_defer_pages_used += rqstp->rq_resused;
+		spin_unlock(&nfsd_serv->sv_lock);
+	} else {
+		spin_unlock(&nfsd_serv->sv_lock);
+		kfree(new);
+		new = NULL;
+	}
+	return new;
+}
+
+void
+nfsd4_return_deferred_respages(struct svc_deferred_req *dreq)
+{
+	nfsd4_clear_respages(dreq->respages, dreq->respages_used);
+	spin_lock(&nfsd_serv->sv_lock);
+	nfsd_serv->sv_defer_pages_used -= dreq->respages_used;
+	spin_unlock(&nfsd_serv->sv_lock);
+	kfree(dreq->respages);
+	dreq->respages = NULL;
+	dreq->respages_used = 0;
+}
+
+static void
+nfsd4_release_deferred_state(struct svc_deferred_req *dreq)
+{
+	nfsd4_return_deferred_respages(dreq);
+	cstate_free(dreq->defer_data);
+}
+
+static void
+nfsd4_restore_deferred_state(struct svc_rqst *rqstp,
+			     struct svc_deferred_req *dreq)
+{
+	nfsd4_restore_rqst_pages(rqstp, dreq->respages, dreq->respages_used);
+	/* Reset defer_data for a NFSD deferral revisit interrupted
+	 * by a non-NFSD deferral */
+	rqstp->rq_defer_data = dreq->defer_data;
+}
+
+static int
+nfsd4_save_deferred_state(struct svc_rqst *rqstp,
+			  struct svc_deferred_req *dreq)
+{
+	struct nfsd4_compound_state *cstate =
+		(struct nfsd4_compound_state *)rqstp->rq_defer_data;
+
+	/* From NFSD deferral on a previous operation */
+	if (dreq->respages)
+		nfsd4_return_deferred_respages(dreq);
+	dreq->respages = nfsd4_alloc_deferred_respages(rqstp);
+	if (!dreq->respages)
+		return 0;
+	dreq->respages_used = rqstp->rq_resused;
+
+	fh_put(&cstate->current_fh);
+	fh_put(&cstate->save_fh);
+
+	nfsd4_cache_rqst_pages(rqstp, dreq->respages, &dreq->respages_used);
+
+	dreq->defer_data = rqstp->rq_defer_data;
+	dreq->restore_state = nfsd4_restore_deferred_state;
+	dreq->release_state = nfsd4_release_deferred_state;
+	return 1;
+}
+
 typedef __be32(*nfsd4op_func)(struct svc_rqst *, struct nfsd4_compound_state *,
 			      void *);

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 8cc8a74..bf943b7 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -60,6 +60,7 @@ struct svc_serv {
 	unsigned int		sv_nrthreads;	/* # of server threads */
 	unsigned int		sv_max_payload;	/* datagram payload size */
 	unsigned int		sv_max_mesg;	/* max_payload + 1 page for overheads */
+	unsigned int		sv_defer_pages_used; /* deferred pages held */
 	unsigned int		sv_xdrsize;	/* XDR buffer size */
 
 	struct list_head	sv_permsocks;	/* all permanent sockets */
--
1.5.4.3


2008-10-22 18:12:30

by Andy Adamson

Subject: [PATCH 3/3] NFSD deferral processing

From: Andy Adamson <[email protected]>

Use a slab cache for nfsd4_compound_state allocation.

Save the struct nfsd4_compound_state and set the save_state callback for
each request for potential deferral handling.

If an NFSv4 operation causes a deferral, the save_state callback is called
by svc_defer, which saves the defer_data with the deferral and sets the
restore_state deferral callback.

fh_put is called so that the deferral does not hold references to the file
handles, allowing the file system to be unmounted.

Signed-off-by: Andy Adamson <[email protected]>
---
 fs/nfsd/nfs4proc.c        |   45 ++++++++++++++++++---------------------------
 fs/nfsd/nfs4state.c       |   37 +++++++++++++++++++++++++++++++++++++
 include/linux/nfsd/xdr4.h |    6 ++++++
 3 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 97f2d25..c556e77 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -813,29 +813,6 @@ static inline void nfsd4_increment_op_stats(u32 opnum)
 	nfsdstats.nfs4_opcount[opnum]++;
 }
 
-static void cstate_free(struct nfsd4_compound_state *cstate)
-{
-	if (cstate == NULL)
-		return;
-	fh_put(&cstate->current_fh);
-	fh_put(&cstate->save_fh);
-	BUG_ON(cstate->replay_owner);
-	kfree(cstate);
-}
-
-static struct nfsd4_compound_state *cstate_alloc(void)
-{
-	struct nfsd4_compound_state *cstate;
-
-	cstate = kmalloc(sizeof(struct nfsd4_compound_state), GFP_KERNEL);
-	if (cstate == NULL)
-		return NULL;
-	fh_init(&cstate->current_fh, NFS4_FHSIZE);
-	fh_init(&cstate->save_fh, NFS4_FHSIZE);
-	cstate->replay_owner = NULL;
-	return cstate;
-}
-
 /*
  * RPC deferral callbacks
  */
@@ -925,8 +902,7 @@ nfsd4_return_deferred_respages(struct svc_deferred_req *dreq)
 static void
 nfsd4_release_deferred_state(struct svc_deferred_req *dreq)
 {
-	nfsd4_return_deferred_respages(dreq);
-	cstate_free(dreq->defer_data);
+	nfsd4_cstate_free(dreq->defer_data, dreq);
 }
 
 static void
@@ -1015,12 +991,23 @@ nfsd4_proc_compound(struct svc_rqst *rqstp,
 		goto out;
 
 	status = nfserr_resource;
-	cstate = cstate_alloc();
+	cstate = nfsd4_cstate_alloc(rqstp);
 	if (cstate == NULL)
 		goto out;
 
+	if (rqstp->rq_deferred && rqstp->rq_deferred->defer_data) {
+		resp->opcnt = cstate->last_op_cnt;
+		resp->p = cstate->last_op_p;
+		fh_verify(rqstp, &cstate->current_fh, 0, NFSD_MAY_NOP);
+		fh_verify(rqstp, &cstate->save_fh, 0, NFSD_MAY_NOP);
+	}
+	/* Reset to NULL in svc_process */
+	rqstp->rq_defer_data = cstate;
+	rqstp->rq_save_state = nfsd4_save_deferred_state;
+
 	status = nfs_ok;
 	while (!status && resp->opcnt < args->opcnt) {
+		cstate->last_op_p = resp->p;
 		op = &args->ops[resp->opcnt++];
 
 		dprintk("nfsv4 compound op #%d/%d: %d (%s)\n",
@@ -1085,8 +1072,12 @@ encode_op:
 
 		nfsd4_increment_op_stats(op->opnum);
 	}
+	if (status == nfserr_dropit) {
+		cstate->last_op_cnt = resp->opcnt - 1;
+		return status;
+	}
 
-	cstate_free(cstate);
+	nfsd4_cstate_free(cstate, rqstp->rq_deferred);
 out:
 	nfsd4_release_compoundargs(args);
 	dprintk("nfsv4 compound returned %d\n", ntohl(status));
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 0cc7ff5..6ab67fc 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -90,6 +90,7 @@ static struct kmem_cache *stateowner_slab = NULL;
 static struct kmem_cache *file_slab = NULL;
 static struct kmem_cache *stateid_slab = NULL;
 static struct kmem_cache *deleg_slab = NULL;
+static struct kmem_cache *cstate_slab;
 
 void
 nfs4_lock_state(void)
@@ -441,6 +442,37 @@ static struct nfs4_client *create_client(struct xdr_netobj name, char *recdir)
 	return clp;
 }
 
+void nfsd4_cstate_free(struct nfsd4_compound_state *cstate,
+		       struct svc_deferred_req *dreq)
+{
+	if (dreq && dreq->release_state)
+		nfsd4_return_deferred_respages(dreq);
+	if (cstate == NULL)
+		return;
+	fh_put(&cstate->current_fh);
+	fh_put(&cstate->save_fh);
+	BUG_ON(cstate->replay_owner);
+	kmem_cache_free(cstate_slab, cstate);
+}
+
+struct nfsd4_compound_state *nfsd4_cstate_alloc(struct svc_rqst *rqstp)
+{
+	struct nfsd4_compound_state *cstate;
+
+	if (rqstp->rq_deferred && rqstp->rq_deferred->defer_data) {
+		cstate = rqstp->rq_deferred->defer_data;
+		goto out;
+	}
+	cstate = kmem_cache_alloc(cstate_slab, GFP_KERNEL);
+	if (cstate == NULL)
+		return NULL;
+	fh_init(&cstate->current_fh, NFS4_FHSIZE);
+	fh_init(&cstate->save_fh, NFS4_FHSIZE);
+	cstate->replay_owner = NULL;
+out:
+	return cstate;
+}
+
 static void copy_verf(struct nfs4_client *target, nfs4_verifier *source)
 {
 	memcpy(target->cl_verifier.data, source->data,
@@ -940,6 +972,7 @@ nfsd4_free_slabs(void)
 	nfsd4_free_slab(&file_slab);
 	nfsd4_free_slab(&stateid_slab);
 	nfsd4_free_slab(&deleg_slab);
+	nfsd4_free_slab(&cstate_slab);
 }
 
 static int
@@ -961,6 +994,10 @@ nfsd4_init_slabs(void)
 			sizeof(struct nfs4_delegation), 0, 0, NULL);
 	if (deleg_slab == NULL)
 		goto out_nomem;
+	cstate_slab = kmem_cache_create("nfsd4_compound_states",
+			sizeof(struct nfsd4_compound_state), 0, 0, NULL);
+	if (cstate_slab == NULL)
+		goto out_nomem;
 	return 0;
 out_nomem:
 	nfsd4_free_slabs();
diff --git a/include/linux/nfsd/xdr4.h b/include/linux/nfsd/xdr4.h
index 27bd3e3..ced602c 100644
--- a/include/linux/nfsd/xdr4.h
+++ b/include/linux/nfsd/xdr4.h
@@ -48,6 +48,8 @@ struct nfsd4_compound_state {
 	struct svc_fh		current_fh;
 	struct svc_fh		save_fh;
 	struct nfs4_stateowner	*replay_owner;
+	__be32			*last_op_p;
+	u32			last_op_cnt;
 };
 
 struct nfsd4_change_info {
@@ -442,6 +444,10 @@ void nfsd4_encode_replay(struct nfsd4_compoundres *resp, struct nfsd4_op *op);
 __be32 nfsd4_encode_fattr(struct svc_fh *fhp, struct svc_export *exp,
 		       struct dentry *dentry, __be32 *buffer, int *countp,
 		       u32 *bmval, struct svc_rqst *, int ignore_crossmnt);
+extern void nfsd4_return_deferred_respages(struct svc_deferred_req *dreq);
+extern struct nfsd4_compound_state *nfsd4_cstate_alloc(struct svc_rqst *rqstp);
+extern void nfsd4_cstate_free(struct nfsd4_compound_state *cstate,
+			      struct svc_deferred_req *dreq);
 extern __be32 nfsd4_setclientid(struct svc_rqst *rqstp,
 				struct nfsd4_compound_state *,
 				struct nfsd4_setclientid *setclid);
--
1.5.4.3


2008-10-17 20:30:44

by Talpey, Thomas

Subject: Re: [PATCH 0/3] NFSD EOS deferral

At 02:59 PM 10/17/2008, Marc Eshel wrote:
>[email protected] wrote on 10/17/2008 10:44:54 AM:
>
>> "J. Bruce Fields" <[email protected]>
>> Requests longer than a page are still not deferred, so large writes that
>> trigger upcalls still get an ERR_DELAY. OK, probably no big deal.
>>
>> I don't think we can apply this until we have some way to track the
>> number and size of deferred requests outstanding and fall back on
>> ERR_DELAY if it's too much.
>
>But I thought that the problem here is that the Linux NFS client doesn't
>handle this return code properly.

Definitely this is an issue. Early clients do one of two things: they either
pass the error back to the application, or they enter a buzz loop, resending
the operation with no delay. Later clients back off, but for a constant
five seconds. Either way, the server is generally better off gritting its
teeth and completing the operation.

Blocking server threads is drastic, but in effect it will stall the client
queues and "push back". The issue on Linux is the small number of
nfsd contexts involved. It could lead to significant problems, possibly
including a DoS attack. Dropping connections (judiciously) could be
used instead of blocking the last few threads, though even that will
have consequences.

The easy way to test all this is to decorate /etc/exports with lots of
names, then break the nameservice and start sending requests from
many new clients. It's very hard to get it all right.

Tom.


2008-10-17 20:36:39

by J. Bruce Fields

Subject: Re: [PATCH 0/3] NFSD EOS deferral

On Fri, Oct 17, 2008 at 04:26:18PM -0400, Talpey, Thomas wrote:
> At 02:59 PM 10/17/2008, Marc Eshel wrote:
> >[email protected] wrote on 10/17/2008 10:44:54 AM:
> >
> >> "J. Bruce Fields" <[email protected]>
> >> Requests longer than a page are still not deferred, so large writes that
> >> trigger upcalls still get an ERR_DELAY. OK, probably no big deal.
> >>
> >> I don't think we can apply this until we have some way to track the
> >> number and size of deferred requests outstanding and fall back on
> >> ERR_DELAY if it's too much.
> >
> >But I thought that the problem here is that the Linux NFS client doesn't
> >handle this return code properly.
>
> Definitely this is an issue. Early clients do one of two things, they either
> pass the error back to the application, or they enter a buzz loop resending
> the operation with no delay. Later clients back off, but for a constant
> five seconds.

I haven't tested it, but from fs/nfs/nfs4proc.c:nfs4_delay() it appears
to start at a tenth of a second and then do exponential backoff (up to
15 seconds). Looks to me like the code's been that way since at least
2.6.19.
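
For reference, a sketch of that logic (not the verbatim kernel source;
the constant names are from memory):

#define NFS4_POLL_RETRY_MIN	(HZ / 10)	/* 0.1s */
#define NFS4_POLL_RETRY_MAX	(15 * HZ)	/* 15s  */

static int nfs4_delay_sketch(long *timeout)
{
	if (*timeout <= 0)
		*timeout = NFS4_POLL_RETRY_MIN;
	if (*timeout > NFS4_POLL_RETRY_MAX)
		*timeout = NFS4_POLL_RETRY_MAX;
	schedule_timeout_killable(*timeout);	/* sleep for *timeout jiffies */
	*timeout <<= 1;				/* double the delay each retry */
	return 0;
}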

--b.

> Either way, the server is generally better off gritting its
> teeth and completing the operation.
>
> Blocking server threads is drastic, but in effect it will stall the client
> queues and "push back". The issue on Linux is the small number of
> nfsd contexts involved. It could lead to significant issues possibly
> including DOS attack. Dropping connections (judiciously) could be
> used instead of blocking the last few threads, though even that will
> have consequences.
>
> The easy way to test all this is decorate /etc/exports with lots of
> names, then break the nameservice and start sending requests from
> many new clients. It's very hard to get it all right.
>
> Tom.
>

2008-10-17 20:51:38

by Talpey, Thomas

Subject: Re: [PATCH 0/3] NFSD EOS deferral

At 04:36 PM 10/17/2008, J. Bruce Fields wrote:
>On Fri, Oct 17, 2008 at 04:26:18PM -0400, Talpey, Thomas wrote:
>> At 02:59 PM 10/17/2008, Marc Eshel wrote:
>> >[email protected] wrote on 10/17/2008 10:44:54 AM:
>> >
>> >> "J. Bruce Fields" <[email protected]>
>> >> Requests longer than a page are still not deferred, so large writes that
>> >> trigger upcalls still get an ERR_DELAY. OK, probably no big deal.
>> >>
>> >> I don't think we can apply this until we have some way to track the
>> >> number and size of deferred requests outstanding and fall back on
>> >> ERR_DELAY if it's too much.
>> >
>> >But I thought that the problem here is that the Linux NFS client doesn't
>> >handle this return code properly.
>>
>> Definitely this is an issue. Early clients do one of two things, they either
>> pass the error back to the application, or they enter a buzz loop resending
>> the operation with no delay. Later clients back off, but for a constant
>> five seconds.
>
>I haven't tested it, but from fs/nfs/nfs4proc.c:nfs4_delay() it appears
>to start at a tenth of a second and then do exponential backoff (up to
>15 seconds). Looks to me like the code's been that way since at least
>2.6.19.

I was referring to NFSv3, actually - also impacted by this codepath.

But I'll take the opportunity to point out that we'll get 5 retries from
an NFSv4 client before 2 seconds go by, and only one from NFSv3
in twice that. In either case, it's a heck of a bad trade to return "I'm
busy" only to have your bell rung repeatedly in response.

Sorry, I have always hated EJUKEBOX.

Tom.


>
>--b.
>
>> Either way, the server is generally better off gritting its
>> teeth and completing the operation.
>>
>> Blocking server threads is drastic, but in effect it will stall the client
>> queues and "push back". The issue on Linux is the small number of
>> nfsd contexts involved. It could lead to significant issues possibly
>> including DOS attack. Dropping connections (judiciously) could be
>> used instead of blocking the last few threads, though even that will
>> have consequences.
>>
>> The easy way to test all this is decorate /etc/exports with lots of
>> names, then break the nameservice and start sending requests from
>> many new clients. It's very hard to get it all right.
>>
>> Tom.
>>