2021-09-29 00:57:48

by Dai Ngo

Subject: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


Hi Bruce,

This series of patches implements the NFSv4 Courteous Server.

A server which does not immediately expunge the state on lease expiration
is known as a Courteous Server. A Courteous Server continues to recognize
previously generated state tokens as valid until conflict arises between
the expired state and the requests from another client, or the server
reboots.

The v2 patch includes the following:

. add new callback, lm_expire_lock, to lock_manager_operations to
allow the lock manager to take appropriate action on conflicting locks.

. handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.

. expire courtesy client after 24hr if client has not reconnected.

. do not allow expired client to become courtesy client if there are
waiters for client's locks.

. modify client_info_show to show courtesy client and seconds from
last renew.

. fix a problem with the NFSv4.1 server where it keeps returning
SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after
the courtesy client re-connects, causing the client to keep sending
BCTS requests to the server.

The v3 patch includes the following:

. modified posix_test_lock to check and resolve conflict locks
to handle NLM TEST and NFSv4 LOCKT requests.

. separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.

The v4 patch includes:

. rework nfsd_check_courtesy to avoid deadlock of fl_lock and client_lock
by asking the laundromat thread to destroy the courtesy client.

. handle NFSv4 share reservation conflicts with courtesy client. This
includes conflicts between access mode and deny mode and vice versa.

. drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.

The v5 patch includes:

. fix recursive locking of file_rwsem from posix_lock_file.

. retest with LOCKDEP enabled.

NOTE: I will submit pynfs tests for the courteous server, including tests
for share reservation conflicts, in a separate patch.



2021-09-29 00:59:56

by Dai Ngo

Subject: [PATCH RFC v5 2/2] nfsd: Initial implementation of NFSv4 Courteous Server

Currently an NFSv4 client must maintain its lease by using at least
one of its state tokens or, if nothing else, by issuing a RENEW (4.0) or
a singleton SEQUENCE (4.1) at least once during each lease period. If the
client fails to renew the lease, for any reason, the Linux server expunges
the state tokens immediately upon detection of the "failure to renew the
lease" condition and begins returning NFS4ERR_EXPIRED if the client should
reconnect and attempt to use the (now) expired state.

The default lease period for the Linux server is 90 seconds. The typical
client cuts that in half and issues a lease-renewing operation every
45 seconds. A 90-second lease period is very short considering the
potential for moderately long-term network partitions. A network partition
refers to any loss of network connectivity between the NFS client and the
NFS server, regardless of its root cause. This includes NIC failures, NIC
driver bugs, network misconfigurations & administrative errors, routers &
switches crashing and/or having software updates applied, even down to
cables being physically pulled. In most cases, these network failures are
transient, although the duration is unknown.

A server which does not immediately expunge the state on lease expiration
is known as a Courteous Server. A Courteous Server continues to recognize
previously generated state tokens as valid until conflict arises between
the expired state and the requests from another client, or the server
reboots.

The initial implementation of the Courteous Server will do the following:

. when the laundromat thread detects an expired client, and that client
still has established state on the Linux server and there are no waiters
for the client's locks, mark the client as a COURTESY_CLIENT and skip
destroying the client and all its states; otherwise destroy the client as
usual (see the sketch after this list).

. when an OPEN request conflicts with a COURTESY_CLIENT, destroy the
expired client and all its states, skip the delegation recall, then allow
the conflicting request to succeed.

. when a LOCK or LOCKT, NLM LOCK or TEST, or local lock request conflicts
with a COURTESY_CLIENT, destroy the expired client and all its states,
then allow the conflicting request to succeed.
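
For illustration only, the laundromat decision described in the first
bullet can be modeled with the standalone sketch below. This is not
kernel code: the struct, fields, and helper names are invented for the
example; the real logic lives in the nfs4_laundromat() changes in the
diff that follows.

/*
 * Standalone model (not kernel code) of the laundromat decision:
 * an expired client that still holds state and has no lock waiters
 * is marked as a courtesy client instead of being destroyed; a
 * courtesy client is destroyed once a conflict has been flagged or
 * its 24-hour grace window runs out.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct client {
	bool lease_expired;     /* lease not renewed in time */
	bool has_state;         /* opens/locks/delegations remain */
	bool has_lock_waiters;  /* someone is blocked on its locks */
	bool courtesy;          /* NFSD4_COURTESY_CLIENT analogue */
	bool destroy_courtesy;  /* NFSD4_DESTROY_COURTESY_CLIENT analogue */
	time_t courtesy_expiry; /* absolute deadline, 24h after expiry */
};

enum action { KEEP, MAKE_COURTESY, DESTROY };

static enum action laundromat_decide(struct client *c, time_t now)
{
	if (c->destroy_courtesy)          /* a conflict was detected */
		return DESTROY;
	if (c->courtesy)
		return now >= c->courtesy_expiry ? DESTROY : KEEP;
	if (!c->lease_expired)
		return KEEP;
	if (c->has_state && !c->has_lock_waiters) {
		c->courtesy = true;
		c->courtesy_expiry = now + 24 * 60 * 60;
		return MAKE_COURTESY;
	}
	return DESTROY;                   /* expire as usual */
}

int main(void)
{
	struct client c = { .lease_expired = true, .has_state = true };
	time_t now = time(NULL);

	/* first pass marks the client courtesy (prints 1) */
	printf("first pass: %d\n", laundromat_decide(&c, now));
	/* a pass after the 24h window destroys it (prints 2) */
	printf("after 24h:  %d\n", laundromat_decide(&c, now + 86401));
	return 0;
}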

Signed-off-by: Dai Ngo <[email protected]>
---
fs/nfsd/nfs4state.c | 269 +++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/nfsd/state.h | 3 +
2 files changed, 269 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 3f4027a5de88..f6d835d6e99b 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -125,6 +125,11 @@ static void free_session(struct nfsd4_session *);
static const struct nfsd4_callback_ops nfsd4_cb_recall_ops;
static const struct nfsd4_callback_ops nfsd4_cb_notify_lock_ops;

+static struct workqueue_struct *laundry_wq;
+static void laundromat_main(struct work_struct *);
+
+static int courtesy_client_expiry = (24 * 60 * 60); /* in secs */
+
static bool is_session_dead(struct nfsd4_session *ses)
{
return ses->se_flags & NFS4_SESSION_DEAD;
@@ -172,6 +177,7 @@ renew_client_locked(struct nfs4_client *clp)

list_move_tail(&clp->cl_lru, &nn->client_lru);
clp->cl_time = ktime_get_boottime_seconds();
+ clear_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags);
}

static void put_client_renew_locked(struct nfs4_client *clp)
@@ -2389,6 +2395,10 @@ static int client_info_show(struct seq_file *m, void *v)
seq_puts(m, "status: confirmed\n");
else
seq_puts(m, "status: unconfirmed\n");
+ seq_printf(m, "courtesy client: %s\n",
+ test_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags) ? "yes" : "no");
+ seq_printf(m, "seconds from last renew: %lld\n",
+ ktime_get_boottime_seconds() - clp->cl_time);
seq_printf(m, "name: ");
seq_quote_mem(m, clp->cl_name.data, clp->cl_name.len);
seq_printf(m, "\nminor version: %d\n", clp->cl_minorversion);
@@ -4662,6 +4672,33 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
nfsd4_run_cb(&dp->dl_recall);
}

+/*
+ * This function is called when a file is opened and there is a
+ * delegation conflict with another client. If the other client
+ * is a courtesy client then kick start the laundromat to destroy
+ * it.
+ */
+static bool
+nfsd_check_courtesy_client(struct nfs4_delegation *dp)
+{
+ struct svc_rqst *rqst;
+ struct nfs4_client *clp = dp->dl_recall.cb_clp;
+ struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
+
+ if (!i_am_nfsd())
+ goto out;
+ rqst = kthread_data(current);
+ if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
+ return false;
+out:
+ if (test_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags)) {
+ set_bit(NFSD4_DESTROY_COURTESY_CLIENT, &clp->cl_flags);
+ mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
+ return true;
+ }
+ return false;
+}
+
/* Called from break_lease() with i_lock held. */
static bool
nfsd_break_deleg_cb(struct file_lock *fl)
@@ -4670,6 +4707,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
struct nfs4_file *fp = dp->dl_stid.sc_file;

+ if (nfsd_check_courtesy_client(dp))
+ return false;
trace_nfsd_cb_recall(&dp->dl_stid);

/*
@@ -4912,6 +4951,104 @@ nfsd4_truncate(struct svc_rqst *rqstp, struct svc_fh *fh,
return nfsd_setattr(rqstp, fh, &iattr, 0, (time64_t)0);
}

+static bool
+__nfs4_check_deny_bmap(struct nfs4_file *fp, struct nfs4_ol_stateid *stp,
+ u32 access, bool share_access)
+{
+ if (share_access) {
+ if (!stp->st_deny_bmap)
+ return false;
+
+ if ((stp->st_deny_bmap & (1 << NFS4_SHARE_DENY_BOTH)) ||
+ (access & NFS4_SHARE_ACCESS_READ &&
+ stp->st_deny_bmap & (1 << NFS4_SHARE_DENY_READ)) ||
+ (access & NFS4_SHARE_ACCESS_WRITE &&
+ stp->st_deny_bmap & (1 << NFS4_SHARE_DENY_WRITE))) {
+ return true;
+ }
+ return false;
+ }
+ if ((access & NFS4_SHARE_DENY_BOTH) ||
+ (access & NFS4_SHARE_DENY_READ &&
+ stp->st_access_bmap & (1 << NFS4_SHARE_ACCESS_READ)) ||
+ (access & NFS4_SHARE_DENY_WRITE &&
+ stp->st_access_bmap & (1 << NFS4_SHARE_ACCESS_WRITE))) {
+ return true;
+ }
+ return false;
+}
+
+/*
+ * access: access mode if share_access is true else is deny mode
+ */
+static bool
+nfs4_check_deny_bmap(struct nfs4_client *clp, struct nfs4_file *fp,
+ u32 access, bool share_access)
+{
+ int i;
+ struct nfs4_openowner *oo;
+ struct nfs4_stateowner *so, *tmp;
+ struct nfs4_ol_stateid *stp, *stmp;
+
+ spin_lock(&clp->cl_lock);
+ for (i = 0; i < OWNER_HASH_SIZE; i++) {
+ list_for_each_entry_safe(so, tmp, &clp->cl_ownerstr_hashtbl[i],
+ so_strhash) {
+ if (!so->so_is_open_owner)
+ continue;
+ oo = openowner(so);
+ list_for_each_entry_safe(stp, stmp,
+ &oo->oo_owner.so_stateids, st_perstateowner) {
+ if (__nfs4_check_deny_bmap(fp, stp, access,
+ share_access)) {
+ spin_unlock(&clp->cl_lock);
+ return true;
+ }
+ }
+ }
+ }
+ spin_unlock(&clp->cl_lock);
+ return false;
+}
+
+static int
+nfs4_destroy_clnts_with_sresv_conflict(struct svc_rqst *rqstp,
+ struct nfs4_client *clp, struct nfs4_file *fp,
+ u32 access, bool share_access)
+{
+ int cnt = 0;
+ struct nfs4_client *cl;
+ struct list_head *pos, *next, reaplist;
+ struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
+
+ INIT_LIST_HEAD(&reaplist);
+ spin_lock(&nn->client_lock);
+ list_for_each_safe(pos, next, &nn->client_lru) {
+ cl = list_entry(pos, struct nfs4_client, cl_lru);
+ if (cl == clp ||
+ (!test_bit(NFSD4_COURTESY_CLIENT, &cl->cl_flags) &&
+ !test_bit(NFSD4_DESTROY_COURTESY_CLIENT, &cl->cl_flags)))
+ continue;
+ /*
+ * check CS client for deny_bmap that
+ * matches nfs4_file and access mode
+ */
+ if (nfs4_check_deny_bmap(cl, fp, access, share_access)) {
+ if (mark_client_expired_locked(cl))
+ continue;
+ list_add(&cl->cl_lru, &reaplist);
+ }
+ }
+ spin_unlock(&nn->client_lock);
+ list_for_each_safe(pos, next, &reaplist) {
+ cl = list_entry(pos, struct nfs4_client, cl_lru);
+ list_del_init(&cl->cl_lru);
+ expire_client(cl);
+ cnt++;
+ }
+ return cnt;
+}
+
static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp,
struct nfsd4_open *open)
@@ -4921,6 +5058,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
int oflag = nfs4_access_to_omode(open->op_share_access);
int access = nfs4_access_to_access(open->op_share_access);
unsigned char old_access_bmap, old_deny_bmap;
+ int cnt = 0;

spin_lock(&fp->fi_lock);

@@ -4928,16 +5066,46 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
* Are we trying to set a deny mode that would conflict with
* current access?
*/
+chk_deny:
status = nfs4_file_check_deny(fp, open->op_share_deny);
if (status != nfs_ok) {
spin_unlock(&fp->fi_lock);
+ if (status != nfserr_share_denied)
+ goto out;
+ /*
+ * search and destroy all courtesy clients that own
+ * openstate of nfs4_file, fp, that has st_access_bmap
+ * that matches open->op_share_deny
+ */
+ cnt = nfs4_destroy_clnts_with_sresv_conflict(rqstp,
+ stp->st_stateowner->so_client, fp,
+ open->op_share_deny, false);
+ if (cnt) {
+ spin_lock(&fp->fi_lock);
+ goto chk_deny;
+ }
goto out;
}

/* set access to the file */
+get_access:
status = nfs4_file_get_access(fp, open->op_share_access);
if (status != nfs_ok) {
spin_unlock(&fp->fi_lock);
+ if (status != nfserr_share_denied)
+ goto out;
+ /*
+ * search and destroy all courtesy clients that own
+ * openstate of nfs4_file, fp, that has st_deny_bmap
+ * that matches open->op_share_access.
+ */
+ cnt = nfs4_destroy_clnts_with_sresv_conflict(rqstp,
+ stp->st_stateowner->so_client, fp,
+ open->op_share_access, true);
+ if (cnt) {
+ spin_lock(&fp->fi_lock);
+ goto get_access;
+ }
goto out;
}

@@ -5289,6 +5457,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
*/
}

+static bool
+nfs4_destroy_courtesy_client(struct nfs4_client *clp)
+{
+ struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
+
+ spin_lock(&nn->client_lock);
+ if (!test_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags) ||
+ mark_client_expired_locked(clp)) {
+ spin_unlock(&nn->client_lock);
+ return false;
+ }
+ spin_unlock(&nn->client_lock);
+ expire_client(clp);
+ return true;
+}
+
__be32
nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
{
@@ -5572,6 +5756,47 @@ static void nfsd4_ssc_expire_umount(struct nfsd_net *nn)
}
#endif

+static
+bool nfs4_anylock_conflict(struct nfs4_client *clp)
+{
+ int i;
+ struct nfs4_stateowner *so, *tmp;
+ struct nfs4_lockowner *lo;
+ struct nfs4_ol_stateid *stp;
+ struct nfs4_file *nf;
+ struct inode *ino;
+ struct file_lock_context *ctx;
+ struct file_lock *fl;
+
+ for (i = 0; i < OWNER_HASH_SIZE; i++) {
+ /* scan each lock owner */
+ list_for_each_entry_safe(so, tmp, &clp->cl_ownerstr_hashtbl[i],
+ so_strhash) {
+ if (so->so_is_open_owner)
+ continue;
+
+ /* scan lock states of this lock owner */
+ lo = lockowner(so);
+ list_for_each_entry(stp, &lo->lo_owner.so_stateids,
+ st_perstateowner) {
+ nf = stp->st_stid.sc_file;
+ ino = nf->fi_inode;
+ ctx = ino->i_flctx;
+ if (!ctx)
+ continue;
+ /* check each lock belongs to this lock state */
+ list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
+ if (fl->fl_owner != lo)
+ continue;
+ if (!list_empty(&fl->fl_blocked_requests))
+ return true;
+ }
+ }
+ }
+ }
+ return false;
+}
+
static time64_t
nfs4_laundromat(struct nfsd_net *nn)
{
@@ -5587,7 +5812,9 @@ nfs4_laundromat(struct nfsd_net *nn)
};
struct nfs4_cpntf_state *cps;
copy_stateid_t *cps_t;
+ struct nfs4_stid *stid;
int i;
+ int id = 0;

if (clients_still_reclaiming(nn)) {
lt.new_timeo = 0;
@@ -5608,8 +5835,33 @@ nfs4_laundromat(struct nfsd_net *nn)
spin_lock(&nn->client_lock);
list_for_each_safe(pos, next, &nn->client_lru) {
clp = list_entry(pos, struct nfs4_client, cl_lru);
+ if (test_bit(NFSD4_DESTROY_COURTESY_CLIENT, &clp->cl_flags)) {
+ clear_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags);
+ goto exp_client;
+ }
+ if (test_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags)) {
+ if (ktime_get_boottime_seconds() >= clp->courtesy_client_expiry)
+ goto exp_client;
+ /*
+ * after umount, v4.0 client is still
+ * around waiting to be expired
+ */
+ if (clp->cl_minorversion)
+ continue;
+ }
if (!state_expired(&lt, clp->cl_time))
break;
+ spin_lock(&clp->cl_lock);
+ stid = idr_get_next(&clp->cl_stateids, &id);
+ spin_unlock(&clp->cl_lock);
+ if (stid && !nfs4_anylock_conflict(clp)) {
+ /* client still has states */
+ clp->courtesy_client_expiry =
+ ktime_get_boottime_seconds() + courtesy_client_expiry;
+ set_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags);
+ continue;
+ }
+exp_client:
if (mark_client_expired_locked(clp))
continue;
list_add(&clp->cl_lru, &reaplist);
@@ -5689,9 +5941,6 @@ nfs4_laundromat(struct nfsd_net *nn)
return max_t(time64_t, lt.new_timeo, NFSD_LAUNDROMAT_MINTIMEOUT);
}

-static struct workqueue_struct *laundry_wq;
-static void laundromat_main(struct work_struct *);
-
static void
laundromat_main(struct work_struct *laundry)
{
@@ -6496,6 +6745,19 @@ nfs4_transform_lock_offset(struct file_lock *lock)
lock->fl_end = OFFSET_MAX;
}

+/* return true if lock was expired else return false */
+static bool
+nfsd4_fl_expire_lock(struct file_lock *fl, bool testonly)
+{
+ struct nfs4_lockowner *lo = (struct nfs4_lockowner *)fl->fl_owner;
+ struct nfs4_client *clp = lo->lo_owner.so_client;
+
+ if (testonly)
+ return test_bit(NFSD4_COURTESY_CLIENT, &clp->cl_flags) ?
+ true : false;
+ return nfs4_destroy_courtesy_client(clp);
+}
+
static fl_owner_t
nfsd4_fl_get_owner(fl_owner_t owner)
{
@@ -6543,6 +6805,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops = {
.lm_notify = nfsd4_lm_notify,
.lm_get_owner = nfsd4_fl_get_owner,
.lm_put_owner = nfsd4_fl_put_owner,
+ .lm_expire_lock = nfsd4_fl_expire_lock,
};

static inline void
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index e73bdbb1634a..93e30b101578 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -345,6 +345,8 @@ struct nfs4_client {
#define NFSD4_CLIENT_UPCALL_LOCK (5) /* upcall serialization */
#define NFSD4_CLIENT_CB_FLAG_MASK (1 << NFSD4_CLIENT_CB_UPDATE | \
1 << NFSD4_CLIENT_CB_KILL)
+#define NFSD4_COURTESY_CLIENT (6) /* be nice to expired client */
+#define NFSD4_DESTROY_COURTESY_CLIENT (7)
unsigned long cl_flags;
const struct cred *cl_cb_cred;
struct rpc_clnt *cl_cb_client;
@@ -385,6 +387,7 @@ struct nfs4_client {
struct list_head async_copies; /* list of async copies */
spinlock_t async_lock; /* lock for async copies */
atomic_t cl_cb_inflight; /* Outstanding callbacks */
+ int courtesy_client_expiry;
};

/* struct nfs4_client_reset
--
2.9.5

2021-10-01 21:09:37

by J. Bruce Fields

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
>
> Hi Bruce,
>
> This series of patches implement the NFSv4 Courteous Server.

Apologies, I keep meaning to get back to this and haven't yet.

I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.

--b.

>
> A server which does not immediately expunge the state on lease expiration
> is known as a Courteous Server. A Courteous Server continues to recognize
> previously generated state tokens as valid until conflict arises between
> the expired state and the requests from another client, or the server
> reboots.
>
> The v2 patch includes the following:
>
> . add new callback, lm_expire_lock, to lock_manager_operations to
> allow the lock manager to take appropriate action with conflict lock.
>
> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.
>
> . expire courtesy client after 24hr if client has not reconnected.
>
> . do not allow expired client to become courtesy client if there are
> waiters for client's locks.
>
> . modify client_info_show to show courtesy client and seconds from
> last renew.
>
> . fix a problem with NFSv4.1 server where the it keeps returning
> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after
> the courtesy client re-connects, causing the client to keep sending
> BCTS requests to server.
>
> The v3 patch includes the following:
>
> . modified posix_test_lock to check and resolve conflict locks
> to handle NLM TEST and NFSv4 LOCKT requests.
>
> . separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>
> The v4 patch includes:
>
> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and client_lock
> by asking the laudromat thread to destroy the courtesy client.
>
> . handle NFSv4 share reservation conflicts with courtesy client. This
> includes conflicts between access mode and deny mode and vice versa.
>
> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>
> The v5 patch includes:
>
> . fix recursive locking of file_rwsem from posix_lock_file.
>
> . retest with LOCKDEP enabled.
>
> NOTE: I will submit pynfs tests for courteous server including tests
> for share reservation conflicts in a separate patch.
>

2021-10-01 21:52:09

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 10/1/21 1:53 PM, J. Bruce Fields wrote:
> On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
>> Hi Bruce,
>>
>> This series of patches implement the NFSv4 Courteous Server.
> Apologies, I keep meaning to get back to this and haven't yet.
>
> I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.

It's weird, this test passes on my system:


[root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18
INIT st_setclientid.testValid : RUNNING
INIT st_setclientid.testValid : PASS
MKFILE st_open.testOpen : RUNNING
MKFILE st_open.testOpen : PASS
OPEN18 st_open.testShareConflict1 : RUNNING
OPEN18 st_open.testShareConflict1 : PASS
**************************************************
INIT st_setclientid.testValid : PASS
OPEN18 st_open.testShareConflict1 : PASS
MKFILE st_open.testOpen : PASS
**************************************************
Command line asked for 3 of 673 tests
Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed
[root@nfsvmf25 nfs4.0]#

Do you have a network trace?

-Dai

>
> --b.
>
>> A server which does not immediately expunge the state on lease expiration
>> is known as a Courteous Server. A Courteous Server continues to recognize
>> previously generated state tokens as valid until conflict arises between
>> the expired state and the requests from another client, or the server
>> reboots.
>>
>> The v2 patch includes the following:
>>
>> . add new callback, lm_expire_lock, to lock_manager_operations to
>> allow the lock manager to take appropriate action with conflict lock.
>>
>> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.
>>
>> . expire courtesy client after 24hr if client has not reconnected.
>>
>> . do not allow expired client to become courtesy client if there are
>> waiters for client's locks.
>>
>> . modify client_info_show to show courtesy client and seconds from
>> last renew.
>>
>> . fix a problem with NFSv4.1 server where the it keeps returning
>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after
>> the courtesy client re-connects, causing the client to keep sending
>> BCTS requests to server.
>>
>> The v3 patch includes the following:
>>
>> . modified posix_test_lock to check and resolve conflict locks
>> to handle NLM TEST and NFSv4 LOCKT requests.
>>
>> . separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>>
>> The v4 patch includes:
>>
>> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and client_lock
>> by asking the laudromat thread to destroy the courtesy client.
>>
>> . handle NFSv4 share reservation conflicts with courtesy client. This
>> includes conflicts between access mode and deny mode and vice versa.
>>
>> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>>
>> The v5 patch includes:
>>
>> . fix recursive locking of file_rwsem from posix_lock_file.
>>
>> . retest with LOCKDEP enabled.
>>
>> NOTE: I will submit pynfs tests for courteous server including tests
>> for share reservation conflicts in a separate patch.
>>

2021-10-01 23:04:52

by J. Bruce Fields

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Fri, Oct 01, 2021 at 02:41:55PM -0700, [email protected] wrote:
>
> On 10/1/21 1:53 PM, J. Bruce Fields wrote:
> >On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
> >>Hi Bruce,
> >>
> >>This series of patches implement the NFSv4 Courteous Server.
> >Apologies, I keep meaning to get back to this and haven't yet.
> >
> >I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.
>
> It's weird, this test passes on my system:
>
>
> [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18
> INIT st_setclientid.testValid : RUNNING
> INIT st_setclientid.testValid : PASS
> MKFILE st_open.testOpen : RUNNING
> MKFILE st_open.testOpen : PASS
> OPEN18 st_open.testShareConflict1 : RUNNING
> OPEN18 st_open.testShareConflict1 : PASS
> **************************************************
> INIT st_setclientid.testValid : PASS
> OPEN18 st_open.testShareConflict1 : PASS
> MKFILE st_open.testOpen : PASS
> **************************************************
> Command line asked for 3 of 673 tests
> Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed
> [root@nfsvmf25 nfs4.0]#
>
> Do you have a network trace?

Yeah, weirdly, I think it's failing only when I run it with all the
other pynfs tests, not when I run it alone. I'll check again and see if
I can get a trace, probably next week.

--b.

2021-11-16 23:06:43

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

Hi Bruce,

Just a reminder that this patch is still waiting for your review.

Thanks,
-Dai

On 10/1/21 2:41 PM, [email protected] wrote:
>
> On 10/1/21 1:53 PM, J. Bruce Fields wrote:
>> On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
>>> Hi Bruce,
>>>
>>> This series of patches implement the NFSv4 Courteous Server.
>> Apologies, I keep meaning to get back to this and haven't yet.
>>
>> I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.
>
> It's weird, this test passes on my system:
>
>
> [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18
> INIT     st_setclientid.testValid : RUNNING
> INIT     st_setclientid.testValid : PASS
> MKFILE   st_open.testOpen : RUNNING
> MKFILE   st_open.testOpen : PASS
> OPEN18   st_open.testShareConflict1 : RUNNING
> OPEN18   st_open.testShareConflict1 : PASS
> **************************************************
> INIT     st_setclientid.testValid : PASS
> OPEN18   st_open.testShareConflict1 : PASS
> MKFILE   st_open.testOpen : PASS
> **************************************************
> Command line asked for 3 of 673 tests
> Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed
> [root@nfsvmf25 nfs4.0]#
>
> Do you have a network trace?
>
> -Dai
>
>>
>> --b.
>>
>>> A server which does not immediately expunge the state on lease
>>> expiration
>>> is known as a Courteous Server.  A Courteous Server continues to
>>> recognize
>>> previously generated state tokens as valid until conflict arises
>>> between
>>> the expired state and the requests from another client, or the server
>>> reboots.
>>>
>>> The v2 patch includes the following:
>>>
>>> . add new callback, lm_expire_lock, to lock_manager_operations to
>>>    allow the lock manager to take appropriate action with conflict
>>> lock.
>>>
>>> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.
>>>
>>> . expire courtesy client after 24hr if client has not reconnected.
>>>
>>> . do not allow expired client to become courtesy client if there are
>>>    waiters for client's locks.
>>>
>>> . modify client_info_show to show courtesy client and seconds from
>>>    last renew.
>>>
>>> . fix a problem with NFSv4.1 server where the it keeps returning
>>>    SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after
>>>    the courtesy client re-connects, causing the client to keep sending
>>>    BCTS requests to server.
>>>
>>> The v3 patch includes the following:
>>>
>>> . modified posix_test_lock to check and resolve conflict locks
>>>    to handle NLM TEST and NFSv4 LOCKT requests.
>>>
>>> . separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>>>
>>> The v4 patch includes:
>>>
>>> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and
>>> client_lock
>>>    by asking the laudromat thread to destroy the courtesy client.
>>>
>>> . handle NFSv4 share reservation conflicts with courtesy client. This
>>>    includes conflicts between access mode and deny mode and vice versa.
>>>
>>> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>>>
>>> The v5 patch includes:
>>>
>>> . fix recursive locking of file_rwsem from posix_lock_file.
>>>
>>> . retest with LOCKDEP enabled.
>>>
>>> NOTE: I will submit pynfs tests for courteous server including tests
>>> for share reservation conflicts in a separate patch.
>>>

2021-11-17 14:14:37

by J. Bruce Fields

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
> Just a reminder that this patch is still waiting for your review.

Yeah, I was procrastinating and hoping you'd figure out the pynfs
failure for me.... I'll see if I can get some time today. --b.

>
> Thanks,
> -Dai
>
> On 10/1/21 2:41 PM, [email protected] wrote:
> >
> >On 10/1/21 1:53 PM, J. Bruce Fields wrote:
> >>On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
> >>>Hi Bruce,
> >>>
> >>>This series of patches implement the NFSv4 Courteous Server.
> >>Apologies, I keep meaning to get back to this and haven't yet.
> >>
> >>I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.
> >
> >It's weird, this test passes on my system:
> >
> >
> >[root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18
> >INIT     st_setclientid.testValid : RUNNING
> >INIT     st_setclientid.testValid : PASS
> >MKFILE   st_open.testOpen : RUNNING
> >MKFILE   st_open.testOpen : PASS
> >OPEN18   st_open.testShareConflict1 : RUNNING
> >OPEN18   st_open.testShareConflict1 : PASS
> >**************************************************
> >INIT     st_setclientid.testValid : PASS
> >OPEN18   st_open.testShareConflict1 : PASS
> >MKFILE   st_open.testOpen : PASS
> >**************************************************
> >Command line asked for 3 of 673 tests
> >Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed
> >[root@nfsvmf25 nfs4.0]#
> >
> >Do you have a network trace?
> >
> >-Dai
> >
> >>
> >>--b.
> >>
> >>>A server which does not immediately expunge the state on lease
> >>>expiration
> >>>is known as a Courteous Server.  A Courteous Server continues
> >>>to recognize
> >>>previously generated state tokens as valid until conflict
> >>>arises between
> >>>the expired state and the requests from another client, or the server
> >>>reboots.
> >>>
> >>>The v2 patch includes the following:
> >>>
> >>>. add new callback, lm_expire_lock, to lock_manager_operations to
> >>>   allow the lock manager to take appropriate action with
> >>>conflict lock.
> >>>
> >>>. handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.
> >>>
> >>>. expire courtesy client after 24hr if client has not reconnected.
> >>>
> >>>. do not allow expired client to become courtesy client if there are
> >>>   waiters for client's locks.
> >>>
> >>>. modify client_info_show to show courtesy client and seconds from
> >>>   last renew.
> >>>
> >>>. fix a problem with NFSv4.1 server where the it keeps returning
> >>>   SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after
> >>>   the courtesy client re-connects, causing the client to keep sending
> >>>   BCTS requests to server.
> >>>
> >>>The v3 patch includes the following:
> >>>
> >>>. modified posix_test_lock to check and resolve conflict locks
> >>>   to handle NLM TEST and NFSv4 LOCKT requests.
> >>>
> >>>. separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
> >>>
> >>>The v4 patch includes:
> >>>
> >>>. rework nfsd_check_courtesy to avoid dead lock of fl_lock and
> >>>client_lock
> >>>   by asking the laudromat thread to destroy the courtesy client.
> >>>
> >>>. handle NFSv4 share reservation conflicts with courtesy client. This
> >>>   includes conflicts between access mode and deny mode and vice versa.
> >>>
> >>>. drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
> >>>
> >>>The v5 patch includes:
> >>>
> >>>. fix recursive locking of file_rwsem from posix_lock_file.
> >>>
> >>>. retest with LOCKDEP enabled.
> >>>
> >>>NOTE: I will submit pynfs tests for courteous server including tests
> >>>for share reservation conflicts in a separate patch.
> >>>

2021-11-17 18:00:04

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>> Just a reminder that this patch is still waiting for your review.
> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
> failure for me....

Last time I ran the 4.0 OPEN18 test by itself and it passed. I will run
all OPEN tests together with 5.15-rc7 to see if the problem you've
seen is still there.

-Dai

> I'll see if I can get some time today.--b.
>
>> Thanks,
>> -Dai
>>
>> On 10/1/21 2:41 PM, [email protected] wrote:
>>> On 10/1/21 1:53 PM, J. Bruce Fields wrote:
>>>> On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
>>>>> Hi Bruce,
>>>>>
>>>>> This series of patches implement the NFSv4 Courteous Server.
>>>> Apologies, I keep meaning to get back to this and haven't yet.
>>>>
>>>> I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.
>>> It's weird, this test passes on my system:
>>>
>>>
>>> [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18
>>> INIT     st_setclientid.testValid : RUNNING
>>> INIT     st_setclientid.testValid : PASS
>>> MKFILE   st_open.testOpen : RUNNING
>>> MKFILE   st_open.testOpen : PASS
>>> OPEN18   st_open.testShareConflict1 : RUNNING
>>> OPEN18   st_open.testShareConflict1 : PASS
>>> **************************************************
>>> INIT     st_setclientid.testValid : PASS
>>> OPEN18   st_open.testShareConflict1 : PASS
>>> MKFILE   st_open.testOpen : PASS
>>> **************************************************
>>> Command line asked for 3 of 673 tests
>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed
>>> [root@nfsvmf25 nfs4.0]#
>>>
>>> Do you have a network trace?
>>>
>>> -Dai
>>>
>>>> --b.
>>>>
>>>>> A server which does not immediately expunge the state on lease
>>>>> expiration
>>>>> is known as a Courteous Server.  A Courteous Server continues
>>>>> to recognize
>>>>> previously generated state tokens as valid until conflict
>>>>> arises between
>>>>> the expired state and the requests from another client, or the server
>>>>> reboots.
>>>>>
>>>>> The v2 patch includes the following:
>>>>>
>>>>> . add new callback, lm_expire_lock, to lock_manager_operations to
>>>>>    allow the lock manager to take appropriate action with
>>>>> conflict lock.
>>>>>
>>>>> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.
>>>>>
>>>>> . expire courtesy client after 24hr if client has not reconnected.
>>>>>
>>>>> . do not allow expired client to become courtesy client if there are
>>>>>    waiters for client's locks.
>>>>>
>>>>> . modify client_info_show to show courtesy client and seconds from
>>>>>    last renew.
>>>>>
>>>>> . fix a problem with NFSv4.1 server where the it keeps returning
>>>>>    SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after
>>>>>    the courtesy client re-connects, causing the client to keep sending
>>>>>    BCTS requests to server.
>>>>>
>>>>> The v3 patch includes the following:
>>>>>
>>>>> . modified posix_test_lock to check and resolve conflict locks
>>>>>    to handle NLM TEST and NFSv4 LOCKT requests.
>>>>>
>>>>> . separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>>>>>
>>>>> The v4 patch includes:
>>>>>
>>>>> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and
>>>>> client_lock
>>>>>    by asking the laudromat thread to destroy the courtesy client.
>>>>>
>>>>> . handle NFSv4 share reservation conflicts with courtesy client. This
>>>>>    includes conflicts between access mode and deny mode and vice versa.
>>>>>
>>>>> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>>>>>
>>>>> The v5 patch includes:
>>>>>
>>>>> . fix recursive locking of file_rwsem from posix_lock_file.
>>>>>
>>>>> . retest with LOCKDEP enabled.
>>>>>
>>>>> NOTE: I will submit pynfs tests for courteous server including tests
>>>>> for share reservation conflicts in a separate patch.
>>>>>

2021-11-17 21:46:11

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/17/21 9:59 AM, [email protected] wrote:
>
> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>> Just a reminder that this patch is still waiting for your review.
>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>> failure for me....
>
> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
> all OPEN tests together with 5.15-rc7 to see if the problem you've
> seen still there.

I ran all tests in nfsv4.1 and nfsv4.0 with the courteous and non-courteous
5.15-rc7 servers.

NFSv4.1 results are the same for both the courteous and non-courteous server:
> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed

Results of nfs4.0 with non-courteous server:
>Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
test failed: LOCK24

Results of nfs4.0 with courteous server:
>Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
tests failed: LOCK24, OPEN18, OPEN30

OPEN18 and OPEN30 pass if each is run by itself.
I will look into this problem.

-Dai

>
> -Dai
>
>>   I'll see if I can get some time today.--b.
>>
>>> Thanks,
>>> -Dai
>>>
>>> On 10/1/21 2:41 PM, [email protected] wrote:
>>>> On 10/1/21 1:53 PM, J. Bruce Fields wrote:
>>>>> On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
>>>>>> Hi Bruce,
>>>>>>
>>>>>> This series of patches implement the NFSv4 Courteous Server.
>>>>> Apologies, I keep meaning to get back to this and haven't yet.
>>>>>
>>>>> I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.
>>>> It's weird, this test passes on my system:
>>>>
>>>>
>>>> [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18
>>>> INIT     st_setclientid.testValid : RUNNING
>>>> INIT     st_setclientid.testValid : PASS
>>>> MKFILE   st_open.testOpen : RUNNING
>>>> MKFILE   st_open.testOpen : PASS
>>>> OPEN18   st_open.testShareConflict1 : RUNNING
>>>> OPEN18   st_open.testShareConflict1 : PASS
>>>> **************************************************
>>>> INIT     st_setclientid.testValid : PASS
>>>> OPEN18   st_open.testShareConflict1 : PASS
>>>> MKFILE   st_open.testOpen : PASS
>>>> **************************************************
>>>> Command line asked for 3 of 673 tests
>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed
>>>> [root@nfsvmf25 nfs4.0]#
>>>>
>>>> Do you have a network trace?
>>>>
>>>> -Dai
>>>>
>>>>> --b.
>>>>>
>>>>>> A server which does not immediately expunge the state on lease
>>>>>> expiration
>>>>>> is known as a Courteous Server.  A Courteous Server continues
>>>>>> to recognize
>>>>>> previously generated state tokens as valid until conflict
>>>>>> arises between
>>>>>> the expired state and the requests from another client, or the
>>>>>> server
>>>>>> reboots.
>>>>>>
>>>>>> The v2 patch includes the following:
>>>>>>
>>>>>> . add new callback, lm_expire_lock, to lock_manager_operations to
>>>>>>     allow the lock manager to take appropriate action with
>>>>>> conflict lock.
>>>>>>
>>>>>> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.
>>>>>>
>>>>>> . expire courtesy client after 24hr if client has not reconnected.
>>>>>>
>>>>>> . do not allow expired client to become courtesy client if there are
>>>>>>     waiters for client's locks.
>>>>>>
>>>>>> . modify client_info_show to show courtesy client and seconds from
>>>>>>     last renew.
>>>>>>
>>>>>> . fix a problem with NFSv4.1 server where the it keeps returning
>>>>>>     SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after
>>>>>>     the courtesy client re-connects, causing the client to keep
>>>>>> sending
>>>>>>     BCTS requests to server.
>>>>>>
>>>>>> The v3 patch includes the following:
>>>>>>
>>>>>> . modified posix_test_lock to check and resolve conflict locks
>>>>>>     to handle NLM TEST and NFSv4 LOCKT requests.
>>>>>>
>>>>>> . separate out fix for back channel stuck in
>>>>>> SEQ4_STATUS_CB_PATH_DOWN.
>>>>>>
>>>>>> The v4 patch includes:
>>>>>>
>>>>>> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and
>>>>>> client_lock
>>>>>>     by asking the laudromat thread to destroy the courtesy client.
>>>>>>
>>>>>> . handle NFSv4 share reservation conflicts with courtesy client.
>>>>>> This
>>>>>>     includes conflicts between access mode and deny mode and vice
>>>>>> versa.
>>>>>>
>>>>>> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>>>>>>
>>>>>> The v5 patch includes:
>>>>>>
>>>>>> . fix recursive locking of file_rwsem from posix_lock_file.
>>>>>>
>>>>>> . retest with LOCKDEP enabled.
>>>>>>
>>>>>> NOTE: I will submit pynfs tests for courteous server including tests
>>>>>> for share reservation conflicts in a separate patch.
>>>>>>

2021-11-18 00:34:59

by J. Bruce Fields

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>
> On 11/17/21 9:59 AM, [email protected] wrote:
> >
> >On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> >>On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
> >>>Just a reminder that this patch is still waiting for your review.
> >>Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
> >>failure for me....
> >
> >Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
> >all OPEN tests together with 5.15-rc7 to see if the problem you've
> >seen still there.
>
> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
> 5.15-rc7 server.
>
> Nfs4.1 results are the same for both courteous and non-courteous server:
> >Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>
> Results of nfs4.0 with non-courteous server:
> >Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> test failed: LOCK24
>
> Results of nfs4.0 with courteous server:
> >Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
> tests failed: LOCK24, OPEN18, OPEN30
>
> OPEN18 and OPEN30 test pass if each is run by itself.

Could well be a bug in the tests, I don't know.

> I will look into this problem.

Thanks!

--b.

2021-11-22 03:04:44

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>> On 11/17/21 9:59 AM, [email protected] wrote:
>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>> Just a reminder that this patch is still waiting for your review.
>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>> failure for me....
>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>> seen still there.
>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>> 5.15-rc7 server.
>>
>> Nfs4.1 results are the same for both courteous and non-courteous server:
>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>> Results of nfs4.0 with non-courteous server:
>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>> test failed: LOCK24
>>
>> Results of nfs4.0 with courteous server:
>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>> tests failed: LOCK24, OPEN18, OPEN30
>>
>> OPEN18 and OPEN30 test pass if each is run by itself.
> Could well be a bug in the tests, I don't know.

The reason OPEN18 failed is that the test timed out waiting for
the reply to an OPEN call. The RPC connection used for the test was
configured with a 15-second timeout. Note that OPEN18 only fails when
the tests are run with the 'all' option; the test passes if it's run
by itself.

With the courteous server, by the time OPEN18 runs, there are about 1026
courtesy 4.0 clients on the server, and all of these clients have opened
the same file X with WRITE access. These clients were created by the
previous tests. After each test completes, since 4.0 does not have
sessions, the client states are not cleaned up immediately on the
server and the clients are allowed to become courtesy clients.

When OPEN18 runs (about 20 minutes after the first test started), it
sends an OPEN of file X with OPEN4_SHARE_DENY_WRITE, which causes the
server to check for conflicts with courtesy clients. The loop that
checks the 1026 courtesy clients for share/access conflicts took less
than 1 second. But it took about 55 seconds, on my VM, for the server
to expire all 1026 courtesy clients.
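
(For reference, the share/deny conflict check mentioned above can be
modeled by the standalone sketch below. This is not the nfsd code: the
struct and function names are invented for the example, and only the
bit values follow the NFSv4 protocol constants.)

/* Standalone model of an NFSv4 share reservation conflict check. */
#include <stdbool.h>
#include <stdio.h>

#define SHARE_ACCESS_READ  0x1   /* bit values follow the NFSv4 protocol */
#define SHARE_ACCESS_WRITE 0x2
#define SHARE_DENY_READ    0x1
#define SHARE_DENY_WRITE   0x2

struct share { unsigned int access, deny; };

/* A new OPEN conflicts with an existing open if either side denies
 * what the other side accesses. */
static bool share_conflict(struct share existing, struct share new_open)
{
	return (new_open.access & existing.deny) ||
	       (new_open.deny & existing.access);
}

int main(void)
{
	/* Existing courtesy-client open: WRITE access, no deny. */
	struct share old = { SHARE_ACCESS_WRITE, 0 };
	/* OPEN18-style request: READ access, DENY_WRITE. */
	struct share req = { SHARE_ACCESS_READ, SHARE_DENY_WRITE };

	printf("conflict: %s\n", share_conflict(old, req) ? "yes" : "no");
	return 0;
}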

I modified pynfs to configure the 4.0 RPC connection with a 60-second
timeout, and OPEN18 now consistently passes. The 4.0 test results are
now the same for the courteous and non-courteous server:

8 Skipped, 1 Failed, 0 Warned, 577 Passed

Note that 4.1 tests do not suffer this timeout problem because the
4.1 clients and sessions are destroyed after each test completes.


-Dai

>> I will look into this problem.
> Thanks!
>
> --b.

2021-11-29 17:15:28

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

Hi Bruce,

On 11/21/21 7:04 PM, [email protected] wrote:
>
> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>> failure for me....
>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>> seen still there.
>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>> 5.15-rc7 server.
>>>
>>> Nfs4.1 results are the same for both courteous and non-courteous
>>> server:
>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>> Results of nfs4.0 with non-courteous server:
>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>> test failed: LOCK24
>>>
>>> Results of nfs4.0 with courteous server:
>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>> tests failed: LOCK24, OPEN18, OPEN30
>>>
>>> OPEN18 and OPEN30 test pass if each is run by itself.
>> Could well be a bug in the tests, I don't know.
>
> The reason OPEN18 failed was because the test timed out waiting for
> the reply of an OPEN call. The RPC connection used for the test was
> configured with 15 secs timeout. Note that OPEN18 only fails when
> the tests were run with 'all' option, this test passes if it's run
> by itself.
>
> With courteous server, by the time OPEN18 runs, there are about 1026
> courtesy 4.0 clients on the server and all of these clients have opened
> the same file X with WRITE access. These clients were created by the
> previous tests. After each test completed, since 4.0 does not have
> session, the client states are not cleaned up immediately on the
> server and are allowed to become courtesy clients.
>
> When OPEN18 runs (about 20 minutes after the 1st test started), it
> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
> server to check for conflicts with courtesy clients. The loop that
> checks 1026 courtesy clients for share/access conflict took less
> than 1 sec. But it took about 55 secs, on my VM, for the server
> to expire all 1026 courtesy clients.
>
> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
> timeout and OPEN18 now consistently passed. The 4.0 test results are
> now the same for courteous and non-courteous server:
>
> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>
> Note that 4.1 tests do not suffer this timeout problem because the
> 4.1 clients and sessions are destroyed after each test completes.

Do you want me to send the patch to increase the timeout for pynfs,
or are there other things you think we should do?

Thanks,
-Dai


2021-11-29 18:34:29

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/29/21 9:30 AM, J. Bruce Fields wrote:
> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>> Hi Bruce,
>>
>> On 11/21/21 7:04 PM, [email protected] wrote:
>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>> failure for me....
>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>> seen still there.
>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>> 5.15-rc7 server.
>>>>>
>>>>> Nfs4.1 results are the same for both courteous and
>>>>> non-courteous server:
>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>> Results of nfs4.0 with non-courteous server:
>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>> test failed: LOCK24
>>>>>
>>>>> Results of nfs4.0 with courteous server:
>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>
>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>> Could well be a bug in the tests, I don't know.
>>> The reason OPEN18 failed was because the test timed out waiting for
>>> the reply of an OPEN call. The RPC connection used for the test was
>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>> the tests were run with 'all' option, this test passes if it's run
>>> by itself.
>>>
>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>> courtesy 4.0 clients on the server and all of these clients have opened
>>> the same file X with WRITE access. These clients were created by the
>>> previous tests. After each test completed, since 4.0 does not have
>>> session, the client states are not cleaned up immediately on the
>>> server and are allowed to become courtesy clients.
>>>
>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>> server to check for conflicts with courtesy clients. The loop that
>>> checks 1026 courtesy clients for share/access conflict took less
>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>> to expire all 1026 courtesy clients.
>>>
>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>> now the same for courteous and non-courteous server:
>>>
>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>
>>> Note that 4.1 tests do not suffer this timeout problem because the
>>> 4.1 clients and sessions are destroyed after each test completes.
>> Do you want me to send the patch to increase the timeout for pynfs?
>> or is there any other things you think we should do?
> I don't know.
>
> 55 seconds to clean up 1026 clients is about 50ms per client, which is
> pretty slow. I wonder why. I guess it's probably updating the stable
> storage information. Is /var/lib/nfs/ on your server backed by a hard
> drive or an SSD or something else?

My server is a VirtualBox VM that has 1 CPU, 4GB RAM, and a 64GB hard
disk. I think a production system that supports this many clients should
have faster CPUs and faster storage.

>
> I wonder if that's an argument for limiting the number of courtesy
> clients.

I think we might want to treat 4.0 clients a bit differently from 4.1
clients. With 4.0, every client will become a courtesy client after
the client is done with the export and unmounts it. Since there is
no destroy session/client with 4.0, the courteous server allows the
client to stick around and become a courtesy client. So after a while,
even with normal usage, there will be lots of 4.0 courtesy clients
hanging around, and these clients won't be destroyed until 24 hours
later, or until they cause conflicts with other clients.

We can reduce the courtesy_client_expiry time for 4.0 clients from
24 hours to 15-20 minutes, enough for most network partitions to heal,
or limit the number of 4.0 courtesy clients. Or we could not support 4.0
clients at all, which is my preference, since I think in general users
should skip 4.0 and use 4.1 instead.

-Dai


2021-11-29 19:06:15

by Chuck Lever III

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

Hello Dai!


> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>
>
> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>> Hi Bruce,
>>>
>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>>> failure for me....
>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>>> seen still there.
>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>>> 5.15-rc7 server.
>>>>>>
>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>> non-courteous server:
>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>> test failed: LOCK24
>>>>>>
>>>>>> Results of nfs4.0 with courteous server:
>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>
>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>> Could well be a bug in the tests, I don't know.
>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>> the tests were run with 'all' option, this test passes if it's run
>>>> by itself.
>>>>
>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>> courtesy 4.0 clients on the server and all of these clients have opened
>>>> the same file X with WRITE access. These clients were created by the
>>>> previous tests. After each test completed, since 4.0 does not have
>>>> session, the client states are not cleaned up immediately on the
>>>> server and are allowed to become courtesy clients.
>>>>
>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>> server to check for conflicts with courtesy clients. The loop that
>>>> checks 1026 courtesy clients for share/access conflict took less
>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>> to expire all 1026 courtesy clients.
>>>>
>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>> now the same for courteous and non-courteous server:
>>>>
>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>
>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>> 4.1 clients and sessions are destroyed after each test completes.
>>> Do you want me to send the patch to increase the timeout for pynfs?
>>> or is there any other things you think we should do?
>> I don't know.
>>
>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>> pretty slow. I wonder why. I guess it's probably updating the stable
>> storage information. Is /var/lib/nfs/ on your server backed by a hard
>> drive or an SSD or something else?
>
> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
> disk. I think a production system that supports this many clients should
> have faster CPUs, faster storage.
>
>>
>> I wonder if that's an argument for limiting the number of courtesy
>> clients.
>
> I think we might want to treat 4.0 clients a bit different from 4.1
> clients. With 4.0, every client will become a courtesy client after
> the client is done with the export and unmounts it.

It should be safe for a server to purge a client's lease immediately
if there is no open or lock state associated with it.

When an NFSv4.0 client unmounts, all files should be closed at that
point, so the server can wait for the lease to expire and purge it
normally. Or am I missing something?


> Since there is
> no destroy session/client with 4.0, the courteous server allows the
> client to be around and becomes a courtesy client. So after awhile,
> even with normal usage, there will be lots 4.0 courtesy clients
> hanging around and these clients won't be destroyed until 24hrs
> later, or until they cause conflicts with other clients.
>
> We can reduce the courtesy_client_expiry time for 4.0 clients from
> 24hrs to 15/20 mins, enough for most network partition to heal?,
> or limit the number of 4.0 courtesy clients. Or don't support 4.0
> clients at all which is my preference since I think in general users
> should skip 4.0 and use 4.1 instead.
>
> -Dai

--
Chuck Lever




2021-11-29 19:41:11

by Dai Ngo

Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/29/21 11:13 AM, Bruce Fields wrote:
> On Mon, Nov 29, 2021 at 07:03:12PM +0000, Chuck Lever III wrote:
>> Hello Dai!
>>
>>
>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>>>
>>>
>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>>>> Hi Bruce,
>>>>>
>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>>>>> failure for me....
>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>>>>> seen still there.
>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>>>>> 5.15-rc7 server.
>>>>>>>>
>>>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>>>> non-courteous server:
>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>> test failed: LOCK24
>>>>>>>>
>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>
>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>>>> the tests were run with 'all' option, this test passes if it's run
>>>>>> by itself.
>>>>>>
>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>>>> courtesy 4.0 clients on the server and all of these clients have opened
>>>>>> the same file X with WRITE access. These clients were created by the
>>>>>> previous tests. After each test completed, since 4.0 does not have
>>>>>> session, the client states are not cleaned up immediately on the
>>>>>> server and are allowed to become courtesy clients.
>>>>>>
>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>>>> server to check for conflicts with courtesy clients. The loop that
>>>>>> checks 1026 courtesy clients for share/access conflict took less
>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>>>> to expire all 1026 courtesy clients.
>>>>>>
>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>>>> now the same for courteous and non-courteous server:
>>>>>>
>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>
>>>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>>>> 4.1 clients and sessions are destroyed after each test completes.
>>>>> Do you want me to send the patch to increase the timeout for pynfs?
>>>>> or is there any other things you think we should do?
>>>> I don't know.
>>>>
>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>>>> pretty slow. I wonder why. I guess it's probably updating the stable
>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard
>>>> drive or an SSD or something else?
>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
>>> disk. I think a production system that supports this many clients should
>>> have faster CPUs, faster storage.
>>>
>>>> I wonder if that's an argument for limiting the number of courtesy
>>>> clients.
>>> I think we might want to treat 4.0 clients a bit different from 4.1
>>> clients. With 4.0, every client will become a courtesy client after
>>> the client is done with the export and unmounts it.
>> It should be safe for a server to purge a client's lease immediately
>> if there is no open or lock state associated with it.
>>
>> When an NFSv4.0 client unmounts, all files should be closed at that
>> point, so the server can wait for the lease to expire and purge it
>> normally. Or am I missing something?
> Makes sense to me!
>
>>> Since there is
>>> no destroy session/client with 4.0, the courteous server allows the
>>> client to be around and becomes a courtesy client. So after awhile,
>>> even with normal usage, there will be lots 4.0 courtesy clients
>>> hanging around and these clients won't be destroyed until 24hrs
>>> later, or until they cause conflicts with other clients.
>>>
>>> We can reduce the courtesy_client_expiry time for 4.0 clients from
>>> 24hrs to 15/20 mins, enough for most network partition to heal?,
>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0
>>> clients at all which is my preference since I think in general users
>>> should skip 4.0 and use 4.1 instead.
> I'm also totally fine with leaving out 4.0, at least to start.

Ok Bruce, I will submit v6 patch for this.

Thanks,
-Dai

>
> --b.

2021-11-29 19:53:01

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/29/21 11:03 AM, Chuck Lever III wrote:
> Hello Dai!
>
>
>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>>
>>
>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>>> Hi Bruce,
>>>>
>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>>>> failure for me....
>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>>>> seen still there.
>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>>>> 5.15-rc7 server.
>>>>>>>
>>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>>> non-courteous server:
>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>> test failed: LOCK24
>>>>>>>
>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>
>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>>> Could well be a bug in the tests, I don't know.
>>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>>> the tests were run with 'all' option, this test passes if it's run
>>>>> by itself.
>>>>>
>>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>>> courtesy 4.0 clients on the server and all of these clients have opened
>>>>> the same file X with WRITE access. These clients were created by the
>>>>> previous tests. After each test completed, since 4.0 does not have
>>>>> session, the client states are not cleaned up immediately on the
>>>>> server and are allowed to become courtesy clients.
>>>>>
>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>>> server to check for conflicts with courtesy clients. The loop that
>>>>> checks 1026 courtesy clients for share/access conflict took less
>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>>> to expire all 1026 courtesy clients.
>>>>>
>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>>> now the same for courteous and non-courteous server:
>>>>>
>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>
>>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>>> 4.1 clients and sessions are destroyed after each test completes.
>>>> Do you want me to send the patch to increase the timeout for pynfs?
>>>> or is there any other things you think we should do?
>>> I don't know.
>>>
>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>>> pretty slow. I wonder why. I guess it's probably updating the stable
>>> storage information. Is /var/lib/nfs/ on your server backed by a hard
>>> drive or an SSD or something else?
>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
>> disk. I think a production system that supports this many clients should
>> have faster CPUs, faster storage.
>>
>>> I wonder if that's an argument for limiting the number of courtesy
>>> clients.
>> I think we might want to treat 4.0 clients a bit different from 4.1
>> clients. With 4.0, every client will become a courtesy client after
>> the client is done with the export and unmounts it.
> It should be safe for a server to purge a client's lease immediately
> if there is no open or lock state associated with it.

In this case, each client has opened files so there are open states
associated with them.

>
> When an NFSv4.0 client unmounts, all files should be closed at that
> point,

I'm not sure pynfs does proper cleanup after each subtest; I will
check. There must be state associated with the client in order for
it to become a courtesy client.

> so the server can wait for the lease to expire and purge it
> normally. Or am I missing something?

When a 4.0 client's lease expires and there is still state associated
with the client, the server allows this client to become a courtesy
client.

-Dai

>
>
>> Since there is
>> no destroy session/client with 4.0, the courteous server allows the
>> client to be around and becomes a courtesy client. So after awhile,
>> even with normal usage, there will be lots 4.0 courtesy clients
>> hanging around and these clients won't be destroyed until 24hrs
>> later, or until they cause conflicts with other clients.
>>
>> We can reduce the courtesy_client_expiry time for 4.0 clients from
>> 24hrs to 15/20 mins, enough for most network partition to heal?,
>> or limit the number of 4.0 courtesy clients. Or don't support 4.0
>> clients at all which is my preference since I think in general users
>> should skip 4.0 and use 4.1 instead.
>>
>> -Dai
> --
> Chuck Lever
>
>
>

2021-11-29 21:03:33

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/29/21 11:36 AM, [email protected] wrote:
>
> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>> Hello Dai!
>>
>>
>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>>>
>>>
>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>>>> Hi Bruce,
>>>>>
>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected]
>>>>>>>>>> wrote:
>>>>>>>>>>> Just a reminder that this patch is still waiting for your
>>>>>>>>>>> review.
>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the
>>>>>>>>>> pynfs
>>>>>>>>>> failure for me....
>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I
>>>>>>>>> will run
>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem
>>>>>>>>> you've
>>>>>>>>> seen still there.
>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and
>>>>>>>> non-courteous
>>>>>>>> 5.15-rc7 server.
>>>>>>>>
>>>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>>>> non-courteous server:
>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>> test failed: LOCK24
>>>>>>>>
>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>
>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>>>> the tests were run with 'all' option, this test passes if it's run
>>>>>> by itself.
>>>>>>
>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>>>> courtesy 4.0 clients on the server and all of these clients have
>>>>>> opened
>>>>>> the same file X with WRITE access. These clients were created by the
>>>>>> previous tests. After each test completed, since 4.0 does not have
>>>>>> session, the client states are not cleaned up immediately on the
>>>>>> server and are allowed to become courtesy clients.
>>>>>>
>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>>>> server to check for conflicts with courtesy clients. The loop that
>>>>>> checks 1026 courtesy clients for share/access conflict took less
>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>>>> to expire all 1026 courtesy clients.
>>>>>>
>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>>>> now the same for courteous and non-courteous server:
>>>>>>
>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>
>>>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>>>> 4.1 clients and sessions are destroyed after each test completes.
>>>>> Do you want me to send the patch to increase the timeout for pynfs?
>>>>> or is there any other things you think we should do?
>>>> I don't know.
>>>>
>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>>>> pretty slow.  I wonder why.  I guess it's probably updating the stable
>>>> storage information.  Is /var/lib/nfs/ on your server backed by a hard
>>>> drive or an SSD or something else?
>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
>>> disk. I think a production system that supports this many clients
>>> should
>>> have faster CPUs, faster storage.
>>>
>>>> I wonder if that's an argument for limiting the number of courtesy
>>>> clients.
>>> I think we might want to treat 4.0 clients a bit different from 4.1
>>> clients. With 4.0, every client will become a courtesy client after
>>> the client is done with the export and unmounts it.
>> It should be safe for a server to purge a client's lease immediately
>> if there is no open or lock state associated with it.
>
> In this case, each client has opened files so there are open states
> associated with them.
>
>>
>> When an NFSv4.0 client unmounts, all files should be closed at that
>> point,
>
> I'm not sure pynfs does proper clean up after each subtest, I will
> check. There must be state associated with the client in order for
> it to become courtesy client.

The pynfs 4.0 tests use LOOKUP and OPEN with OPEN4_CREATE to create
the test file, and use PUTFH and REMOVE to remove the test file when
done. I don't see where the open state associated with the removed
file is freed by nfsd_remove. I guess for 4.0 the open state remains
valid on the server until the client's lease expires.

I attached the pcap of OPEN18 test for reference.

-Dai

>
>> so the server can wait for the lease to expire and purge it
>> normally. Or am I missing something?
>
> When 4.0 client lease expires and there are still states associated
> with the client then the server allows this client to become courtesy
> client.
>
> -Dai
>
>>
>>
>>> Since there is
>>> no destroy session/client with 4.0, the courteous server allows the
>>> client to be around and becomes a courtesy client. So after awhile,
>>> even with normal usage, there will be lots 4.0 courtesy clients
>>> hanging around and these clients won't be destroyed until 24hrs
>>> later, or until they cause conflicts with other clients.
>>>
>>> We can reduce the courtesy_client_expiry time for 4.0 clients from
>>> 24hrs to 15/20 mins, enough for most network partition to heal?,
>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0
>>> clients at all which is my preference since I think in general users
>>> should skip 4.0 and use 4.1 instead.
>>>
>>> -Dai
>> --
>> Chuck Lever
>>
>>
>>


Attachments:
pynfs_open18_40.pcap (15.86 kB)

2021-11-29 21:12:51

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server



> On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]> wrote:
>
>
> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>> Hello Dai!
>>
>>
>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>>>
>>>
>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>>>> Hi Bruce,
>>>>>
>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>>>>> failure for me....
>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>>>>> seen still there.
>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>>>>> 5.15-rc7 server.
>>>>>>>>
>>>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>>>> non-courteous server:
>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>> test failed: LOCK24
>>>>>>>>
>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>
>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>>>> the tests were run with 'all' option, this test passes if it's run
>>>>>> by itself.
>>>>>>
>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>>>> courtesy 4.0 clients on the server and all of these clients have opened
>>>>>> the same file X with WRITE access. These clients were created by the
>>>>>> previous tests. After each test completed, since 4.0 does not have
>>>>>> session, the client states are not cleaned up immediately on the
>>>>>> server and are allowed to become courtesy clients.
>>>>>>
>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>>>> server to check for conflicts with courtesy clients. The loop that
>>>>>> checks 1026 courtesy clients for share/access conflict took less
>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>>>> to expire all 1026 courtesy clients.
>>>>>>
>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>>>> now the same for courteous and non-courteous server:
>>>>>>
>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>
>>>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>>>> 4.1 clients and sessions are destroyed after each test completes.
>>>>> Do you want me to send the patch to increase the timeout for pynfs?
>>>>> or is there any other things you think we should do?
>>>> I don't know.
>>>>
>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>>>> pretty slow. I wonder why. I guess it's probably updating the stable
>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard
>>>> drive or an SSD or something else?
>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
>>> disk. I think a production system that supports this many clients should
>>> have faster CPUs, faster storage.
>>>
>>>> I wonder if that's an argument for limiting the number of courtesy
>>>> clients.
>>> I think we might want to treat 4.0 clients a bit different from 4.1
>>> clients. With 4.0, every client will become a courtesy client after
>>> the client is done with the export and unmounts it.
>> It should be safe for a server to purge a client's lease immediately
>> if there is no open or lock state associated with it.
>
> In this case, each client has opened files so there are open states
> associated with them.
>
>>
>> When an NFSv4.0 client unmounts, all files should be closed at that
>> point,
>
> I'm not sure pynfs does proper clean up after each subtest, I will
> check. There must be state associated with the client in order for
> it to become courtesy client.

Makes sense. Then a synthetic client like pynfs can DoS a courteous
server.


>> so the server can wait for the lease to expire and purge it
>> normally. Or am I missing something?
>
> When 4.0 client lease expires and there are still states associated
> with the client then the server allows this client to become courtesy
> client.

I think the same thing happens if an NFSv4.1 client neglects to send
DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken
or malicious, but the server faces the same issue of protecting
itself from a DoS attack.

IMO you should consider limiting the number of courteous clients
the server can hold onto. Let's say that number is 1000. When the
server wants to turn a 1001st client into a courteous client, it
can simply expire and purge the oldest courteous client on its
list. Otherwise, over time, the 24-hour expiry will reduce the
set of courteous clients back to zero.

What do you think?
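
A toy standalone C sketch of that idea, using an oldest-first list
with a small made-up cap; COURTESY_MAX and add_courtesy_client are
invented names, and this is not the nfsd implementation:

/*
 * Toy model of the suggested cap -- not nfsd code.
 */
#include <stdio.h>
#include <stdlib.h>

#define COURTESY_MAX 3		/* the real limit might be ~1000 */

struct node {
	int client_id;
	struct node *next;
};

static struct node *head, *tail;	/* oldest at head, newest at tail */
static int count;

static void expire_client(int id)
{
	printf("expiring courtesy client %d\n", id);
}

/* Admit a new courtesy client, evicting the oldest if at the cap. */
static void add_courtesy_client(int id)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return;
	n->client_id = id;
	n->next = NULL;

	if (count == COURTESY_MAX) {
		struct node *oldest = head;

		head = oldest->next;
		if (!head)
			tail = NULL;
		expire_client(oldest->client_id);
		free(oldest);
		count--;
	}

	if (tail)
		tail->next = n;
	else
		head = n;
	tail = n;
	count++;
}

int main(void)
{
	for (int id = 1; id <= 5; id++)
		add_courtesy_client(id);	/* clients 1 and 2 get evicted */
	return 0;
}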


>>> Since there is
>>> no destroy session/client with 4.0, the courteous server allows the
>>> client to be around and becomes a courtesy client. So after awhile,
>>> even with normal usage, there will be lots 4.0 courtesy clients
>>> hanging around and these clients won't be destroyed until 24hrs
>>> later, or until they cause conflicts with other clients.
>>>
>>> We can reduce the courtesy_client_expiry time for 4.0 clients from
>>> 24hrs to 15/20 mins, enough for most network partition to heal?,
>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0
>>> clients at all which is my preference since I think in general users
>>> should skip 4.0 and use 4.1 instead.
>>>
>>> -Dai
>> --
>> Chuck Lever
>>
>>
>>

--
Chuck Lever




2021-11-29 22:16:54

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Mon, Nov 29, 2021 at 07:03:12PM +0000, Chuck Lever III wrote:
> Hello Dai!
>
>
> > On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
> >
> >
> > On 11/29/21 9:30 AM, J. Bruce Fields wrote:
> >> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
> >>> Hi Bruce,
> >>>
> >>> On 11/21/21 7:04 PM, [email protected] wrote:
> >>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> >>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
> >>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
> >>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> >>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
> >>>>>>>>> Just a reminder that this patch is still waiting for your review.
> >>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
> >>>>>>>> failure for me....
> >>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
> >>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
> >>>>>>> seen still there.
> >>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
> >>>>>> 5.15-rc7 server.
> >>>>>>
> >>>>>> Nfs4.1 results are the same for both courteous and
> >>>>>> non-courteous server:
> >>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
> >>>>>> Results of nfs4.0 with non-courteous server:
> >>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> >>>>>> test failed: LOCK24
> >>>>>>
> >>>>>> Results of nfs4.0 with courteous server:
> >>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
> >>>>>> tests failed: LOCK24, OPEN18, OPEN30
> >>>>>>
> >>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
> >>>>> Could well be a bug in the tests, I don't know.
> >>>> The reason OPEN18 failed was because the test timed out waiting for
> >>>> the reply of an OPEN call. The RPC connection used for the test was
> >>>> configured with 15 secs timeout. Note that OPEN18 only fails when
> >>>> the tests were run with 'all' option, this test passes if it's run
> >>>> by itself.
> >>>>
> >>>> With courteous server, by the time OPEN18 runs, there are about 1026
> >>>> courtesy 4.0 clients on the server and all of these clients have opened
> >>>> the same file X with WRITE access. These clients were created by the
> >>>> previous tests. After each test completed, since 4.0 does not have
> >>>> session, the client states are not cleaned up immediately on the
> >>>> server and are allowed to become courtesy clients.
> >>>>
> >>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
> >>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
> >>>> server to check for conflicts with courtesy clients. The loop that
> >>>> checks 1026 courtesy clients for share/access conflict took less
> >>>> than 1 sec. But it took about 55 secs, on my VM, for the server
> >>>> to expire all 1026 courtesy clients.
> >>>>
> >>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
> >>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
> >>>> now the same for courteous and non-courteous server:
> >>>>
> >>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> >>>>
> >>>> Note that 4.1 tests do not suffer this timeout problem because the
> >>>> 4.1 clients and sessions are destroyed after each test completes.
> >>> Do you want me to send the patch to increase the timeout for pynfs?
> >>> or is there any other things you think we should do?
> >> I don't know.
> >>
> >> 55 seconds to clean up 1026 clients is about 50ms per client, which is
> >> pretty slow. I wonder why. I guess it's probably updating the stable
> >> storage information. Is /var/lib/nfs/ on your server backed by a hard
> >> drive or an SSD or something else?
> >
> > My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
> > disk. I think a production system that supports this many clients should
> > have faster CPUs, faster storage.
> >
> >>
> >> I wonder if that's an argument for limiting the number of courtesy
> >> clients.
> >
> > I think we might want to treat 4.0 clients a bit different from 4.1
> > clients. With 4.0, every client will become a courtesy client after
> > the client is done with the export and unmounts it.
>
> It should be safe for a server to purge a client's lease immediately
> if there is no open or lock state associated with it.
>
> When an NFSv4.0 client unmounts, all files should be closed at that
> point, so the server can wait for the lease to expire and purge it
> normally. Or am I missing something?

Makes sense to me!

> > Since there is
> > no destroy session/client with 4.0, the courteous server allows the
> > client to be around and becomes a courtesy client. So after awhile,
> > even with normal usage, there will be lots 4.0 courtesy clients
> > hanging around and these clients won't be destroyed until 24hrs
> > later, or until they cause conflicts with other clients.
> >
> > We can reduce the courtesy_client_expiry time for 4.0 clients from
> > 24hrs to 15/20 mins, enough for most network partition to heal?,
> > or limit the number of 4.0 courtesy clients. Or don't support 4.0
> > clients at all which is my preference since I think in general users
> > should skip 4.0 and use 4.1 instead.

I'm also totally fine with leaving out 4.0, at least to start.

--b.

2021-11-29 22:44:38

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
> Hi Bruce,
>
> On 11/21/21 7:04 PM, [email protected] wrote:
> >
> >On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> >>On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
> >>>On 11/17/21 9:59 AM, [email protected] wrote:
> >>>>On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> >>>>>On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
> >>>>>>Just a reminder that this patch is still waiting for your review.
> >>>>>Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
> >>>>>failure for me....
> >>>>Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
> >>>>all OPEN tests together with 5.15-rc7 to see if the problem you've
> >>>>seen still there.
> >>>I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
> >>>5.15-rc7 server.
> >>>
> >>>Nfs4.1 results are the same for both courteous and
> >>>non-courteous server:
> >>>>Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
> >>>Results of nfs4.0 with non-courteous server:
> >>>>Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> >>>test failed: LOCK24
> >>>
> >>>Results of nfs4.0 with courteous server:
> >>>>Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
> >>>tests failed: LOCK24, OPEN18, OPEN30
> >>>
> >>>OPEN18 and OPEN30 test pass if each is run by itself.
> >>Could well be a bug in the tests, I don't know.
> >
> >The reason OPEN18 failed was because the test timed out waiting for
> >the reply of an OPEN call. The RPC connection used for the test was
> >configured with 15 secs timeout. Note that OPEN18 only fails when
> >the tests were run with 'all' option, this test passes if it's run
> >by itself.
> >
> >With courteous server, by the time OPEN18 runs, there are about 1026
> >courtesy 4.0 clients on the server and all of these clients have opened
> >the same file X with WRITE access. These clients were created by the
> >previous tests. After each test completed, since 4.0 does not have
> >session, the client states are not cleaned up immediately on the
> >server and are allowed to become courtesy clients.
> >
> >When OPEN18 runs (about 20 minutes after the 1st test started), it
> >sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
> >server to check for conflicts with courtesy clients. The loop that
> >checks 1026 courtesy clients for share/access conflict took less
> >than 1 sec. But it took about 55 secs, on my VM, for the server
> >to expire all 1026 courtesy clients.
> >
> >I modified pynfs to configure the 4.0 RPC connection with 60 seconds
> >timeout and OPEN18 now consistently passed. The 4.0 test results are
> >now the same for courteous and non-courteous server:
> >
> >8 Skipped, 1 Failed, 0 Warned, 577 Passed
> >
> >Note that 4.1 tests do not suffer this timeout problem because the
> >4.1 clients and sessions are destroyed after each test completes.
>
> Do you want me to send the patch to increase the timeout for pynfs?
> or is there any other things you think we should do?

I don't know.

55 seconds to clean up 1026 clients is about 50ms per client, which is
pretty slow. I wonder why. I guess it's probably updating the stable
storage information. Is /var/lib/nfs/ on your server backed by a hard
drive or an SSD or something else?

I wonder if that's an argument for limiting the number of courtesy
clients.

--b.

2021-11-30 00:12:07

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/29/21 1:10 PM, Chuck Lever III wrote:
>
>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]> wrote:
>>
>>
>> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>>> Hello Dai!
>>>
>>>
>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>>>>
>>>>
>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>>>>> Hi Bruce,
>>>>>>
>>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>>>>>> failure for me....
>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>>>>>> seen still there.
>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>>>>>> 5.15-rc7 server.
>>>>>>>>>
>>>>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>>>>> non-courteous server:
>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>> test failed: LOCK24
>>>>>>>>>
>>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>>
>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>>>>> the tests were run with 'all' option, this test passes if it's run
>>>>>>> by itself.
>>>>>>>
>>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>>>>> courtesy 4.0 clients on the server and all of these clients have opened
>>>>>>> the same file X with WRITE access. These clients were created by the
>>>>>>> previous tests. After each test completed, since 4.0 does not have
>>>>>>> session, the client states are not cleaned up immediately on the
>>>>>>> server and are allowed to become courtesy clients.
>>>>>>>
>>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>>>>> server to check for conflicts with courtesy clients. The loop that
>>>>>>> checks 1026 courtesy clients for share/access conflict took less
>>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>>>>> to expire all 1026 courtesy clients.
>>>>>>>
>>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>>>>> now the same for courteous and non-courteous server:
>>>>>>>
>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>
>>>>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>>>>> 4.1 clients and sessions are destroyed after each test completes.
>>>>>> Do you want me to send the patch to increase the timeout for pynfs?
>>>>>> or is there any other things you think we should do?
>>>>> I don't know.
>>>>>
>>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>>>>> pretty slow. I wonder why. I guess it's probably updating the stable
>>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard
>>>>> drive or an SSD or something else?
>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
>>>> disk. I think a production system that supports this many clients should
>>>> have faster CPUs, faster storage.
>>>>
>>>>> I wonder if that's an argument for limiting the number of courtesy
>>>>> clients.
>>>> I think we might want to treat 4.0 clients a bit different from 4.1
>>>> clients. With 4.0, every client will become a courtesy client after
>>>> the client is done with the export and unmounts it.
>>> It should be safe for a server to purge a client's lease immediately
>>> if there is no open or lock state associated with it.
>> In this case, each client has opened files so there are open states
>> associated with them.
>>
>>> When an NFSv4.0 client unmounts, all files should be closed at that
>>> point,
>> I'm not sure pynfs does proper clean up after each subtest, I will
>> check. There must be state associated with the client in order for
>> it to become courtesy client.
> Makes sense. Then a synthetic client like pynfs can DoS a courteous
> server.
>
>
>>> so the server can wait for the lease to expire and purge it
>>> normally. Or am I missing something?
>> When 4.0 client lease expires and there are still states associated
>> with the client then the server allows this client to become courtesy
>> client.
> I think the same thing happens if an NFSv4.1 client neglects to send
> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken
> or malicious, but the server faces the same issue of protecting
> itself from a DoS attack.
>
> IMO you should consider limiting the number of courteous clients
> the server can hold onto. Let's say that number is 1000. When the
> server wants to turn a 1001st client into a courteous client, it
> can simply expire and purge the oldest courteous client on its
> list. Otherwise, over time, the 24-hour expiry will reduce the
> set of courteous clients back to zero.
>
> What do you think?

Limiting the number of courteous clients to handle the cases of
broken/malicious 4.1 clients seems reasonable as a last resort.

I think if a malicious 4.1 client could mount the server's export,
open a file (to create state), and repeat the same with a different
client ID, then some basic security was already broken: unauthorized
clients were allowed to mount the server's exports.

I think if we have to enforce a limit, then it's only for handling
seriously buggy 4.1 clients, which should not be the norm. The issue
with this is how to pick an optimal number that is suitable for the
running server, which can be very slow or very fast.

Note that even if we impose a limit, that does not completely solve
the problem with the pynfs 4.0 tests, since their RPC timeout is
configured at 15 secs, which is just enough to expire about 277
clients at 53ms per client, unless we limit it to ~270 clients, which
I think is too low.

This is what I plan to do:

1. do not support 4.0 courteous clients, for sure.

2. limit the number of courteous clients to 1000 (?), if you still
think we need it.

Pls let me know what you think.

Thanks,
-Dai

>
>
>>>> Since there is
>>>> no destroy session/client with 4.0, the courteous server allows the
>>>> client to be around and becomes a courtesy client. So after awhile,
>>>> even with normal usage, there will be lots 4.0 courtesy clients
>>>> hanging around and these clients won't be destroyed until 24hrs
>>>> later, or until they cause conflicts with other clients.
>>>>
>>>> We can reduce the courtesy_client_expiry time for 4.0 clients from
>>>> 24hrs to 15/20 mins, enough for most network partition to heal?,
>>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0
>>>> clients at all which is my preference since I think in general users
>>>> should skip 4.0 and use 4.1 instead.
>>>>
>>>> -Dai
>>> --
>>> Chuck Lever
>>>
>>>
>>>
> --
> Chuck Lever
>
>
>

2021-11-30 01:42:48

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


> On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]> wrote:
>
> 
>> On 11/29/21 1:10 PM, Chuck Lever III wrote:
>>
>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]> wrote:
>>>
>>>
>>> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>>>> Hello Dai!
>>>>
>>>>
>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>>>>>> Hi Bruce,
>>>>>>>
>>>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>>>>>>> failure for me....
>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>>>>>>> seen still there.
>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>>>>>>> 5.15-rc7 server.
>>>>>>>>>>
>>>>>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>>>>>> non-courteous server:
>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>>> test failed: LOCK24
>>>>>>>>>>
>>>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>>>
>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>>>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>>>>>> the tests were run with 'all' option, this test passes if it's run
>>>>>>>> by itself.
>>>>>>>>
>>>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>>>>>> courtesy 4.0 clients on the server and all of these clients have opened
>>>>>>>> the same file X with WRITE access. These clients were created by the
>>>>>>>> previous tests. After each test completed, since 4.0 does not have
>>>>>>>> session, the client states are not cleaned up immediately on the
>>>>>>>> server and are allowed to become courtesy clients.
>>>>>>>>
>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>>>>>> server to check for conflicts with courtesy clients. The loop that
>>>>>>>> checks 1026 courtesy clients for share/access conflict took less
>>>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>>>>>> to expire all 1026 courtesy clients.
>>>>>>>>
>>>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>>>>>> now the same for courteous and non-courteous server:
>>>>>>>>
>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>
>>>>>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>>>>>> 4.1 clients and sessions are destroyed after each test completes.
>>>>>>> Do you want me to send the patch to increase the timeout for pynfs?
>>>>>>> or is there any other things you think we should do?
>>>>>> I don't know.
>>>>>>
>>>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>>>>>> pretty slow. I wonder why. I guess it's probably updating the stable
>>>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard
>>>>>> drive or an SSD or something else?
>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
>>>>> disk. I think a production system that supports this many clients should
>>>>> have faster CPUs, faster storage.
>>>>>
>>>>>> I wonder if that's an argument for limiting the number of courtesy
>>>>>> clients.
>>>>> I think we might want to treat 4.0 clients a bit different from 4.1
>>>>> clients. With 4.0, every client will become a courtesy client after
>>>>> the client is done with the export and unmounts it.
>>>> It should be safe for a server to purge a client's lease immediately
>>>> if there is no open or lock state associated with it.
>>> In this case, each client has opened files so there are open states
>>> associated with them.
>>>
>>>> When an NFSv4.0 client unmounts, all files should be closed at that
>>>> point,
>>> I'm not sure pynfs does proper clean up after each subtest, I will
>>> check. There must be state associated with the client in order for
>>> it to become courtesy client.
>> Makes sense. Then a synthetic client like pynfs can DoS a courteous
>> server.
>>
>>
>>>> so the server can wait for the lease to expire and purge it
>>>> normally. Or am I missing something?
>>> When 4.0 client lease expires and there are still states associated
>>> with the client then the server allows this client to become courtesy
>>> client.
>> I think the same thing happens if an NFSv4.1 client neglects to send
>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken
>> or malicious, but the server faces the same issue of protecting
>> itself from a DoS attack.
>>
>> IMO you should consider limiting the number of courteous clients
>> the server can hold onto. Let's say that number is 1000. When the
>> server wants to turn a 1001st client into a courteous client, it
>> can simply expire and purge the oldest courteous client on its
>> list. Otherwise, over time, the 24-hour expiry will reduce the
>> set of courteous clients back to zero.
>>
>> What do you think?
>
> Limiting the number of courteous clients to handle the cases of
> broken/malicious 4.1 clients seems reasonable as the last resort.
>
> I think if a malicious 4.1 clients could mount the server's export,
> opens a file (to create state) and repeats the same with a different
> client id then it seems like some basic security was already broken;
> allowing unauthorized clients to mount server's exports.

You can do this today with AUTH_SYS. I consider it a genuine attack surface.


> I think if we have to enforce a limit, then it's only for handling
> of seriously buggy 4.1 clients which should not be the norm. The
> issue with this is how to pick an optimal number that is suitable
> for the running server which can be a very slow or a very fast server.
>
> Note that even if we impose an limit, that does not completely solve
> the problem with pynfs 4.0 test since its RPC timeout is configured
> with 15 secs which just enough to expire 277 clients based on 53ms
> for each client, unless we limit it ~270 clients which I think it's
> too low.
>
> This is what I plan to do:
>
> 1. do not support 4.0 courteous clients, for sure.

Not supporting 4.0 isn’t an option, IMHO. It is a fully supported protocol at this time, and the same exposure exists for 4.1; it’s just a little harder to exploit.

If you submit the courteous server patch without support for 4.0, I think it needs to include a plan for how 4.0 will be added later.


> 2. limit the number of courteous clients to 1000 (?), if you still
> think we need it.

I think this limit is necessary. It can be set based on the server’s physical memory size if a dynamic limit is desired.
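
A rough userspace C sketch of sizing such a limit from physical
memory; the divisor and bounds below are arbitrary examples, not
values from the patch:

/*
 * Rough sizing sketch -- the 8 MB-per-client divisor and the
 * 100/100000 bounds are made up for illustration.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long pages = sysconf(_SC_PHYS_PAGES);
	long page_size = sysconf(_SC_PAGESIZE);
	long long mem_bytes = (long long)pages * page_size;

	/* e.g. one courtesy client allowed per 8 MB of RAM, clamped */
	long long limit = mem_bytes / (8LL << 20);
	if (limit < 100)
		limit = 100;
	if (limit > 100000)
		limit = 100000;

	printf("physical memory: %lld MB, courtesy client limit: %lld\n",
	       mem_bytes >> 20, limit);
	return 0;
}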


> Pls let me know what you think.
>
> Thanks,
> -Dai
>
>>
>>
>>>>> Since there is
>>>>> no destroy session/client with 4.0, the courteous server allows the
>>>>> client to be around and becomes a courtesy client. So after awhile,
>>>>> even with normal usage, there will be lots 4.0 courtesy clients
>>>>> hanging around and these clients won't be destroyed until 24hrs
>>>>> later, or until they cause conflicts with other clients.
>>>>>
>>>>> We can reduce the courtesy_client_expiry time for 4.0 clients from
>>>>> 24hrs to 15/20 mins, enough for most network partition to heal?,
>>>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0
>>>>> clients at all which is my preference since I think in general users
>>>>> should skip 4.0 and use 4.1 instead.
>>>>>
>>>>> -Dai
>>>> --
>>>> Chuck Lever
>>>>
>>>>
>>>>
>> --
>> Chuck Lever
>>
>>
>>

2021-11-30 04:08:33

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
>
> > On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]> wrote:
> >
> > 
> > > On 11/29/21 1:10 PM, Chuck Lever III wrote:
> > >
> > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]>
> > > > > wrote:
> > > >
> > > >
> > > > On 11/29/21 11:03 AM, Chuck Lever III wrote:
> > > > > Hello Dai!
> > > > >
> > > > >
> > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > >
> > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote:
> > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800,
> > > > > > > [email protected] wrote:
> > > > > > > > Hi Bruce,
> > > > > > > >
> > > > > > > > On 11/21/21 7:04 PM, [email protected] wrote:
> > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800,
> > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > On 11/17/21 9:59 AM, [email protected] wrote:
> > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM -0800,
> > > > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > > > Just a reminder that this patch is still
> > > > > > > > > > > > > > waiting for your review.
> > > > > > > > > > > > > Yeah, I was procrastinating and hoping yo'ud
> > > > > > > > > > > > > figure out the pynfs
> > > > > > > > > > > > > failure for me....
> > > > > > > > > > > > Last time I ran 4.0 OPEN18 test by itself and
> > > > > > > > > > > > it passed. I will run
> > > > > > > > > > > > all OPEN tests together with 5.15-rc7 to see if
> > > > > > > > > > > > the problem you've
> > > > > > > > > > > > seen still there.
> > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0 with
> > > > > > > > > > > courteous and non-courteous
> > > > > > > > > > > 5.15-rc7 server.
> > > > > > > > > > >
> > > > > > > > > > > Nfs4.1 results are the same for both courteous
> > > > > > > > > > > and
> > > > > > > > > > > non-courteous server:
> > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0 Warned, 169
> > > > > > > > > > > > Passed
> > > > > > > > > > > Results of nfs4.0 with non-courteous server:
> > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0 Warned, 577
> > > > > > > > > > > > Passed
> > > > > > > > > > > test failed: LOCK24
> > > > > > > > > > >
> > > > > > > > > > > Results of nfs4.0 with courteous server:
> > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0 Warned, 575
> > > > > > > > > > > > Passed
> > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30
> > > > > > > > > > >
> > > > > > > > > > > OPEN18 and OPEN30 test pass if each is run by
> > > > > > > > > > > itself.
> > > > > > > > > > Could well be a bug in the tests, I don't know.
> > > > > > > > > The reason OPEN18 failed was because the test timed
> > > > > > > > > out waiting for
> > > > > > > > > the reply of an OPEN call. The RPC connection used
> > > > > > > > > for the test was
> > > > > > > > > configured with 15 secs timeout. Note that OPEN18
> > > > > > > > > only fails when
> > > > > > > > > the tests were run with 'all' option, this test
> > > > > > > > > passes if it's run
> > > > > > > > > by itself.
> > > > > > > > >
> > > > > > > > > With courteous server, by the time OPEN18 runs, there
> > > > > > > > > are about 1026
> > > > > > > > > courtesy 4.0 clients on the server and all of these
> > > > > > > > > clients have opened
> > > > > > > > > the same file X with WRITE access. These clients were
> > > > > > > > > created by the
> > > > > > > > > previous tests. After each test completed, since 4.0
> > > > > > > > > does not have
> > > > > > > > > session, the client states are not cleaned up
> > > > > > > > > immediately on the
> > > > > > > > > server and are allowed to become courtesy clients.
> > > > > > > > >
> > > > > > > > > When OPEN18 runs (about 20 minutes after the 1st test
> > > > > > > > > started), it
> > > > > > > > > sends OPEN of file X with OPEN4_SHARE_DENY_WRITE
> > > > > > > > > which causes the
> > > > > > > > > server to check for conflicts with courtesy clients.
> > > > > > > > > The loop that
> > > > > > > > > checks 1026 courtesy clients for share/access
> > > > > > > > > conflict took less
> > > > > > > > > than 1 sec. But it took about 55 secs, on my VM, for
> > > > > > > > > the server
> > > > > > > > > to expire all 1026 courtesy clients.
> > > > > > > > >
> > > > > > > > > I modified pynfs to configure the 4.0 RPC connection
> > > > > > > > > with 60 seconds
> > > > > > > > > timeout and OPEN18 now consistently passed. The 4.0
> > > > > > > > > test results are
> > > > > > > > > now the same for courteous and non-courteous server:
> > > > > > > > >
> > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> > > > > > > > >
> > > > > > > > > Note that 4.1 tests do not suffer this timeout
> > > > > > > > > problem because the
> > > > > > > > > 4.1 clients and sessions are destroyed after each
> > > > > > > > > test completes.
> > > > > > > > Do you want me to send the patch to increase the
> > > > > > > > timeout for pynfs?
> > > > > > > > or is there any other things you think we should do?
> > > > > > > I don't know.
> > > > > > >
> > > > > > > 55 seconds to clean up 1026 clients is about 50ms per
> > > > > > > client, which is
> > > > > > > pretty slow.  I wonder why.  I guess it's probably
> > > > > > > updating the stable
> > > > > > > storage information.  Is /var/lib/nfs/ on your server
> > > > > > > backed by a hard
> > > > > > > drive or an SSD or something else?
> > > > > > My server is a virtualbox VM that has 1 CPU, 4GB RAM and
> > > > > > 64GB of hard
> > > > > > disk. I think a production system that supports this many
> > > > > > clients should
> > > > > > have faster CPUs, faster storage.
> > > > > >
> > > > > > > I wonder if that's an argument for limiting the number of
> > > > > > > courtesy
> > > > > > > clients.
> > > > > > I think we might want to treat 4.0 clients a bit different
> > > > > > from 4.1
> > > > > > clients. With 4.0, every client will become a courtesy
> > > > > > client after
> > > > > > the client is done with the export and unmounts it.
> > > > > It should be safe for a server to purge a client's lease
> > > > > immediately
> > > > > if there is no open or lock state associated with it.
> > > > In this case, each client has opened files so there are open
> > > > states
> > > > associated with them.
> > > >
> > > > > When an NFSv4.0 client unmounts, all files should be closed
> > > > > at that
> > > > > point,
> > > > I'm not sure pynfs does proper clean up after each subtest, I
> > > > will
> > > > check. There must be state associated with the client in order
> > > > for
> > > > it to become courtesy client.
> > > Makes sense. Then a synthetic client like pynfs can DoS a
> > > courteous
> > > server.
> > >
> > >
> > > > > so the server can wait for the lease to expire and purge it
> > > > > normally. Or am I missing something?
> > > > When 4.0 client lease expires and there are still states
> > > > associated
> > > > with the client then the server allows this client to become
> > > > courtesy
> > > > client.
> > > I think the same thing happens if an NFSv4.1 client neglects to
> > > send
> > > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is
> > > broken
> > > or malicious, but the server faces the same issue of protecting
> > > itself from a DoS attack.
> > >
> > > IMO you should consider limiting the number of courteous clients
> > > the server can hold onto. Let's say that number is 1000. When the
> > > server wants to turn a 1001st client into a courteous client, it
> > > can simply expire and purge the oldest courteous client on its
> > > list. Otherwise, over time, the 24-hour expiry will reduce the
> > > set of courteous clients back to zero.
> > >
> > > What do you think?
> >
> > Limiting the number of courteous clients to handle the cases of
> > broken/malicious 4.1 clients seems reasonable as the last resort.
> >
> > I think if a malicious 4.1 clients could mount the server's export,
> > opens a file (to create state) and repeats the same with a
> > different
> > client id then it seems like some basic security was already
> > broken;
> > allowing unauthorized clients to mount server's exports.
>
> You can do this today with AUTH_SYS. I consider it a genuine attack
> surface.
>
>
> > I think if we have to enforce a limit, then it's only for handling
> > of seriously buggy 4.1 clients which should not be the norm. The
> > issue with this is how to pick an optimal number that is suitable
> > for the running server which can be a very slow or a very fast
> > server.
> >
> > Note that even if we impose an limit, that does not completely
> > solve
> > the problem with pynfs 4.0 test since its RPC timeout is configured
> > with 15 secs which just enough to expire 277 clients based on 53ms
> > for each client, unless we limit it ~270 clients which I think it's
> > too low.
> >
> > This is what I plan to do:
> >
> > 1. do not support 4.0 courteous clients, for sure.
>
> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported
> protocol at this time, and the same exposure exists for 4.1, it’s
> just a little harder to exploit.
>
> If you submit the courteous server patch without support for 4.0, I
> think it needs to include a plan for how 4.0 will be added later.
>
> >

Why is there a problem here? The requirements are the same for 4.0 and
4.1 (or 4.2). If the lease under which the courtesy lock was
established has expired, then that courtesy lock must be released if
some other client requests a lock that conflicts with the cached lock
(unless the client breaks the courtesy framework by renewing that
original lease before the conflict occurs). Otherwise, it is completely
up to the server when it decides to actually release the lock.

For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the
server when the client is actually done with the lease, making it easy
to determine when it is safe to release all the courtesy locks. However,
if the client does not send DESTROY_CLIENTID, then we're in the same
situation with 4.x (x>0) as we would be with bog-standard NFSv4.0. The
lease has expired, and so the courtesy locks are liable to be dropped.

At Hammerspace we have implemented courtesy locks, and our strategy is
that when a conflict occurs, we drop the entire set of courtesy locks
so that we don't have to deal with the "some locks were revoked"
scenario. The reason is that when we originally implemented courtesy
locks, the Linux NFSv4 client support for lock revocation was a lot
less sophisticated than today. My suggestion is that you might
therefore consider starting along this path, and then refining the
support to make revocation more nuanced once you are confident that the
coarser strategy is working as expected.
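
For concreteness, something like the following captures the coarser
policy as a minimal user-space sketch (the struct and function names
are invented for illustration only; this is not the nfsd code):

    #include <stdbool.h>
    #include <time.h>

    /* Illustrative model of a client that may hold courtesy state. */
    struct demo_client {
            time_t last_renewal;         /* last RENEW/SEQUENCE seen */
            bool   holds_courtesy_state; /* opens/locks kept past expiry */
    };

    #define DEMO_LEASE_SECONDS 90

    /*
     * Coarse policy: if a request conflicts with state held by a client
     * whose lease has expired, purge *all* of that client's state rather
     * than revoking only the one conflicting lock.
     */
    static bool demo_resolve_conflict(struct demo_client *holder, time_t now)
    {
            bool lease_expired =
                    now - holder->last_renewal > DEMO_LEASE_SECONDS;

            if (!lease_expired)
                    return false;  /* lease is live: the conflict stands */

            /* Drop every courtesy open/lock the holder still has. */
            holder->holds_courtesy_state = false;
            return true;           /* requester may now take the lock */
    }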

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2021-11-30 04:47:35

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


> On Nov 29, 2021, at 11:08 PM, Trond Myklebust <[email protected]> wrote:
>
> On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
>>
>>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]> wrote:
>>>
>>> 
>>>> On 11/29/21 1:10 PM, Chuck Lever III wrote:
>>>>
>>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]>
>>>>>> wrote:
>>>>>
>>>>>
>>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>>>>>> Hello Dai!
>>>>>>
>>>>>>
>>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800,
>>>>>>>> [email protected] wrote:
>>>>>>>>> Hi Bruce,
>>>>>>>>>
>>>>>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800,
>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800,
>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>> Just a reminder that this patch is still
>>>>>>>>>>>>>>> waiting for your review.
>>>>>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud
>>>>>>>>>>>>>> figure out the pynfs
>>>>>>>>>>>>>> failure for me....
>>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and
>>>>>>>>>>>>> it passed. I will run
>>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if
>>>>>>>>>>>>> the problem you've
>>>>>>>>>>>>> seen still there.
>>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with
>>>>>>>>>>>> courteous and non-courteous
>>>>>>>>>>>> 5.15-rc7 server.
>>>>>>>>>>>>
>>>>>>>>>>>> Nfs4.1 results are the same for both courteous
>>>>>>>>>>>> and
>>>>>>>>>>>> non-courteous server:
>>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169
>>>>>>>>>>>>> Passed
>>>>>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577
>>>>>>>>>>>>> Passed
>>>>>>>>>>>> test failed: LOCK24
>>>>>>>>>>>>
>>>>>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575
>>>>>>>>>>>>> Passed
>>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>>>>>
>>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by
>>>>>>>>>>>> itself.
>>>>>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>>>>>> The reason OPEN18 failed was because the test timed
>>>>>>>>>> out waiting for
>>>>>>>>>> the reply of an OPEN call. The RPC connection used
>>>>>>>>>> for the test was
>>>>>>>>>> configured with 15 secs timeout. Note that OPEN18
>>>>>>>>>> only fails when
>>>>>>>>>> the tests were run with 'all' option, this test
>>>>>>>>>> passes if it's run
>>>>>>>>>> by itself.
>>>>>>>>>>
>>>>>>>>>> With courteous server, by the time OPEN18 runs, there
>>>>>>>>>> are about 1026
>>>>>>>>>> courtesy 4.0 clients on the server and all of these
>>>>>>>>>> clients have opened
>>>>>>>>>> the same file X with WRITE access. These clients were
>>>>>>>>>> created by the
>>>>>>>>>> previous tests. After each test completed, since 4.0
>>>>>>>>>> does not have
>>>>>>>>>> session, the client states are not cleaned up
>>>>>>>>>> immediately on the
>>>>>>>>>> server and are allowed to become courtesy clients.
>>>>>>>>>>
>>>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st test
>>>>>>>>>> started), it
>>>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE
>>>>>>>>>> which causes the
>>>>>>>>>> server to check for conflicts with courtesy clients.
>>>>>>>>>> The loop that
>>>>>>>>>> checks 1026 courtesy clients for share/access
>>>>>>>>>> conflict took less
>>>>>>>>>> than 1 sec. But it took about 55 secs, on my VM, for
>>>>>>>>>> the server
>>>>>>>>>> to expire all 1026 courtesy clients.
>>>>>>>>>>
>>>>>>>>>> I modified pynfs to configure the 4.0 RPC connection
>>>>>>>>>> with 60 seconds
>>>>>>>>>> timeout and OPEN18 now consistently passed. The 4.0
>>>>>>>>>> test results are
>>>>>>>>>> now the same for courteous and non-courteous server:
>>>>>>>>>>
>>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>>>
>>>>>>>>>> Note that 4.1 tests do not suffer this timeout
>>>>>>>>>> problem because the
>>>>>>>>>> 4.1 clients and sessions are destroyed after each
>>>>>>>>>> test completes.
>>>>>>>>> Do you want me to send the patch to increase the
>>>>>>>>> timeout for pynfs?
>>>>>>>>> or is there any other things you think we should do?
>>>>>>>> I don't know.
>>>>>>>>
>>>>>>>> 55 seconds to clean up 1026 clients is about 50ms per
>>>>>>>> client, which is
>>>>>>>> pretty slow. I wonder why. I guess it's probably
>>>>>>>> updating the stable
>>>>>>>> storage information. Is /var/lib/nfs/ on your server
>>>>>>>> backed by a hard
>>>>>>>> drive or an SSD or something else?
>>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and
>>>>>>> 64GB of hard
>>>>>>> disk. I think a production system that supports this many
>>>>>>> clients should
>>>>>>> have faster CPUs, faster storage.
>>>>>>>
>>>>>>>> I wonder if that's an argument for limiting the number of
>>>>>>>> courtesy
>>>>>>>> clients.
>>>>>>> I think we might want to treat 4.0 clients a bit different
>>>>>>> from 4.1
>>>>>>> clients. With 4.0, every client will become a courtesy
>>>>>>> client after
>>>>>>> the client is done with the export and unmounts it.
>>>>>> It should be safe for a server to purge a client's lease
>>>>>> immediately
>>>>>> if there is no open or lock state associated with it.
>>>>> In this case, each client has opened files so there are open
>>>>> states
>>>>> associated with them.
>>>>>
>>>>>> When an NFSv4.0 client unmounts, all files should be closed
>>>>>> at that
>>>>>> point,
>>>>> I'm not sure pynfs does proper clean up after each subtest, I
>>>>> will
>>>>> check. There must be state associated with the client in order
>>>>> for
>>>>> it to become courtesy client.
>>>> Makes sense. Then a synthetic client like pynfs can DoS a
>>>> courteous
>>>> server.
>>>>
>>>>
>>>>>> so the server can wait for the lease to expire and purge it
>>>>>> normally. Or am I missing something?
>>>>> When 4.0 client lease expires and there are still states
>>>>> associated
>>>>> with the client then the server allows this client to become
>>>>> courtesy
>>>>> client.
>>>> I think the same thing happens if an NFSv4.1 client neglects to
>>>> send
>>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is
>>>> broken
>>>> or malicious, but the server faces the same issue of protecting
>>>> itself from a DoS attack.
>>>>
>>>> IMO you should consider limiting the number of courteous clients
>>>> the server can hold onto. Let's say that number is 1000. When the
>>>> server wants to turn a 1001st client into a courteous client, it
>>>> can simply expire and purge the oldest courteous client on its
>>>> list. Otherwise, over time, the 24-hour expiry will reduce the
>>>> set of courteous clients back to zero.
>>>>
>>>> What do you think?
>>>
>>> Limiting the number of courteous clients to handle the cases of
>>> broken/malicious 4.1 clients seems reasonable as the last resort.
>>>
>>> I think if a malicious 4.1 clients could mount the server's export,
>>> opens a file (to create state) and repeats the same with a
>>> different
>>> client id then it seems like some basic security was already
>>> broken;
>>> allowing unauthorized clients to mount server's exports.
>>
>> You can do this today with AUTH_SYS. I consider it a genuine attack
>> surface.
>>
>>
>>> I think if we have to enforce a limit, then it's only for handling
>>> of seriously buggy 4.1 clients which should not be the norm. The
>>> issue with this is how to pick an optimal number that is suitable
>>> for the running server which can be a very slow or a very fast
>>> server.
>>>
>>> Note that even if we impose an limit, that does not completely
>>> solve
>>> the problem with pynfs 4.0 test since its RPC timeout is configured
>>> with 15 secs which just enough to expire 277 clients based on 53ms
>>> for each client, unless we limit it ~270 clients which I think it's
>>> too low.
>>>
>>> This is what I plan to do:
>>>
>>> 1. do not support 4.0 courteous clients, for sure.
>>
>> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported
>> protocol at this time, and the same exposure exists for 4.1, it’s
>> just a little harder to exploit.
>>
>> If you submit the courteous server patch without support for 4.0, I
>> think it needs to include a plan for how 4.0 will be added later.
>>
>>>
>
> Why is there a problem here? The requirements are the same for 4.0 and
> 4.1 (or 4.2). If the lease under which the courtesy lock was
> established has expired, then that courtesy lock must be released if
> some other client requests a lock that conflicts with the cached lock
> (unless the client breaks the courtesy framework by renewing that
> original lease before the conflict occurs). Otherwise, it is completely
> up to the server when it decides to actually release the lock.
>
> For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the
> server when the client is actually done with the lease, making it easy
> to determine when it is safe to release all the courtesy locks. However
> if the client does not send DESTROY_CLIENTID, then we're in the same
> situation with 4.x (x>0) as we would be with bog standard NFSv4.0. The
> lease has expired, and so the courtesy locks are liable to being
> dropped.

I agree the situation is the same for all minor versions.


> At Hammerspace we have implemented courtesy locks, and our strategy is
> that when a conflict occurs, we drop the entire set of courtesy locks
> so that we don't have to deal with the "some locks were revoked"
> scenario. The reason is that when we originally implemented courtesy
> locks, the Linux NFSv4 client support for lock revocation was a lot
> less sophisticated than today. My suggestion is that you might
> therefore consider starting along this path, and then refining the
> support to make revocation more nuanced once you are confident that the
> coarser strategy is working as expected.

Dai’s implementation does all that, and takes the coarser approach at the moment. There are plans to explore the more nuanced behavior (by revoking only the conflicting lock instead of dropping the whole lease) after this initial work is merged.

The issue is there are certain pathological client behaviors (whether malicious or accidental) that can run the server out of resources, since it is holding onto lease state for a much longer time. We are simply trying to design a lease garbage collection scheme to meet that challenge.

I think limiting the number of courteous clients is a simple way to do this, but we could also shorten the courtesy lifetime as more clients enter that state, to ensure that they don’t overrun the server’s memory. Another approach might be to add a shrinker that purges the oldest courteous clients when the server comes under memory pressure.
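
As a rough picture of the cap-and-evict idea, here is a minimal
user-space sketch (the list structure and all names are invented for
illustration; the real nfsd client lists and locking look nothing like
this):

    #include <stdlib.h>

    #define MAX_COURTESY_CLIENTS 1000   /* example cap, not a tuned value */

    struct courtesy_entry {
            struct courtesy_entry *next;   /* singly linked, oldest first */
            int client_id;
    };

    struct courtesy_list {
            struct courtesy_entry *oldest; /* head: first to be purged */
            struct courtesy_entry *newest; /* tail: most recently added */
            int count;
    };

    /* Purge the client that has been in the courtesy state the longest. */
    static void purge_oldest(struct courtesy_list *list)
    {
            struct courtesy_entry *victim = list->oldest;

            if (!victim)
                    return;
            list->oldest = victim->next;
            if (!list->oldest)
                    list->newest = NULL;
            list->count--;
            free(victim);   /* real code would expunge its state first */
    }

    /* Admit a new courtesy client, evicting the oldest if at the cap. */
    static void admit_courtesy(struct courtesy_list *list, int client_id)
    {
            struct courtesy_entry *e = calloc(1, sizeof(*e));

            if (!e)
                    return;
            if (list->count >= MAX_COURTESY_CLIENTS)
                    purge_oldest(list);
            e->client_id = client_id;
            if (list->newest)
                    list->newest->next = e;
            else
                    list->oldest = e;
            list->newest = e;
            list->count++;
    }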


2021-11-30 04:57:17

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote:
>
> > On Nov 29, 2021, at 11:08 PM, Trond Myklebust
> > <[email protected]> wrote:
> >
> > On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
> > >
> > > > > On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]>
> > > > > wrote:
> > > >
> > > > 
> > > > > On 11/29/21 1:10 PM, Chuck Lever III wrote:
> > > > >
> > > > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]>
> > > > > > > wrote:
> > > > > >
> > > > > >
> > > > > > On 11/29/21 11:03 AM, Chuck Lever III wrote:
> > > > > > > Hello Dai!
> > > > > > >
> > > > > > >
> > > > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo
> > > > > > > > <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote:
> > > > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800,
> > > > > > > > > [email protected] wrote:
> > > > > > > > > > Hi Bruce,
> > > > > > > > > >
> > > > > > > > > > On 11/21/21 7:04 PM, [email protected] wrote:
> > > > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> > > > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800,
> > > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > > On 11/17/21 9:59 AM,
> > > > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> > > > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM -0800,
> > > > > > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > > > > > Just a reminder that this patch is
> > > > > > > > > > > > > > > > still
> > > > > > > > > > > > > > > > waiting for your review.
> > > > > > > > > > > > > > > Yeah, I was procrastinating and hoping
> > > > > > > > > > > > > > > yo'ud
> > > > > > > > > > > > > > > figure out the pynfs
> > > > > > > > > > > > > > > failure for me....
> > > > > > > > > > > > > > Last time I ran 4.0 OPEN18 test by itself
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > it passed. I will run
> > > > > > > > > > > > > > all OPEN tests together with 5.15-rc7 to
> > > > > > > > > > > > > > see if
> > > > > > > > > > > > > > the problem you've
> > > > > > > > > > > > > > seen still there.
> > > > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0 with
> > > > > > > > > > > > > courteous and non-courteous
> > > > > > > > > > > > > 5.15-rc7 server.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Nfs4.1 results are the same for both
> > > > > > > > > > > > > courteous
> > > > > > > > > > > > > and
> > > > > > > > > > > > > non-courteous server:
> > > > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0 Warned,
> > > > > > > > > > > > > > 169
> > > > > > > > > > > > > > Passed
> > > > > > > > > > > > > Results of nfs4.0 with non-courteous server:
> > > > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0 Warned,
> > > > > > > > > > > > > > 577
> > > > > > > > > > > > > > Passed
> > > > > > > > > > > > > test failed: LOCK24
> > > > > > > > > > > > >
> > > > > > > > > > > > > Results of nfs4.0 with courteous server:
> > > > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0 Warned,
> > > > > > > > > > > > > > 575
> > > > > > > > > > > > > > Passed
> > > > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30
> > > > > > > > > > > > >
> > > > > > > > > > > > > OPEN18 and OPEN30 test pass if each is run by
> > > > > > > > > > > > > itself.
> > > > > > > > > > > > Could well be a bug in the tests, I don't know.
> > > > > > > > > > > The reason OPEN18 failed was because the test
> > > > > > > > > > > timed
> > > > > > > > > > > out waiting for
> > > > > > > > > > > the reply of an OPEN call. The RPC connection
> > > > > > > > > > > used
> > > > > > > > > > > for the test was
> > > > > > > > > > > configured with 15 secs timeout. Note that OPEN18
> > > > > > > > > > > only fails when
> > > > > > > > > > > the tests were run with 'all' option, this test
> > > > > > > > > > > passes if it's run
> > > > > > > > > > > by itself.
> > > > > > > > > > >
> > > > > > > > > > > With courteous server, by the time OPEN18 runs,
> > > > > > > > > > > there
> > > > > > > > > > > are about 1026
> > > > > > > > > > > courtesy 4.0 clients on the server and all of
> > > > > > > > > > > these
> > > > > > > > > > > clients have opened
> > > > > > > > > > > the same file X with WRITE access. These clients
> > > > > > > > > > > were
> > > > > > > > > > > created by the
> > > > > > > > > > > previous tests. After each test completed, since
> > > > > > > > > > > 4.0
> > > > > > > > > > > does not have
> > > > > > > > > > > session, the client states are not cleaned up
> > > > > > > > > > > immediately on the
> > > > > > > > > > > server and are allowed to become courtesy
> > > > > > > > > > > clients.
> > > > > > > > > > >
> > > > > > > > > > > When OPEN18 runs (about 20 minutes after the 1st
> > > > > > > > > > > test
> > > > > > > > > > > started), it
> > > > > > > > > > > sends OPEN of file X with OPEN4_SHARE_DENY_WRITE
> > > > > > > > > > > which causes the
> > > > > > > > > > > server to check for conflicts with courtesy
> > > > > > > > > > > clients.
> > > > > > > > > > > The loop that
> > > > > > > > > > > checks 1026 courtesy clients for share/access
> > > > > > > > > > > conflict took less
> > > > > > > > > > > than 1 sec. But it took about 55 secs, on my VM,
> > > > > > > > > > > for
> > > > > > > > > > > the server
> > > > > > > > > > > to expire all 1026 courtesy clients.
> > > > > > > > > > >
> > > > > > > > > > > I modified pynfs to configure the 4.0 RPC
> > > > > > > > > > > connection
> > > > > > > > > > > with 60 seconds
> > > > > > > > > > > timeout and OPEN18 now consistently passed. The
> > > > > > > > > > > 4.0
> > > > > > > > > > > test results are
> > > > > > > > > > > now the same for courteous and non-courteous
> > > > > > > > > > > server:
> > > > > > > > > > >
> > > > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> > > > > > > > > > >
> > > > > > > > > > > Note that 4.1 tests do not suffer this timeout
> > > > > > > > > > > problem because the
> > > > > > > > > > > 4.1 clients and sessions are destroyed after each
> > > > > > > > > > > test completes.
> > > > > > > > > > Do you want me to send the patch to increase the
> > > > > > > > > > timeout for pynfs?
> > > > > > > > > > or is there any other things you think we should
> > > > > > > > > > do?
> > > > > > > > > I don't know.
> > > > > > > > >
> > > > > > > > > 55 seconds to clean up 1026 clients is about 50ms per
> > > > > > > > > client, which is
> > > > > > > > > pretty slow.  I wonder why.  I guess it's probably
> > > > > > > > > updating the stable
> > > > > > > > > storage information.  Is /var/lib/nfs/ on your server
> > > > > > > > > backed by a hard
> > > > > > > > > drive or an SSD or something else?
> > > > > > > > My server is a virtualbox VM that has 1 CPU, 4GB RAM
> > > > > > > > and
> > > > > > > > 64GB of hard
> > > > > > > > disk. I think a production system that supports this
> > > > > > > > many
> > > > > > > > clients should
> > > > > > > > have faster CPUs, faster storage.
> > > > > > > >
> > > > > > > > > I wonder if that's an argument for limiting the
> > > > > > > > > number of
> > > > > > > > > courtesy
> > > > > > > > > clients.
> > > > > > > > I think we might want to treat 4.0 clients a bit
> > > > > > > > different
> > > > > > > > from 4.1
> > > > > > > > clients. With 4.0, every client will become a courtesy
> > > > > > > > client after
> > > > > > > > the client is done with the export and unmounts it.
> > > > > > > It should be safe for a server to purge a client's lease
> > > > > > > immediately
> > > > > > > if there is no open or lock state associated with it.
> > > > > > In this case, each client has opened files so there are
> > > > > > open
> > > > > > states
> > > > > > associated with them.
> > > > > >
> > > > > > > When an NFSv4.0 client unmounts, all files should be
> > > > > > > closed
> > > > > > > at that
> > > > > > > point,
> > > > > > I'm not sure pynfs does proper clean up after each subtest,
> > > > > > I
> > > > > > will
> > > > > > check. There must be state associated with the client in
> > > > > > order
> > > > > > for
> > > > > > it to become courtesy client.
> > > > > Makes sense. Then a synthetic client like pynfs can DoS a
> > > > > courteous
> > > > > server.
> > > > >
> > > > >
> > > > > > > so the server can wait for the lease to expire and purge
> > > > > > > it
> > > > > > > normally. Or am I missing something?
> > > > > > When 4.0 client lease expires and there are still states
> > > > > > associated
> > > > > > with the client then the server allows this client to
> > > > > > become
> > > > > > courtesy
> > > > > > client.
> > > > > I think the same thing happens if an NFSv4.1 client neglects
> > > > > to
> > > > > send
> > > > > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is
> > > > > broken
> > > > > or malicious, but the server faces the same issue of
> > > > > protecting
> > > > > itself from a DoS attack.
> > > > >
> > > > > IMO you should consider limiting the number of courteous
> > > > > clients
> > > > > the server can hold onto. Let's say that number is 1000. When
> > > > > the
> > > > > server wants to turn a 1001st client into a courteous client,
> > > > > it
> > > > > can simply expire and purge the oldest courteous client on
> > > > > its
> > > > > list. Otherwise, over time, the 24-hour expiry will reduce
> > > > > the
> > > > > set of courteous clients back to zero.
> > > > >
> > > > > What do you think?
> > > >
> > > > Limiting the number of courteous clients to handle the cases of
> > > > broken/malicious 4.1 clients seems reasonable as the last
> > > > resort.
> > > >
> > > > I think if a malicious 4.1 clients could mount the server's
> > > > export,
> > > > opens a file (to create state) and repeats the same with a
> > > > different
> > > > client id then it seems like some basic security was already
> > > > broken;
> > > > allowing unauthorized clients to mount server's exports.
> > >
> > > You can do this today with AUTH_SYS. I consider it a genuine
> > > attack
> > > surface.
> > >
> > >
> > > > I think if we have to enforce a limit, then it's only for
> > > > handling
> > > > of seriously buggy 4.1 clients which should not be the norm.
> > > > The
> > > > issue with this is how to pick an optimal number that is
> > > > suitable
> > > > for the running server which can be a very slow or a very fast
> > > > server.
> > > >
> > > > Note that even if we impose an limit, that does not completely
> > > > solve
> > > > the problem with pynfs 4.0 test since its RPC timeout is
> > > > configured
> > > > with 15 secs which just enough to expire 277 clients based on
> > > > 53ms
> > > > for each client, unless we limit it ~270 clients which I think
> > > > it's
> > > > too low.
> > > >
> > > > This is what I plan to do:
> > > >
> > > > 1. do not support 4.0 courteous clients, for sure.
> > >
> > > Not supporting 4.0 isn’t an option, IMHO. It is a fully supported
> > > protocol at this time, and the same exposure exists for 4.1, it’s
> > > just a little harder to exploit.
> > >
> > > If you submit the courteous server patch without support for 4.0,
> > > I
> > > think it needs to include a plan for how 4.0 will be added later.
> > >
> > > >
> >
> > Why is there a problem here? The requirements are the same for 4.0
> > and
> > 4.1 (or 4.2). If the lease under which the courtesy lock was
> > established has expired, then that courtesy lock must be released
> > if
> > some other client requests a lock that conflicts with the cached
> > lock
> > (unless the client breaks the courtesy framework by renewing that
> > original lease before the conflict occurs). Otherwise, it is
> > completely
> > up to the server when it decides to actually release the lock.
> >
> > For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the
> > server when the client is actually done with the lease, making it
> > easy
> > to determine when it is safe to release all the courtesy locks.
> > However
> > if the client does not send DESTROY_CLIENTID, then we're in the
> > same
> > situation with 4.x (x>0) as we would be with bog standard NFSv4.0.
> > The
> > lease has expired, and so the courtesy locks are liable to being
> > dropped.
>
> I agree the situation is the same for all minor versions.
>
>
> > At Hammerspace we have implemented courtesy locks, and our strategy
> > is
> > that when a conflict occurs, we drop the entire set of courtesy
> > locks
> > so that we don't have to deal with the "some locks were revoked"
> > scenario. The reason is that when we originally implemented
> > courtesy
> > locks, the Linux NFSv4 client support for lock revocation was a lot
> > less sophisticated than today. My suggestion is that you might
> > therefore consider starting along this path, and then refining the
> > support to make revocation more nuanced once you are confident that
> > the
> > coarser strategy is working as expected.
>
> Dai’s implementation does all that, and takes the coarser approach at
> the moment. There are plans to explore the more nuanced behavior (by
> revoking only the conflicting lock instead of dropping the whole
> lease) after this initial work is merged.
>
> The issue is there are certain pathological client behaviors (whether
> malicious or accidental) that can run the server out of resources,
> since it is holding onto lease state for a much longer time. We are
> simply trying to design a lease garbage collection scheme to meet
> that challenge.
>
> I think limiting the number of courteous clients is a simple way to
> do this, but we could also shorten the courtesy lifetime as more
> clients enter that state, to ensure that they don’t overrun the
> server’s memory. Another approach might be to add a shrinker that
> purges the oldest courteous clients when the server comes under
> memory pressure.
>
>

We already have a scanner that tries to release all client state after
1 lease period. Just extend that to do it after 10 lease periods. If a
network partition hasn't recovered after 10 minutes, you probably have
bigger problems.

You can limit the number of clients as well, but that leads into a rat's
nest of other issues that have nothing to do with courtesy locks and
everything to do with the fact that any client can hold a lot of state.
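
In case it helps visualize the cutoff change, a minimal user-space
sketch of that scan predicate (all names are invented; the real scan
works on nfsd's own client structures and locking):

    #include <stdbool.h>
    #include <time.h>

    #define DEMO_LEASE_SECONDS 90
    #define DEMO_GRACE_PERIODS 10  /* keep courtesy state ~10 lease periods */

    struct scan_client {
            time_t last_renewal;
            bool   has_state;       /* still holds opens/locks */
    };

    /*
     * Periodic scan: a client whose last renewal is older than the cutoff
     * has its state released, whether or not a conflict ever occurred.
     */
    static bool demo_should_expire(const struct scan_client *clp, time_t now)
    {
            time_t cutoff =
                    now - (time_t)DEMO_GRACE_PERIODS * DEMO_LEASE_SECONDS;

            return clp->has_state && clp->last_renewal < cutoff;
    }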

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2021-11-30 07:13:49

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/29/21 5:42 PM, Chuck Lever III wrote:
>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]> wrote:
>>
>> 
>>> On 11/29/21 1:10 PM, Chuck Lever III wrote:
>>>
>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]> wrote:
>>>>
>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>>>>> Hello Dai!
>>>>>
>>>>>
>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, [email protected] wrote:
>>>>>>>> Hi Bruce,
>>>>>>>>
>>>>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, [email protected] wrote:
>>>>>>>>>>> On 11/17/21 9:59 AM, [email protected] wrote:
>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, [email protected] wrote:
>>>>>>>>>>>>>> Just a reminder that this patch is still waiting for your review.
>>>>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs
>>>>>>>>>>>>> failure for me....
>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run
>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've
>>>>>>>>>>>> seen still there.
>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous
>>>>>>>>>>> 5.15-rc7 server.
>>>>>>>>>>>
>>>>>>>>>>> Nfs4.1 results are the same for both courteous and
>>>>>>>>>>> non-courteous server:
>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
>>>>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>>>> test failed: LOCK24
>>>>>>>>>>>
>>>>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>>>>
>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself.
>>>>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>>>>> The reason OPEN18 failed was because the test timed out waiting for
>>>>>>>>> the reply of an OPEN call. The RPC connection used for the test was
>>>>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when
>>>>>>>>> the tests were run with 'all' option, this test passes if it's run
>>>>>>>>> by itself.
>>>>>>>>>
>>>>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026
>>>>>>>>> courtesy 4.0 clients on the server and all of these clients have opened
>>>>>>>>> the same file X with WRITE access. These clients were created by the
>>>>>>>>> previous tests. After each test completed, since 4.0 does not have
>>>>>>>>> session, the client states are not cleaned up immediately on the
>>>>>>>>> server and are allowed to become courtesy clients.
>>>>>>>>>
>>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it
>>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the
>>>>>>>>> server to check for conflicts with courtesy clients. The loop that
>>>>>>>>> checks 1026 courtesy clients for share/access conflict took less
>>>>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server
>>>>>>>>> to expire all 1026 courtesy clients.
>>>>>>>>>
>>>>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds
>>>>>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are
>>>>>>>>> now the same for courteous and non-courteous server:
>>>>>>>>>
>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>>
>>>>>>>>> Note that 4.1 tests do not suffer this timeout problem because the
>>>>>>>>> 4.1 clients and sessions are destroyed after each test completes.
>>>>>>>> Do you want me to send the patch to increase the timeout for pynfs?
>>>>>>>> or is there any other things you think we should do?
>>>>>>> I don't know.
>>>>>>>
>>>>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is
>>>>>>> pretty slow. I wonder why. I guess it's probably updating the stable
>>>>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard
>>>>>>> drive or an SSD or something else?
>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard
>>>>>> disk. I think a production system that supports this many clients should
>>>>>> have faster CPUs, faster storage.
>>>>>>
>>>>>>> I wonder if that's an argument for limiting the number of courtesy
>>>>>>> clients.
>>>>>> I think we might want to treat 4.0 clients a bit different from 4.1
>>>>>> clients. With 4.0, every client will become a courtesy client after
>>>>>> the client is done with the export and unmounts it.
>>>>> It should be safe for a server to purge a client's lease immediately
>>>>> if there is no open or lock state associated with it.
>>>> In this case, each client has opened files so there are open states
>>>> associated with them.
>>>>
>>>>> When an NFSv4.0 client unmounts, all files should be closed at that
>>>>> point,
>>>> I'm not sure pynfs does proper clean up after each subtest, I will
>>>> check. There must be state associated with the client in order for
>>>> it to become courtesy client.
>>> Makes sense. Then a synthetic client like pynfs can DoS a courteous
>>> server.
>>>
>>>
>>>>> so the server can wait for the lease to expire and purge it
>>>>> normally. Or am I missing something?
>>>> When 4.0 client lease expires and there are still states associated
>>>> with the client then the server allows this client to become courtesy
>>>> client.
>>> I think the same thing happens if an NFSv4.1 client neglects to send
>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken
>>> or malicious, but the server faces the same issue of protecting
>>> itself from a DoS attack.
>>>
>>> IMO you should consider limiting the number of courteous clients
>>> the server can hold onto. Let's say that number is 1000. When the
>>> server wants to turn a 1001st client into a courteous client, it
>>> can simply expire and purge the oldest courteous client on its
>>> list. Otherwise, over time, the 24-hour expiry will reduce the
>>> set of courteous clients back to zero.
>>>
>>> What do you think?
>> Limiting the number of courteous clients to handle the cases of
>> broken/malicious 4.1 clients seems reasonable as the last resort.
>>
>> I think if a malicious 4.1 clients could mount the server's export,
>> opens a file (to create state) and repeats the same with a different
>> client id then it seems like some basic security was already broken;
>> allowing unauthorized clients to mount server's exports.
> You can do this today with AUTH_SYS. I consider it a genuine attack surface.
>
>
>> I think if we have to enforce a limit, then it's only for handling
>> of seriously buggy 4.1 clients which should not be the norm. The
>> issue with this is how to pick an optimal number that is suitable
>> for the running server which can be a very slow or a very fast server.
>>
>> Note that even if we impose an limit, that does not completely solve
>> the problem with pynfs 4.0 test since its RPC timeout is configured
>> with 15 secs which just enough to expire 277 clients based on 53ms
>> for each client, unless we limit it ~270 clients which I think it's
>> too low.
>>
>> This is what I plan to do:
>>
>> 1. do not support 4.0 courteous clients, for sure.
> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported protocol at this time, and the same exposure exists for 4.1, it’s just a little harder to exploit.
>
> If you submit the courteous server patch without support for 4.0, I think it needs to include a plan for how 4.0 will be added later.

Seems like we should support both 4.0 and 4.x (x>=1) at the same time.

>
>
>> 2. limit the number of courteous clients to 1000 (?), if you still
>> think we need it.
> I think this limit is necessary. It can be set based on the server’s physical memory size if a dynamic limit is desired.

Just to be clear, the problem with pynfs 4.0 is that the server takes
~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms
per client. This causes the test to time out waiting for the RPC reply
of the OPEN that triggers the conflicts.

I don't know exactly where the time is spent in the process of expiring
a client. But as Bruce mentioned, it could be related to the time to
access /var/lib/nfs to remove the client's persistent record. I think
that is most likely the case because the number of states owned by each
client should be small, since each test is short and does simple ops.
So I think this problem is related to the number of clients and not the
number of states owned by the clients. This is not the memory shortage
problem due to too much state, which we plan to address after the
initial phase.

I'd vote for a static limit for now, say 1000 clients, to avoid
complicating the courteous server code for something that would rarely
happen.
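
A static cap could be enforced at the point where an expired client
would otherwise be kept around. A minimal sketch, assuming the simplest
policy of refusing new courtesy clients once the cap is hit (names are
invented; purging the oldest instead, as Chuck suggested, would be the
obvious variant):

    #include <stdbool.h>

    #define DEMO_COURTESY_LIMIT 1000   /* the static limit discussed above */

    static int demo_courtesy_count;    /* courtesy clients currently held */

    /*
     * Called when a client's lease has expired but it still holds state.
     * Returns true if the client may be kept as a courtesy client, false
     * if it should be expunged immediately because the cap is reached.
     */
    static bool demo_may_become_courtesy(void)
    {
            if (demo_courtesy_count >= DEMO_COURTESY_LIMIT)
                    return false;
            demo_courtesy_count++;
            return true;
    }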

-Dai

>
>
>> Pls let me know what you think.
>>
>> Thanks,
>> -Dai
>>
>>>
>>>>>> Since there is
>>>>>> no destroy session/client with 4.0, the courteous server allows the
>>>>>> client to be around and becomes a courtesy client. So after awhile,
>>>>>> even with normal usage, there will be lots 4.0 courtesy clients
>>>>>> hanging around and these clients won't be destroyed until 24hrs
>>>>>> later, or until they cause conflicts with other clients.
>>>>>>
>>>>>> We can reduce the courtesy_client_expiry time for 4.0 clients from
>>>>>> 24hrs to 15/20 mins, enough for most network partition to heal?,
>>>>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0
>>>>>> clients at all which is my preference since I think in general users
>>>>>> should skip 4.0 and use 4.1 instead.
>>>>>>
>>>>>> -Dai
>>>>> --
>>>>> Chuck Lever
>>>>>
>>>>>
>>>>>
>>> --
>>> Chuck Lever
>>>
>>>
>>>

2021-11-30 07:22:44

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/29/21 8:57 PM, Trond Myklebust wrote:
> On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote:
>>> On Nov 29, 2021, at 11:08 PM, Trond Myklebust
>>> <[email protected]> wrote:
>>>
>>> On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
>>>>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]>
>>>>>> wrote:
>>>>> 
>>>>>> On 11/29/21 1:10 PM, Chuck Lever III wrote:
>>>>>>
>>>>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]>
>>>>>>>> wrote:
>>>>>>>
>>>>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>>>>>>>> Hello Dai!
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo
>>>>>>>>> <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800,
>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>> Hi Bruce,
>>>>>>>>>>>
>>>>>>>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800,
>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>> On 11/17/21 9:59 AM,
>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800,
>>>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>>>> Just a reminder that this patch is
>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>> waiting for your review.
>>>>>>>>>>>>>>>> Yeah, I was procrastinating and hoping
>>>>>>>>>>>>>>>> yo'ud
>>>>>>>>>>>>>>>> figure out the pynfs
>>>>>>>>>>>>>>>> failure for me....
>>>>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> it passed. I will run
>>>>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to
>>>>>>>>>>>>>>> see if
>>>>>>>>>>>>>>> the problem you've
>>>>>>>>>>>>>>> seen still there.
>>>>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with
>>>>>>>>>>>>>> courteous and non-courteous
>>>>>>>>>>>>>> 5.15-rc7 server.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nfs4.1 results are the same for both
>>>>>>>>>>>>>> courteous
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> non-courteous server:
>>>>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned,
>>>>>>>>>>>>>>> 169
>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned,
>>>>>>>>>>>>>>> 577
>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>> test failed: LOCK24
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned,
>>>>>>>>>>>>>>> 575
>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by
>>>>>>>>>>>>>> itself.
>>>>>>>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>>>>>>>> The reason OPEN18 failed was because the test
>>>>>>>>>>>> timed
>>>>>>>>>>>> out waiting for
>>>>>>>>>>>> the reply of an OPEN call. The RPC connection
>>>>>>>>>>>> used
>>>>>>>>>>>> for the test was
>>>>>>>>>>>> configured with 15 secs timeout. Note that OPEN18
>>>>>>>>>>>> only fails when
>>>>>>>>>>>> the tests were run with 'all' option, this test
>>>>>>>>>>>> passes if it's run
>>>>>>>>>>>> by itself.
>>>>>>>>>>>>
>>>>>>>>>>>> With courteous server, by the time OPEN18 runs,
>>>>>>>>>>>> there
>>>>>>>>>>>> are about 1026
>>>>>>>>>>>> courtesy 4.0 clients on the server and all of
>>>>>>>>>>>> these
>>>>>>>>>>>> clients have opened
>>>>>>>>>>>> the same file X with WRITE access. These clients
>>>>>>>>>>>> were
>>>>>>>>>>>> created by the
>>>>>>>>>>>> previous tests. After each test completed, since
>>>>>>>>>>>> 4.0
>>>>>>>>>>>> does not have
>>>>>>>>>>>> session, the client states are not cleaned up
>>>>>>>>>>>> immediately on the
>>>>>>>>>>>> server and are allowed to become courtesy
>>>>>>>>>>>> clients.
>>>>>>>>>>>>
>>>>>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st
>>>>>>>>>>>> test
>>>>>>>>>>>> started), it
>>>>>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE
>>>>>>>>>>>> which causes the
>>>>>>>>>>>> server to check for conflicts with courtesy
>>>>>>>>>>>> clients.
>>>>>>>>>>>> The loop that
>>>>>>>>>>>> checks 1026 courtesy clients for share/access
>>>>>>>>>>>> conflict took less
>>>>>>>>>>>> than 1 sec. But it took about 55 secs, on my VM,
>>>>>>>>>>>> for
>>>>>>>>>>>> the server
>>>>>>>>>>>> to expire all 1026 courtesy clients.
>>>>>>>>>>>>
>>>>>>>>>>>> I modified pynfs to configure the 4.0 RPC
>>>>>>>>>>>> connection
>>>>>>>>>>>> with 60 seconds
>>>>>>>>>>>> timeout and OPEN18 now consistently passed. The
>>>>>>>>>>>> 4.0
>>>>>>>>>>>> test results are
>>>>>>>>>>>> now the same for courteous and non-courteous
>>>>>>>>>>>> server:
>>>>>>>>>>>>
>>>>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>>>>>
>>>>>>>>>>>> Note that 4.1 tests do not suffer this timeout
>>>>>>>>>>>> problem because the
>>>>>>>>>>>> 4.1 clients and sessions are destroyed after each
>>>>>>>>>>>> test completes.
>>>>>>>>>>> Do you want me to send the patch to increase the
>>>>>>>>>>> timeout for pynfs?
>>>>>>>>>>> or is there any other things you think we should
>>>>>>>>>>> do?
>>>>>>>>>> I don't know.
>>>>>>>>>>
>>>>>>>>>> 55 seconds to clean up 1026 clients is about 50ms per
>>>>>>>>>> client, which is
>>>>>>>>>> pretty slow.  I wonder why.  I guess it's probably
>>>>>>>>>> updating the stable
>>>>>>>>>> storage information.  Is /var/lib/nfs/ on your server
>>>>>>>>>> backed by a hard
>>>>>>>>>> drive or an SSD or something else?
>>>>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM
>>>>>>>>> and
>>>>>>>>> 64GB of hard
>>>>>>>>> disk. I think a production system that supports this
>>>>>>>>> many
>>>>>>>>> clients should
>>>>>>>>> have faster CPUs, faster storage.
>>>>>>>>>
>>>>>>>>>> I wonder if that's an argument for limiting the
>>>>>>>>>> number of
>>>>>>>>>> courtesy
>>>>>>>>>> clients.
>>>>>>>>> I think we might want to treat 4.0 clients a bit
>>>>>>>>> different
>>>>>>>>> from 4.1
>>>>>>>>> clients. With 4.0, every client will become a courtesy
>>>>>>>>> client after
>>>>>>>>> the client is done with the export and unmounts it.
>>>>>>>> It should be safe for a server to purge a client's lease
>>>>>>>> immediately
>>>>>>>> if there is no open or lock state associated with it.
>>>>>>> In this case, each client has opened files so there are
>>>>>>> open
>>>>>>> states
>>>>>>> associated with them.
>>>>>>>
>>>>>>>> When an NFSv4.0 client unmounts, all files should be
>>>>>>>> closed
>>>>>>>> at that
>>>>>>>> point,
>>>>>>> I'm not sure pynfs does proper clean up after each subtest,
>>>>>>> I
>>>>>>> will
>>>>>>> check. There must be state associated with the client in
>>>>>>> order
>>>>>>> for
>>>>>>> it to become courtesy client.
>>>>>> Makes sense. Then a synthetic client like pynfs can DoS a
>>>>>> courteous
>>>>>> server.
>>>>>>
>>>>>>
>>>>>>>> so the server can wait for the lease to expire and purge
>>>>>>>> it
>>>>>>>> normally. Or am I missing something?
>>>>>>> When 4.0 client lease expires and there are still states
>>>>>>> associated
>>>>>>> with the client then the server allows this client to
>>>>>>> become
>>>>>>> courtesy
>>>>>>> client.
>>>>>> I think the same thing happens if an NFSv4.1 client neglects
>>>>>> to
>>>>>> send
>>>>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is
>>>>>> broken
>>>>>> or malicious, but the server faces the same issue of
>>>>>> protecting
>>>>>> itself from a DoS attack.
>>>>>>
>>>>>> IMO you should consider limiting the number of courteous
>>>>>> clients
>>>>>> the server can hold onto. Let's say that number is 1000. When
>>>>>> the
>>>>>> server wants to turn a 1001st client into a courteous client,
>>>>>> it
>>>>>> can simply expire and purge the oldest courteous client on
>>>>>> its
>>>>>> list. Otherwise, over time, the 24-hour expiry will reduce
>>>>>> the
>>>>>> set of courteous clients back to zero.
>>>>>>
>>>>>> What do you think?
>>>>> Limiting the number of courteous clients to handle the cases of
>>>>> broken/malicious 4.1 clients seems reasonable as the last
>>>>> resort.
>>>>>
>>>>> I think if a malicious 4.1 clients could mount the server's
>>>>> export,
>>>>> opens a file (to create state) and repeats the same with a
>>>>> different
>>>>> client id then it seems like some basic security was already
>>>>> broken;
>>>>> allowing unauthorized clients to mount server's exports.
>>>> You can do this today with AUTH_SYS. I consider it a genuine
>>>> attack
>>>> surface.
>>>>
>>>>
>>>>> I think if we have to enforce a limit, then it's only for
>>>>> handling
>>>>> of seriously buggy 4.1 clients which should not be the norm.
>>>>> The
>>>>> issue with this is how to pick an optimal number that is
>>>>> suitable
>>>>> for the running server which can be a very slow or a very fast
>>>>> server.
>>>>>
>>>>> Note that even if we impose an limit, that does not completely
>>>>> solve
>>>>> the problem with pynfs 4.0 test since its RPC timeout is
>>>>> configured
>>>>> with 15 secs which just enough to expire 277 clients based on
>>>>> 53ms
>>>>> for each client, unless we limit it ~270 clients which I think
>>>>> it's
>>>>> too low.
>>>>>
>>>>> This is what I plan to do:
>>>>>
>>>>> 1. do not support 4.0 courteous clients, for sure.
>>>> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported
>>>> protocol at this time, and the same exposure exists for 4.1, it’s
>>>> just a little harder to exploit.
>>>>
>>>> If you submit the courteous server patch without support for 4.0,
>>>> I
>>>> think it needs to include a plan for how 4.0 will be added later.
>>>>
>>> Why is there a problem here? The requirements are the same for 4.0
>>> and
>>> 4.1 (or 4.2). If the lease under which the courtesy lock was
>>> established has expired, then that courtesy lock must be released
>>> if
>>> some other client requests a lock that conflicts with the cached
>>> lock
>>> (unless the client breaks the courtesy framework by renewing that
>>> original lease before the conflict occurs). Otherwise, it is
>>> completely
>>> up to the server when it decides to actually release the lock.
>>>
>>> For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the
>>> server when the client is actually done with the lease, making it
>>> easy
>>> to determine when it is safe to release all the courtesy locks.
>>> However
>>> if the client does not send DESTROY_CLIENTID, then we're in the
>>> same
>>> situation with 4.x (x>0) as we would be with bog standard NFSv4.0.
>>> The
>>> lease has expired, and so the courtesy locks are liable to being
>>> dropped.
>> I agree the situation is the same for all minor versions.
>>
>>
>>> At Hammerspace we have implemented courtesy locks, and our strategy
>>> is
>>> that when a conflict occurs, we drop the entire set of courtesy
>>> locks
>>> so that we don't have to deal with the "some locks were revoked"
>>> scenario. The reason is that when we originally implemented
>>> courtesy
>>> locks, the Linux NFSv4 client support for lock revocation was a lot
>>> less sophisticated than today. My suggestion is that you might
>>> therefore consider starting along this path, and then refining the
>>> support to make revocation more nuanced once you are confident that
>>> the
>>> coarser strategy is working as expected.
>> Dai’s implementation does all that, and takes the coarser approach at
>> the moment. There are plans to explore the more nuanced behavior (by
>> revoking only the conflicting lock instead of dropping the whole
>> lease) after this initial work is merged.
>>
>> The issue is there are certain pathological client behaviors (whether
>> malicious or accidental) that can run the server out of resources,
>> since it is holding onto lease state for a much longer time. We are
>> simply trying to design a lease garbage collection scheme to meet
>> that challenge.
>>
>> I think limiting the number of courteous clients is a simple way to
>> do this, but we could also shorten the courtesy lifetime as more
>> clients enter that state, to ensure that they don’t overrun the
>> server’s memory. Another approach might be to add a shrinker that
>> purges the oldest courteous clients when the server comes under
>> memory pressure.
>>
>>
> We already have a scanner that tries to release all client state after
> 1 lease period. Just extend that to do it after 10 lease periods. If a
> network partition hasn't recovered after 10 minutes, you probably have
> bigger problems.

Currently the courteous server allows 24hr for the network partition to
heal before releasing all client state. That seems excessive, but it was
suggested to cover longer network partitions, for example while
switches/routers are being repaired or upgraded.

>
> You can limit the number of clients as well, but that leads into a rats
> nest of other issues that have nothing to do with courtesy locks and
> everything to do with the fact that any client can hold a lot of state.

The issue we currently have with the courteous server and the pynfs 4.0
tests is the number of courteous 4.0 clients the server has to expire
when a share reservation conflict occurs while servicing the OPEN. Each
client owns only a few states in this case, so we think the server
spends most of the time deleting the client's record in /var/lib/nfs.
This is why we plan to limit the number of courteous clients for now.
As a side effect, it might also help reduce resource consumption.

-Dai

>

2021-11-30 13:38:02

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Mon, 2021-11-29 at 23:22 -0800, [email protected] wrote:
>
> On 11/29/21 8:57 PM, Trond Myklebust wrote:
> > On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote:
> > > > On Nov 29, 2021, at 11:08 PM, Trond Myklebust
> > > > <[email protected]> wrote:
> > > >
> > > > On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
> > > > > > > On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]>
> > > > > > > wrote:
> > > > > > 
> > > > > > > On 11/29/21 1:10 PM, Chuck Lever III wrote:
> > > > > > >
> > > > > > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo
> > > > > > > > > <[email protected]>
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > On 11/29/21 11:03 AM, Chuck Lever III wrote:
> > > > > > > > > Hello Dai!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo
> > > > > > > > > > <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote:
> > > > > > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800,
> > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > Hi Bruce,
> > > > > > > > > > > >
> > > > > > > > > > > > On 11/21/21 7:04 PM, [email protected] wrote:
> > > > > > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> > > > > > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800,
> > > > > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > > > > On 11/17/21 9:59 AM,
> > > > > > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM -
> > > > > > > > > > > > > > > > > 0800,
> > > > > > > > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > > > > > > > > > Just a reminder that this patch is
> > > > > > > > > > > > > > > > > > still
> > > > > > > > > > > > > > > > > > waiting for your review.
> > > > > > > > > > > > > > > > > Yeah, I was procrastinating and
> > > > > > > > > > > > > > > > > hoping
> > > > > > > > > > > > > > > > > yo'ud
> > > > > > > > > > > > > > > > > figure out the pynfs
> > > > > > > > > > > > > > > > > failure for me....
> > > > > > > > > > > > > > > > Last time I ran 4.0 OPEN18 test by
> > > > > > > > > > > > > > > > itself
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > it passed. I will run
> > > > > > > > > > > > > > > > all OPEN tests together with 5.15-rc7
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > see if
> > > > > > > > > > > > > > > > the problem you've
> > > > > > > > > > > > > > > > seen still there.
> > > > > > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0
> > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > courteous and non-courteous
> > > > > > > > > > > > > > > 5.15-rc7 server.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Nfs4.1 results are the same for both
> > > > > > > > > > > > > > > courteous
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > non-courteous server:
> > > > > > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0
> > > > > > > > > > > > > > > > Warned,
> > > > > > > > > > > > > > > > 169
> > > > > > > > > > > > > > > > Passed
> > > > > > > > > > > > > > > Results of nfs4.0 with non-courteous
> > > > > > > > > > > > > > > server:
> > > > > > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0
> > > > > > > > > > > > > > > > Warned,
> > > > > > > > > > > > > > > > 577
> > > > > > > > > > > > > > > > Passed
> > > > > > > > > > > > > > > test failed: LOCK24
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Results of nfs4.0 with courteous server:
> > > > > > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0
> > > > > > > > > > > > > > > > Warned,
> > > > > > > > > > > > > > > > 575
> > > > > > > > > > > > > > > > Passed
> > > > > > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > OPEN18 and OPEN30 test pass if each is
> > > > > > > > > > > > > > > run by
> > > > > > > > > > > > > > > itself.
> > > > > > > > > > > > > > Could well be a bug in the tests, I don't
> > > > > > > > > > > > > > know.
> > > > > > > > > > > > > The reason OPEN18 failed was because the test
> > > > > > > > > > > > > timed
> > > > > > > > > > > > > out waiting for
> > > > > > > > > > > > > the reply of an OPEN call. The RPC connection
> > > > > > > > > > > > > used
> > > > > > > > > > > > > for the test was
> > > > > > > > > > > > > configured with 15 secs timeout. Note that
> > > > > > > > > > > > > OPEN18
> > > > > > > > > > > > > only fails when
> > > > > > > > > > > > > the tests were run with 'all' option, this
> > > > > > > > > > > > > test
> > > > > > > > > > > > > passes if it's run
> > > > > > > > > > > > > by itself.
> > > > > > > > > > > > >
> > > > > > > > > > > > > With courteous server, by the time OPEN18
> > > > > > > > > > > > > runs,
> > > > > > > > > > > > > there
> > > > > > > > > > > > > are about 1026
> > > > > > > > > > > > > courtesy 4.0 clients on the server and all of
> > > > > > > > > > > > > these
> > > > > > > > > > > > > clients have opened
> > > > > > > > > > > > > the same file X with WRITE access. These
> > > > > > > > > > > > > clients
> > > > > > > > > > > > > were
> > > > > > > > > > > > > created by the
> > > > > > > > > > > > > previous tests. After each test completed,
> > > > > > > > > > > > > since
> > > > > > > > > > > > > 4.0
> > > > > > > > > > > > > does not have
> > > > > > > > > > > > > session, the client states are not cleaned up
> > > > > > > > > > > > > immediately on the
> > > > > > > > > > > > > server and are allowed to become courtesy
> > > > > > > > > > > > > clients.
> > > > > > > > > > > > >
> > > > > > > > > > > > > When OPEN18 runs (about 20 minutes after the
> > > > > > > > > > > > > 1st
> > > > > > > > > > > > > test
> > > > > > > > > > > > > started), it
> > > > > > > > > > > > > sends OPEN of file X with
> > > > > > > > > > > > > OPEN4_SHARE_DENY_WRITE
> > > > > > > > > > > > > which causes the
> > > > > > > > > > > > > server to check for conflicts with courtesy
> > > > > > > > > > > > > clients.
> > > > > > > > > > > > > The loop that
> > > > > > > > > > > > > checks 1026 courtesy clients for share/access
> > > > > > > > > > > > > conflict took less
> > > > > > > > > > > > > than 1 sec. But it took about 55 secs, on my
> > > > > > > > > > > > > VM,
> > > > > > > > > > > > > for
> > > > > > > > > > > > > the server
> > > > > > > > > > > > > to expire all 1026 courtesy clients.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I modified pynfs to configure the 4.0 RPC
> > > > > > > > > > > > > connection
> > > > > > > > > > > > > with 60 seconds
> > > > > > > > > > > > > timeout and OPEN18 now consistently passed.
> > > > > > > > > > > > > The
> > > > > > > > > > > > > 4.0
> > > > > > > > > > > > > test results are
> > > > > > > > > > > > > now the same for courteous and non-courteous
> > > > > > > > > > > > > server:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> > > > > > > > > > > > >
> > > > > > > > > > > > > Note that 4.1 tests do not suffer this
> > > > > > > > > > > > > timeout
> > > > > > > > > > > > > problem because the
> > > > > > > > > > > > > 4.1 clients and sessions are destroyed after
> > > > > > > > > > > > > each
> > > > > > > > > > > > > test completes.
> > > > > > > > > > > > Do you want me to send the patch to increase
> > > > > > > > > > > > the
> > > > > > > > > > > > timeout for pynfs?
> > > > > > > > > > > > or is there any other things you think we
> > > > > > > > > > > > should
> > > > > > > > > > > > do?
> > > > > > > > > > > I don't know.
> > > > > > > > > > >
> > > > > > > > > > > 55 seconds to clean up 1026 clients is about 50ms
> > > > > > > > > > > per
> > > > > > > > > > > client, which is
> > > > > > > > > > > pretty slow.  I wonder why.  I guess it's
> > > > > > > > > > > probably
> > > > > > > > > > > updating the stable
> > > > > > > > > > > storage information.  Is /var/lib/nfs/ on your
> > > > > > > > > > > server
> > > > > > > > > > > backed by a hard
> > > > > > > > > > > drive or an SSD or something else?
> > > > > > > > > > My server is a virtualbox VM that has 1 CPU, 4GB
> > > > > > > > > > RAM
> > > > > > > > > > and
> > > > > > > > > > 64GB of hard
> > > > > > > > > > disk. I think a production system that supports
> > > > > > > > > > this
> > > > > > > > > > many
> > > > > > > > > > clients should
> > > > > > > > > > have faster CPUs, faster storage.
> > > > > > > > > >
> > > > > > > > > > > I wonder if that's an argument for limiting the
> > > > > > > > > > > number of
> > > > > > > > > > > courtesy
> > > > > > > > > > > clients.
> > > > > > > > > > I think we might want to treat 4.0 clients a bit
> > > > > > > > > > different
> > > > > > > > > > from 4.1
> > > > > > > > > > clients. With 4.0, every client will become a
> > > > > > > > > > courtesy
> > > > > > > > > > client after
> > > > > > > > > > the client is done with the export and unmounts it.
> > > > > > > > > It should be safe for a server to purge a client's
> > > > > > > > > lease
> > > > > > > > > immediately
> > > > > > > > > if there is no open or lock state associated with it.
> > > > > > > > In this case, each client has opened files so there are
> > > > > > > > open
> > > > > > > > states
> > > > > > > > associated with them.
> > > > > > > >
> > > > > > > > > When an NFSv4.0 client unmounts, all files should be
> > > > > > > > > closed
> > > > > > > > > at that
> > > > > > > > > point,
> > > > > > > > I'm not sure pynfs does proper clean up after each
> > > > > > > > subtest,
> > > > > > > > I
> > > > > > > > will
> > > > > > > > check. There must be state associated with the client
> > > > > > > > in
> > > > > > > > order
> > > > > > > > for
> > > > > > > > it to become courtesy client.
> > > > > > > Makes sense. Then a synthetic client like pynfs can DoS a
> > > > > > > courteous
> > > > > > > server.
> > > > > > >
> > > > > > >
> > > > > > > > > so the server can wait for the lease to expire and
> > > > > > > > > purge
> > > > > > > > > it
> > > > > > > > > normally. Or am I missing something?
> > > > > > > > When 4.0 client lease expires and there are still
> > > > > > > > states
> > > > > > > > associated
> > > > > > > > with the client then the server allows this client to
> > > > > > > > become
> > > > > > > > courtesy
> > > > > > > > client.
> > > > > > > I think the same thing happens if an NFSv4.1 client
> > > > > > > neglects
> > > > > > > to
> > > > > > > send
> > > > > > > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client
> > > > > > > is
> > > > > > > broken
> > > > > > > or malicious, but the server faces the same issue of
> > > > > > > protecting
> > > > > > > itself from a DoS attack.
> > > > > > >
> > > > > > > IMO you should consider limiting the number of courteous
> > > > > > > clients
> > > > > > > the server can hold onto. Let's say that number is 1000.
> > > > > > > When
> > > > > > > the
> > > > > > > server wants to turn a 1001st client into a courteous
> > > > > > > client,
> > > > > > > it
> > > > > > > can simply expire and purge the oldest courteous client
> > > > > > > on
> > > > > > > its
> > > > > > > list. Otherwise, over time, the 24-hour expiry will
> > > > > > > reduce
> > > > > > > the
> > > > > > > set of courteous clients back to zero.
> > > > > > >
> > > > > > > What do you think?
> > > > > > Limiting the number of courteous clients to handle the
> > > > > > cases of
> > > > > > broken/malicious 4.1 clients seems reasonable as the last
> > > > > > resort.
> > > > > >
> > > > > > I think if a malicious 4.1 clients could mount the server's
> > > > > > export,
> > > > > > opens a file (to create state) and repeats the same with a
> > > > > > different
> > > > > > client id then it seems like some basic security was
> > > > > > already
> > > > > > broken;
> > > > > > allowing unauthorized clients to mount server's exports.
> > > > > You can do this today with AUTH_SYS. I consider it a genuine
> > > > > attack
> > > > > surface.
> > > > >
> > > > >
> > > > > > I think if we have to enforce a limit, then it's only for
> > > > > > handling
> > > > > > of seriously buggy 4.1 clients which should not be the
> > > > > > norm.
> > > > > > The
> > > > > > issue with this is how to pick an optimal number that is
> > > > > > suitable
> > > > > > for the running server which can be a very slow or a very
> > > > > > fast
> > > > > > server.
> > > > > >
> > > > > > Note that even if we impose an limit, that does not
> > > > > > completely
> > > > > > solve
> > > > > > the problem with pynfs 4.0 test since its RPC timeout is
> > > > > > configured
> > > > > > with 15 secs which just enough to expire 277 clients based
> > > > > > on
> > > > > > 53ms
> > > > > > for each client, unless we limit it ~270 clients which I
> > > > > > think
> > > > > > it's
> > > > > > too low.
> > > > > >
> > > > > > This is what I plan to do:
> > > > > >
> > > > > > 1. do not support 4.0 courteous clients, for sure.
> > > > > Not supporting 4.0 isn’t an option, IMHO. It is a fully
> > > > > supported
> > > > > protocol at this time, and the same exposure exists for 4.1,
> > > > > it’s
> > > > > just a little harder to exploit.
> > > > >
> > > > > If you submit the courteous server patch without support for
> > > > > 4.0,
> > > > > I
> > > > > think it needs to include a plan for how 4.0 will be added
> > > > > later.
> > > > >
> > > > Why is there a problem here? The requirements are the same for
> > > > 4.0
> > > > and
> > > > 4.1 (or 4.2). If the lease under which the courtesy lock was
> > > > established has expired, then that courtesy lock must be
> > > > released
> > > > if
> > > > some other client requests a lock that conflicts with the
> > > > cached
> > > > lock
> > > > (unless the client breaks the courtesy framework by renewing
> > > > that
> > > > original lease before the conflict occurs). Otherwise, it is
> > > > completely
> > > > up to the server when it decides to actually release the lock.
> > > >
> > > > For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells
> > > > the
> > > > server when the client is actually done with the lease, making
> > > > it
> > > > easy
> > > > to determine when it is safe to release all the courtesy locks.
> > > > However
> > > > if the client does not send DESTROY_CLIENTID, then we're in the
> > > > same
> > > > situation with 4.x (x>0) as we would be with bog standard
> > > > NFSv4.0.
> > > > The
> > > > lease has expired, and so the courtesy locks are liable to
> > > > being
> > > > dropped.
> > > I agree the situation is the same for all minor versions.
> > >
> > >
> > > > At Hammerspace we have implemented courtesy locks, and our
> > > > strategy
> > > > is
> > > > that when a conflict occurs, we drop the entire set of courtesy
> > > > locks
> > > > so that we don't have to deal with the "some locks were
> > > > revoked"
> > > > scenario. The reason is that when we originally implemented
> > > > courtesy
> > > > locks, the Linux NFSv4 client support for lock revocation was a
> > > > lot
> > > > less sophisticated than today. My suggestion is that you might
> > > > therefore consider starting along this path, and then refining
> > > > the
> > > > support to make revocation more nuanced once you are confident
> > > > that
> > > > the
> > > > coarser strategy is working as expected.
> > > Dai’s implementation does all that, and takes the coarser
> > > approach at
> > > the moment. There are plans to explore the more nuanced behavior
> > > (by
> > > revoking only the conflicting lock instead of dropping the whole
> > > lease) after this initial work is merged.
> > >
> > > The issue is there are certain pathological client behaviors
> > > (whether
> > > malicious or accidental) that can run the server out of
> > > resources,
> > > since it is holding onto lease state for a much longer time. We
> > > are
> > > simply trying to design a lease garbage collection scheme to meet
> > > that challenge.
> > >
> > > I think limiting the number of courteous clients is a simple way
> > > to
> > > do this, but we could also shorten the courtesy lifetime as more
> > > clients enter that state, to ensure that they don’t overrun the
> > > server’s memory. Another approach might be to add a shrinker that
> > > purges the oldest courteous clients when the server comes under
> > > memory pressure.
> > >
> > >
> > We already have a scanner that tries to release all client state
> > after
> > 1 lease period. Just extend that to do it after 10 lease periods.
> > If a
> > network partition hasn't recovered after 10 minutes, you probably
> > have
> > bigger problems.
>
> Currently the courteous server allows 24hr for the network partition
> to
> heal before releasing all client state. That seems to be excessive
> but
> it was suggested for longer network partition conditions when
> switch/routers
> being repaired/upgraded.
>
> >
> > You can limit the number of clients as well, but that leads into a
> > rats
> > nest of other issues that have nothing to do with courtesy locks
> > and
> > everything to do with the fact that any client can hold a lot of
> > state.
>
> The issue we currently have with courteous server and pynfs 4.0 tests
> is the number of courteous 4.0 clients the server has to expire when
> a
> share reservation conflict occurs when servicing the OPEN. Each
> client
> owns only few state in this case so we think the server spent most
> time
> for deleting client's record in /var/lib/nfs. This is why we plan to
> limit the number of courteous clients for now. As a side effect, it
> might
> also help to reduce resource consumption too.

Then kick off a thread or work item to do that asynchronously in the
background, and return NFS4ERR_DELAY to the clients that were trying to
grab locks in the meantime.

The above process is hardly just confined to NFSv4.0 clients. If there
is a network partition, then the exact same record deletion needs to be
applied to all NFSv4.1 and NFSv4.2 clients that hold locks and are
unable to renew their leases, so you might as well make it work for
everyone.
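
A minimal sketch of what that deferral could look like in nfsd, assuming the
conflict path has already flagged the courtesy clients it found; the helper
name and the reuse of the laundromat machinery here are assumptions, not code
from this series:

/*
 * Illustrative sketch only -- not code from this series.
 * The caller is assumed to have already marked the conflicting courtesy
 * clients for expiry; the laundromat then removes their records in the
 * background while the requester is told to retry.
 */
static __be32 nfsd4_defer_courtesy_expiry(struct nfsd_net *nn)
{
	/* run the laundromat as soon as possible */
	mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);

	return nfserr_jukebox;	/* NFS4ERR_DELAY on the wire */
}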

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2021-11-30 15:32:16

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Mon, Nov 29, 2021 at 11:13:34PM -0800, [email protected] wrote:
> Just to be clear, the problem of pynfs with 4.0 is that the server takes
> ~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms
> per client. This causes the test to time out in waiting for RPC reply of
> the OPEN that triggers the conflicts.
>
> I don't know exactly where the time spent in the process of expiring a
> client. But as Bruce mentioned, it could be related to the time to access
> /var/lib/nfs to remove the client's persistent record.

Could you try something like

strace -r -$(pidof) -oTRACE

and maybe we could take a look at TRACE? My hope would be that there'd
be a clear set of syscalls whose time, multiplied by 1026, explains most
of that 55 seconds. Then it might be worth checking whether there are
any easy optimizations possible.

--b.

2021-11-30 15:37:00

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server



> On Nov 30, 2021, at 2:22 AM, Dai Ngo <[email protected]> wrote:
>
>
> On 11/29/21 8:57 PM, Trond Myklebust wrote:
>> On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote:
>>>> On Nov 29, 2021, at 11:08 PM, Trond Myklebust
>>>> <[email protected]> wrote:
>>>>
>>>> On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
>>>>>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]>
>>>>>>> wrote:
>>>>>> 
>>>>>>> On 11/29/21 1:10 PM, Chuck Lever III wrote:
>>>>>>>
>>>>>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>>>>>>>>> Hello Dai!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo
>>>>>>>>>> <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>>>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800,
>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>> Hi Bruce,
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800,
>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>> On 11/17/21 9:59 AM,
>>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote:
>>>>>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800,
>>>>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>>>>> Just a reminder that this patch is
>>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>> waiting for your review.
>>>>>>>>>>>>>>>>> Yeah, I was procrastinating and hoping
>>>>>>>>>>>>>>>>> yo'ud
>>>>>>>>>>>>>>>>> figure out the pynfs
>>>>>>>>>>>>>>>>> failure for me....
>>>>>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> it passed. I will run
>>>>>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to
>>>>>>>>>>>>>>>> see if
>>>>>>>>>>>>>>>> the problem you've
>>>>>>>>>>>>>>>> seen still there.
>>>>>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with
>>>>>>>>>>>>>>> courteous and non-courteous
>>>>>>>>>>>>>>> 5.15-rc7 server.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nfs4.1 results are the same for both
>>>>>>>>>>>>>>> courteous
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> non-courteous server:
>>>>>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned,
>>>>>>>>>>>>>>>> 169
>>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>>> Results of nfs4.0 with non-courteous server:
>>>>>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned,
>>>>>>>>>>>>>>>> 577
>>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>>> test failed: LOCK24
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned,
>>>>>>>>>>>>>>>> 575
>>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by
>>>>>>>>>>>>>>> itself.
>>>>>>>>>>>>>> Could well be a bug in the tests, I don't know.
>>>>>>>>>>>>> The reason OPEN18 failed was because the test
>>>>>>>>>>>>> timed
>>>>>>>>>>>>> out waiting for
>>>>>>>>>>>>> the reply of an OPEN call. The RPC connection
>>>>>>>>>>>>> used
>>>>>>>>>>>>> for the test was
>>>>>>>>>>>>> configured with 15 secs timeout. Note that OPEN18
>>>>>>>>>>>>> only fails when
>>>>>>>>>>>>> the tests were run with 'all' option, this test
>>>>>>>>>>>>> passes if it's run
>>>>>>>>>>>>> by itself.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With courteous server, by the time OPEN18 runs,
>>>>>>>>>>>>> there
>>>>>>>>>>>>> are about 1026
>>>>>>>>>>>>> courtesy 4.0 clients on the server and all of
>>>>>>>>>>>>> these
>>>>>>>>>>>>> clients have opened
>>>>>>>>>>>>> the same file X with WRITE access. These clients
>>>>>>>>>>>>> were
>>>>>>>>>>>>> created by the
>>>>>>>>>>>>> previous tests. After each test completed, since
>>>>>>>>>>>>> 4.0
>>>>>>>>>>>>> does not have
>>>>>>>>>>>>> session, the client states are not cleaned up
>>>>>>>>>>>>> immediately on the
>>>>>>>>>>>>> server and are allowed to become courtesy
>>>>>>>>>>>>> clients.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st
>>>>>>>>>>>>> test
>>>>>>>>>>>>> started), it
>>>>>>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE
>>>>>>>>>>>>> which causes the
>>>>>>>>>>>>> server to check for conflicts with courtesy
>>>>>>>>>>>>> clients.
>>>>>>>>>>>>> The loop that
>>>>>>>>>>>>> checks 1026 courtesy clients for share/access
>>>>>>>>>>>>> conflict took less
>>>>>>>>>>>>> than 1 sec. But it took about 55 secs, on my VM,
>>>>>>>>>>>>> for
>>>>>>>>>>>>> the server
>>>>>>>>>>>>> to expire all 1026 courtesy clients.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I modified pynfs to configure the 4.0 RPC
>>>>>>>>>>>>> connection
>>>>>>>>>>>>> with 60 seconds
>>>>>>>>>>>>> timeout and OPEN18 now consistently passed. The
>>>>>>>>>>>>> 4.0
>>>>>>>>>>>>> test results are
>>>>>>>>>>>>> now the same for courteous and non-courteous
>>>>>>>>>>>>> server:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that 4.1 tests do not suffer this timeout
>>>>>>>>>>>>> problem because the
>>>>>>>>>>>>> 4.1 clients and sessions are destroyed after each
>>>>>>>>>>>>> test completes.
>>>>>>>>>>>> Do you want me to send the patch to increase the
>>>>>>>>>>>> timeout for pynfs?
>>>>>>>>>>>> or is there any other things you think we should
>>>>>>>>>>>> do?
>>>>>>>>>>> I don't know.
>>>>>>>>>>>
>>>>>>>>>>> 55 seconds to clean up 1026 clients is about 50ms per
>>>>>>>>>>> client, which is
>>>>>>>>>>> pretty slow. I wonder why. I guess it's probably
>>>>>>>>>>> updating the stable
>>>>>>>>>>> storage information. Is /var/lib/nfs/ on your server
>>>>>>>>>>> backed by a hard
>>>>>>>>>>> drive or an SSD or something else?
>>>>>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM
>>>>>>>>>> and
>>>>>>>>>> 64GB of hard
>>>>>>>>>> disk. I think a production system that supports this
>>>>>>>>>> many
>>>>>>>>>> clients should
>>>>>>>>>> have faster CPUs, faster storage.
>>>>>>>>>>
>>>>>>>>>>> I wonder if that's an argument for limiting the
>>>>>>>>>>> number of
>>>>>>>>>>> courtesy
>>>>>>>>>>> clients.
>>>>>>>>>> I think we might want to treat 4.0 clients a bit
>>>>>>>>>> different
>>>>>>>>>> from 4.1
>>>>>>>>>> clients. With 4.0, every client will become a courtesy
>>>>>>>>>> client after
>>>>>>>>>> the client is done with the export and unmounts it.
>>>>>>>>> It should be safe for a server to purge a client's lease
>>>>>>>>> immediately
>>>>>>>>> if there is no open or lock state associated with it.
>>>>>>>> In this case, each client has opened files so there are
>>>>>>>> open
>>>>>>>> states
>>>>>>>> associated with them.
>>>>>>>>
>>>>>>>>> When an NFSv4.0 client unmounts, all files should be
>>>>>>>>> closed
>>>>>>>>> at that
>>>>>>>>> point,
>>>>>>>> I'm not sure pynfs does proper clean up after each subtest,
>>>>>>>> I
>>>>>>>> will
>>>>>>>> check. There must be state associated with the client in
>>>>>>>> order
>>>>>>>> for
>>>>>>>> it to become courtesy client.
>>>>>>> Makes sense. Then a synthetic client like pynfs can DoS a
>>>>>>> courteous
>>>>>>> server.
>>>>>>>
>>>>>>>
>>>>>>>>> so the server can wait for the lease to expire and purge
>>>>>>>>> it
>>>>>>>>> normally. Or am I missing something?
>>>>>>>> When 4.0 client lease expires and there are still states
>>>>>>>> associated
>>>>>>>> with the client then the server allows this client to
>>>>>>>> become
>>>>>>>> courtesy
>>>>>>>> client.
>>>>>>> I think the same thing happens if an NFSv4.1 client neglects
>>>>>>> to
>>>>>>> send
>>>>>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is
>>>>>>> broken
>>>>>>> or malicious, but the server faces the same issue of
>>>>>>> protecting
>>>>>>> itself from a DoS attack.
>>>>>>>
>>>>>>> IMO you should consider limiting the number of courteous
>>>>>>> clients
>>>>>>> the server can hold onto. Let's say that number is 1000. When
>>>>>>> the
>>>>>>> server wants to turn a 1001st client into a courteous client,
>>>>>>> it
>>>>>>> can simply expire and purge the oldest courteous client on
>>>>>>> its
>>>>>>> list. Otherwise, over time, the 24-hour expiry will reduce
>>>>>>> the
>>>>>>> set of courteous clients back to zero.
>>>>>>>
>>>>>>> What do you think?
>>>>>> Limiting the number of courteous clients to handle the cases of
>>>>>> broken/malicious 4.1 clients seems reasonable as the last
>>>>>> resort.
>>>>>>
>>>>>> I think if a malicious 4.1 clients could mount the server's
>>>>>> export,
>>>>>> opens a file (to create state) and repeats the same with a
>>>>>> different
>>>>>> client id then it seems like some basic security was already
>>>>>> broken;
>>>>>> allowing unauthorized clients to mount server's exports.
>>>>> You can do this today with AUTH_SYS. I consider it a genuine
>>>>> attack
>>>>> surface.
>>>>>
>>>>>
>>>>>> I think if we have to enforce a limit, then it's only for
>>>>>> handling
>>>>>> of seriously buggy 4.1 clients which should not be the norm.
>>>>>> The
>>>>>> issue with this is how to pick an optimal number that is
>>>>>> suitable
>>>>>> for the running server which can be a very slow or a very fast
>>>>>> server.
>>>>>>
>>>>>> Note that even if we impose an limit, that does not completely
>>>>>> solve
>>>>>> the problem with pynfs 4.0 test since its RPC timeout is
>>>>>> configured
>>>>>> with 15 secs which just enough to expire 277 clients based on
>>>>>> 53ms
>>>>>> for each client, unless we limit it ~270 clients which I think
>>>>>> it's
>>>>>> too low.
>>>>>>
>>>>>> This is what I plan to do:
>>>>>>
>>>>>> 1. do not support 4.0 courteous clients, for sure.
>>>>> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported
>>>>> protocol at this time, and the same exposure exists for 4.1, it’s
>>>>> just a little harder to exploit.
>>>>>
>>>>> If you submit the courteous server patch without support for 4.0,
>>>>> I
>>>>> think it needs to include a plan for how 4.0 will be added later.
>>>>>
>>>> Why is there a problem here? The requirements are the same for 4.0
>>>> and
>>>> 4.1 (or 4.2). If the lease under which the courtesy lock was
>>>> established has expired, then that courtesy lock must be released
>>>> if
>>>> some other client requests a lock that conflicts with the cached
>>>> lock
>>>> (unless the client breaks the courtesy framework by renewing that
>>>> original lease before the conflict occurs). Otherwise, it is
>>>> completely
>>>> up to the server when it decides to actually release the lock.
>>>>
>>>> For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the
>>>> server when the client is actually done with the lease, making it
>>>> easy
>>>> to determine when it is safe to release all the courtesy locks.
>>>> However
>>>> if the client does not send DESTROY_CLIENTID, then we're in the
>>>> same
>>>> situation with 4.x (x>0) as we would be with bog standard NFSv4.0.
>>>> The
>>>> lease has expired, and so the courtesy locks are liable to being
>>>> dropped.
>>> I agree the situation is the same for all minor versions.
>>>
>>>
>>>> At Hammerspace we have implemented courtesy locks, and our strategy
>>>> is
>>>> that when a conflict occurs, we drop the entire set of courtesy
>>>> locks
>>>> so that we don't have to deal with the "some locks were revoked"
>>>> scenario. The reason is that when we originally implemented
>>>> courtesy
>>>> locks, the Linux NFSv4 client support for lock revocation was a lot
>>>> less sophisticated than today. My suggestion is that you might
>>>> therefore consider starting along this path, and then refining the
>>>> support to make revocation more nuanced once you are confident that
>>>> the
>>>> coarser strategy is working as expected.
>>> Dai’s implementation does all that, and takes the coarser approach at
>>> the moment. There are plans to explore the more nuanced behavior (by
>>> revoking only the conflicting lock instead of dropping the whole
>>> lease) after this initial work is merged.
>>>
>>> The issue is there are certain pathological client behaviors (whether
>>> malicious or accidental) that can run the server out of resources,
>>> since it is holding onto lease state for a much longer time. We are
>>> simply trying to design a lease garbage collection scheme to meet
>>> that challenge.
>>>
>>> I think limiting the number of courteous clients is a simple way to
>>> do this, but we could also shorten the courtesy lifetime as more
>>> clients enter that state, to ensure that they don’t overrun the
>>> server’s memory. Another approach might be to add a shrinker that
>>> purges the oldest courteous clients when the server comes under
>>> memory pressure.
>>>
>>>
>> We already have a scanner that tries to release all client state after
>> 1 lease period. Just extend that to do it after 10 lease periods. If a
>> network partition hasn't recovered after 10 minutes, you probably have
>> bigger problems.
>
> Currently the courteous server allows 24hr for the network partition to
> heal before releasing all client state. That seems to be excessive but
> it was suggested for longer network partition conditions when switch/routers
> being repaired/upgraded.

Sure, 24 hours is a long time.

For the benefit of others on the list, we have seen customer
failure scenarios where networks were partitioned for that
long.

But it's an arbitrary number, and there's no specification
for how long a server needs to hold a courtesy client. We
can make this number anything that is both convenient for
the server implementation and valuable for outage recovery.


>> You can limit the number of clients as well, but that leads into a rats
>> nest of other issues that have nothing to do with courtesy locks and
>> everything to do with the fact that any client can hold a lot of state.
>
> The issue we currently have with courteous server and pynfs 4.0 tests
> is the number of courteous 4.0 clients the server has to expire when a
> share reservation conflict occurs when servicing the OPEN. Each client
> owns only few state in this case so we think the server spent most time
> for deleting client's record in /var/lib/nfs. This is why we plan to
> limit the number of courteous clients for now. As a side effect, it might
> also help to reduce resource consumption too.

I am a little concerned that we are trying to optimize a case
that won't happen in practice. pynfs does not reflect any
kind of realistic or reasonable client behavior -- it's designed
to test very specific server operations.

All that needs to happen, IMO, is that the server needs to
protect itself from resource exhaustion (which may occur for
any of the minor versions).

So I'm taking a shine to the idea of using a shrinker to trim
the older courtesy clients, rather than placing an arbitrary
limit on the number of courtesy clients the server can hold at
once. A shrinker should take into account the wide variance in
the amount of lease state each client might have.
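
For the record, a rough sketch of what such a shrinker could look like with
the 5.15-era API; courtesy_client_count and trim_courtesy_clients are
hypothetical placeholders, not existing nfsd symbols:

/* Illustrative sketch only, not part of the posted series. */
static unsigned long courtesy_shrinker_count(struct shrinker *shrink,
					     struct shrink_control *sc)
{
	return atomic_read(&courtesy_client_count);
}

static unsigned long courtesy_shrinker_scan(struct shrinker *shrink,
					    struct shrink_control *sc)
{
	/* expire up to sc->nr_to_scan of the oldest courtesy clients */
	return trim_courtesy_clients(sc->nr_to_scan);
}

static struct shrinker courtesy_shrinker = {
	.count_objects	= courtesy_shrinker_count,
	.scan_objects	= courtesy_shrinker_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* registered once at nfsd startup: register_shrinker(&courtesy_shrinker); */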


--
Chuck Lever



2021-11-30 16:05:18

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, Nov 30, 2021 at 03:36:43PM +0000, Chuck Lever III wrote:
> I am a little concerned that we are trying to optimize a case
> that won't happen during practice. pynfs does not reflect any
> kind of realistic or reasonable client behavior -- it's designed
> to test very specific server operations.

I wonder how hard this problem would be to hit in normal use. I mean, a
few hundred or a thousand clients doesn't sound that crazy. This case
depends on an open deny, but you could hit the same problem with file
locks. Would it be that weird to have a client trying to get a write
lock on a file read-locked by a bunch of other clients?

--b.

2021-11-30 16:14:14

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, 2021-11-30 at 11:05 -0500, Bruce Fields wrote:
> On Tue, Nov 30, 2021 at 03:36:43PM +0000, Chuck Lever III wrote:
> > I am a little concerned that we are trying to optimize a case
> > that won't happen during practice. pynfs does not reflect any
> > kind of realistic or reasonable client behavior -- it's designed
> > to test very specific server operations.
>
> I wonder how hard this problem would be to hit in normal use.  I
> mean, a
> few hundred or a thousand clients doesn't sound that crazy.  This
> case
> depends on an open deny, but you could hit the same problem with file
> locks.  Would it be that weird to have a client trying to get a write
> lock on a file read-locked by a bunch of other clients?
>

That's a scenario that is subject to starvation problems anyway.
Particularly so on NFSv4.0, which lacks CB_NOTIFY_LOCK.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2021-11-30 19:01:07

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, Nov 30, 2021 at 04:14:10PM +0000, Trond Myklebust wrote:
> On Tue, 2021-11-30 at 11:05 -0500, Bruce Fields wrote:
> > On Tue, Nov 30, 2021 at 03:36:43PM +0000, Chuck Lever III wrote:
> > > I am a little concerned that we are trying to optimize a case
> > > that won't happen during practice. pynfs does not reflect any
> > > kind of realistic or reasonable client behavior -- it's designed
> > > to test very specific server operations.
> >
> > I wonder how hard this problem would be to hit in normal use.  I
> > mean, a
> > few hundred or a thousand clients doesn't sound that crazy.  This
> > case
> > depends on an open deny, but you could hit the same problem with file
> > locks.  Would it be that weird to have a client trying to get a write
> > lock on a file read-locked by a bunch of other clients?
> >
>
> That's a scenario that is subject to starvation problems anyway.

Yes, if it's hundreds of clients continuously grabbing read locks. But
if it's something like: send all the readers a signal, then request a
write lock as a way to wait for them to finish; then you'd normally
expect to get it soon after the last client drops its lock.

I don't know, maybe that's uncommon.

--b.

2021-12-01 03:50:28

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/30/21 7:32 AM, Bruce Fields wrote:
> On Mon, Nov 29, 2021 at 11:13:34PM -0800, [email protected] wrote:
>> Just to be clear, the problem of pynfs with 4.0 is that the server takes
>> ~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms
>> per client. This causes the test to time out in waiting for RPC reply of
>> the OPEN that triggers the conflicts.
>>
>> I don't know exactly where the time spent in the process of expiring a
>> client. But as Bruce mentioned, it could be related to the time to access
>> /var/lib/nfs to remove the client's persistent record.
> Could you try something like
>
> strace -r -$(pidof) -oTRACE

Strace does not have any info that shows where the server spends time when
expiring client state. The client record is removed by nfsd4_umh_cltrack_remove
doing an upcall to the user-space helper /sbin/nfsdcltrack to do the job. I used
the low-tech debug tool, printk, to measure the time spent by
nfsd4_client_record_remove. Here is a sample of the output; START and END
are in milliseconds:

Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d418] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0]
Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d459] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0]
Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d461] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0]
Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d48e] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0]
Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d49c] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0]
Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d4c5] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0]

The average time spent removing a client record is about 50ms, which matches
the time reported by the pynfs test. This confirms what Bruce suspected
earlier.
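
(For reference, the instrumentation was nothing more than bracketing the call
with timestamps, roughly like the sketch below; the wrapper name is made up
and the actual debug patch may differ.)

/* Illustrative debug-only wrapper, not part of the series. */
static void nfsd4_client_record_remove_timed(struct nfs4_client *clp)
{
	ktime_t start = ktime_get();

	nfsd4_client_record_remove(clp);

	printk(KERN_INFO "nfsd4_client_record_remove: clp[%p] took %lld ms\n",
	       clp, ktime_ms_delta(ktime_get(), start));
}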

-Dai

>
> and maybe we could take a look at TRACE? My hope would be that there'd
> be a clear set of syscalls whose time, multiplied by 1026, explains most
> of that 55 seconds. Then it might be worth checking whether there are
> any easy optimizations possible.
>
> --b.

2021-12-01 03:52:22

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 11/30/21 5:37 AM, Trond Myklebust wrote:
> On Mon, 2021-11-29 at 23:22 -0800, [email protected] wrote:
>> On 11/29/21 8:57 PM, Trond Myklebust wrote:
>>> On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote:
>>>>> On Nov 29, 2021, at 11:08 PM, Trond Myklebust
>>>>> <[email protected]> wrote:
>>>>>
>>>>> On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
>>>>>>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <[email protected]>
>>>>>>>> wrote:
>>>>>>> 
>>>>>>>> On 11/29/21 1:10 PM, Chuck Lever III wrote:
>>>>>>>>
>>>>>>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo
>>>>>>>>>> <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote:
>>>>>>>>>> Hello Dai!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo
>>>>>>>>>>> <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote:
>>>>>>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800,
>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>> Hi Bruce,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/21/21 7:04 PM, [email protected] wrote:
>>>>>>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote:
>>>>>>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800,
>>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>>> On 11/17/21 9:59 AM,
>>>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -
>>>>>>>>>>>>>>>>>> 0800,
>>>>>>>>>>>>>>>>>> [email protected] wrote:
>>>>>>>>>>>>>>>>>>> Just a reminder that this patch is
>>>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>> waiting for your review.
>>>>>>>>>>>>>>>>>> Yeah, I was procrastinating and
>>>>>>>>>>>>>>>>>> hoping
>>>>>>>>>>>>>>>>>> yo'ud
>>>>>>>>>>>>>>>>>> figure out the pynfs
>>>>>>>>>>>>>>>>>> failure for me....
>>>>>>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by
>>>>>>>>>>>>>>>>> itself
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> it passed. I will run
>>>>>>>>>>>>>>>>> all OPEN tests together with 5.15-rc7
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> see if
>>>>>>>>>>>>>>>>> the problem you've
>>>>>>>>>>>>>>>>> seen still there.
>>>>>>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0
>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>> courteous and non-courteous
>>>>>>>>>>>>>>>> 5.15-rc7 server.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nfs4.1 results are the same for both
>>>>>>>>>>>>>>>> courteous
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> non-courteous server:
>>>>>>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0
>>>>>>>>>>>>>>>>> Warned,
>>>>>>>>>>>>>>>>> 169
>>>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>>>> Results of nfs4.0 with non-courteous
>>>>>>>>>>>>>>>> server:
>>>>>>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0
>>>>>>>>>>>>>>>>> Warned,
>>>>>>>>>>>>>>>>> 577
>>>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>>>> test failed: LOCK24
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Results of nfs4.0 with courteous server:
>>>>>>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0
>>>>>>>>>>>>>>>>> Warned,
>>>>>>>>>>>>>>>>> 575
>>>>>>>>>>>>>>>>> Passed
>>>>>>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is
>>>>>>>>>>>>>>>> run by
>>>>>>>>>>>>>>>> itself.
>>>>>>>>>>>>>>> Could well be a bug in the tests, I don't
>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>> The reason OPEN18 failed was because the test
>>>>>>>>>>>>>> timed
>>>>>>>>>>>>>> out waiting for
>>>>>>>>>>>>>> the reply of an OPEN call. The RPC connection
>>>>>>>>>>>>>> used
>>>>>>>>>>>>>> for the test was
>>>>>>>>>>>>>> configured with 15 secs timeout. Note that
>>>>>>>>>>>>>> OPEN18
>>>>>>>>>>>>>> only fails when
>>>>>>>>>>>>>> the tests were run with 'all' option, this
>>>>>>>>>>>>>> test
>>>>>>>>>>>>>> passes if it's run
>>>>>>>>>>>>>> by itself.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With courteous server, by the time OPEN18
>>>>>>>>>>>>>> runs,
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>> are about 1026
>>>>>>>>>>>>>> courtesy 4.0 clients on the server and all of
>>>>>>>>>>>>>> these
>>>>>>>>>>>>>> clients have opened
>>>>>>>>>>>>>> the same file X with WRITE access. These
>>>>>>>>>>>>>> clients
>>>>>>>>>>>>>> were
>>>>>>>>>>>>>> created by the
>>>>>>>>>>>>>> previous tests. After each test completed,
>>>>>>>>>>>>>> since
>>>>>>>>>>>>>> 4.0
>>>>>>>>>>>>>> does not have
>>>>>>>>>>>>>> session, the client states are not cleaned up
>>>>>>>>>>>>>> immediately on the
>>>>>>>>>>>>>> server and are allowed to become courtesy
>>>>>>>>>>>>>> clients.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When OPEN18 runs (about 20 minutes after the
>>>>>>>>>>>>>> 1st
>>>>>>>>>>>>>> test
>>>>>>>>>>>>>> started), it
>>>>>>>>>>>>>> sends OPEN of file X with
>>>>>>>>>>>>>> OPEN4_SHARE_DENY_WRITE
>>>>>>>>>>>>>> which causes the
>>>>>>>>>>>>>> server to check for conflicts with courtesy
>>>>>>>>>>>>>> clients.
>>>>>>>>>>>>>> The loop that
>>>>>>>>>>>>>> checks 1026 courtesy clients for share/access
>>>>>>>>>>>>>> conflict took less
>>>>>>>>>>>>>> than 1 sec. But it took about 55 secs, on my
>>>>>>>>>>>>>> VM,
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>> the server
>>>>>>>>>>>>>> to expire all 1026 courtesy clients.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I modified pynfs to configure the 4.0 RPC
>>>>>>>>>>>>>> connection
>>>>>>>>>>>>>> with 60 seconds
>>>>>>>>>>>>>> timeout and OPEN18 now consistently passed.
>>>>>>>>>>>>>> The
>>>>>>>>>>>>>> 4.0
>>>>>>>>>>>>>> test results are
>>>>>>>>>>>>>> now the same for courteous and non-courteous
>>>>>>>>>>>>>> server:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that 4.1 tests do not suffer this
>>>>>>>>>>>>>> timeout
>>>>>>>>>>>>>> problem because the
>>>>>>>>>>>>>> 4.1 clients and sessions are destroyed after
>>>>>>>>>>>>>> each
>>>>>>>>>>>>>> test completes.
>>>>>>>>>>>>> Do you want me to send the patch to increase
>>>>>>>>>>>>> the
>>>>>>>>>>>>> timeout for pynfs?
>>>>>>>>>>>>> or is there any other things you think we
>>>>>>>>>>>>> should
>>>>>>>>>>>>> do?
>>>>>>>>>>>> I don't know.
>>>>>>>>>>>>
>>>>>>>>>>>> 55 seconds to clean up 1026 clients is about 50ms
>>>>>>>>>>>> per
>>>>>>>>>>>> client, which is
>>>>>>>>>>>> pretty slow.  I wonder why.  I guess it's
>>>>>>>>>>>> probably
>>>>>>>>>>>> updating the stable
>>>>>>>>>>>> storage information.  Is /var/lib/nfs/ on your
>>>>>>>>>>>> server
>>>>>>>>>>>> backed by a hard
>>>>>>>>>>>> drive or an SSD or something else?
>>>>>>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB
>>>>>>>>>>> RAM
>>>>>>>>>>> and
>>>>>>>>>>> 64GB of hard
>>>>>>>>>>> disk. I think a production system that supports
>>>>>>>>>>> this
>>>>>>>>>>> many
>>>>>>>>>>> clients should
>>>>>>>>>>> have faster CPUs, faster storage.
>>>>>>>>>>>
>>>>>>>>>>>> I wonder if that's an argument for limiting the
>>>>>>>>>>>> number of
>>>>>>>>>>>> courtesy
>>>>>>>>>>>> clients.
>>>>>>>>>>> I think we might want to treat 4.0 clients a bit
>>>>>>>>>>> different
>>>>>>>>>>> from 4.1
>>>>>>>>>>> clients. With 4.0, every client will become a
>>>>>>>>>>> courtesy
>>>>>>>>>>> client after
>>>>>>>>>>> the client is done with the export and unmounts it.
>>>>>>>>>> It should be safe for a server to purge a client's
>>>>>>>>>> lease
>>>>>>>>>> immediately
>>>>>>>>>> if there is no open or lock state associated with it.
>>>>>>>>> In this case, each client has opened files so there are
>>>>>>>>> open
>>>>>>>>> states
>>>>>>>>> associated with them.
>>>>>>>>>
>>>>>>>>>> When an NFSv4.0 client unmounts, all files should be
>>>>>>>>>> closed
>>>>>>>>>> at that
>>>>>>>>>> point,
>>>>>>>>> I'm not sure pynfs does proper clean up after each
>>>>>>>>> subtest,
>>>>>>>>> I
>>>>>>>>> will
>>>>>>>>> check. There must be state associated with the client
>>>>>>>>> in
>>>>>>>>> order
>>>>>>>>> for
>>>>>>>>> it to become courtesy client.
>>>>>>>> Makes sense. Then a synthetic client like pynfs can DoS a
>>>>>>>> courteous
>>>>>>>> server.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> so the server can wait for the lease to expire and
>>>>>>>>>> purge
>>>>>>>>>> it
>>>>>>>>>> normally. Or am I missing something?
>>>>>>>>> When 4.0 client lease expires and there are still
>>>>>>>>> states
>>>>>>>>> associated
>>>>>>>>> with the client then the server allows this client to
>>>>>>>>> become
>>>>>>>>> courtesy
>>>>>>>>> client.
>>>>>>>> I think the same thing happens if an NFSv4.1 client
>>>>>>>> neglects
>>>>>>>> to
>>>>>>>> send
>>>>>>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client
>>>>>>>> is
>>>>>>>> broken
>>>>>>>> or malicious, but the server faces the same issue of
>>>>>>>> protecting
>>>>>>>> itself from a DoS attack.
>>>>>>>>
>>>>>>>> IMO you should consider limiting the number of courteous
>>>>>>>> clients
>>>>>>>> the server can hold onto. Let's say that number is 1000.
>>>>>>>> When
>>>>>>>> the
>>>>>>>> server wants to turn a 1001st client into a courteous
>>>>>>>> client,
>>>>>>>> it
>>>>>>>> can simply expire and purge the oldest courteous client
>>>>>>>> on
>>>>>>>> its
>>>>>>>> list. Otherwise, over time, the 24-hour expiry will
>>>>>>>> reduce
>>>>>>>> the
>>>>>>>> set of courteous clients back to zero.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>> Limiting the number of courteous clients to handle the
>>>>>>> cases of
>>>>>>> broken/malicious 4.1 clients seems reasonable as the last
>>>>>>> resort.
>>>>>>>
>>>>>>> I think if a malicious 4.1 clients could mount the server's
>>>>>>> export,
>>>>>>> opens a file (to create state) and repeats the same with a
>>>>>>> different
>>>>>>> client id then it seems like some basic security was
>>>>>>> already
>>>>>>> broken;
>>>>>>> allowing unauthorized clients to mount server's exports.
>>>>>> You can do this today with AUTH_SYS. I consider it a genuine
>>>>>> attack
>>>>>> surface.
>>>>>>
>>>>>>
>>>>>>> I think if we have to enforce a limit, then it's only for
>>>>>>> handling
>>>>>>> of seriously buggy 4.1 clients which should not be the
>>>>>>> norm.
>>>>>>> The
>>>>>>> issue with this is how to pick an optimal number that is
>>>>>>> suitable
>>>>>>> for the running server which can be a very slow or a very
>>>>>>> fast
>>>>>>> server.
>>>>>>>
>>>>>>> Note that even if we impose an limit, that does not
>>>>>>> completely
>>>>>>> solve
>>>>>>> the problem with pynfs 4.0 test since its RPC timeout is
>>>>>>> configured
>>>>>>> with 15 secs which just enough to expire 277 clients based
>>>>>>> on
>>>>>>> 53ms
>>>>>>> for each client, unless we limit it ~270 clients which I
>>>>>>> think
>>>>>>> it's
>>>>>>> too low.
>>>>>>>
>>>>>>> This is what I plan to do:
>>>>>>>
>>>>>>> 1. do not support 4.0 courteous clients, for sure.
>>>>>> Not supporting 4.0 isn’t an option, IMHO. It is a fully
>>>>>> supported
>>>>>> protocol at this time, and the same exposure exists for 4.1,
>>>>>> it’s
>>>>>> just a little harder to exploit.
>>>>>>
>>>>>> If you submit the courteous server patch without support for
>>>>>> 4.0,
>>>>>> I
>>>>>> think it needs to include a plan for how 4.0 will be added
>>>>>> later.
>>>>>>
>>>>> Why is there a problem here? The requirements are the same for
>>>>> 4.0
>>>>> and
>>>>> 4.1 (or 4.2). If the lease under which the courtesy lock was
>>>>> established has expired, then that courtesy lock must be
>>>>> released
>>>>> if
>>>>> some other client requests a lock that conflicts with the
>>>>> cached
>>>>> lock
>>>>> (unless the client breaks the courtesy framework by renewing
>>>>> that
>>>>> original lease before the conflict occurs). Otherwise, it is
>>>>> completely
>>>>> up to the server when it decides to actually release the lock.
>>>>>
>>>>> For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells
>>>>> the
>>>>> server when the client is actually done with the lease, making
>>>>> it
>>>>> easy
>>>>> to determine when it is safe to release all the courtesy locks.
>>>>> However
>>>>> if the client does not send DESTROY_CLIENTID, then we're in the
>>>>> same
>>>>> situation with 4.x (x>0) as we would be with bog standard
>>>>> NFSv4.0.
>>>>> The
>>>>> lease has expired, and so the courtesy locks are liable to
>>>>> being
>>>>> dropped.
>>>> I agree the situation is the same for all minor versions.
>>>>
>>>>
>>>>> At Hammerspace we have implemented courtesy locks, and our
>>>>> strategy
>>>>> is
>>>>> that when a conflict occurs, we drop the entire set of courtesy
>>>>> locks
>>>>> so that we don't have to deal with the "some locks were
>>>>> revoked"
>>>>> scenario. The reason is that when we originally implemented
>>>>> courtesy
>>>>> locks, the Linux NFSv4 client support for lock revocation was a
>>>>> lot
>>>>> less sophisticated than today. My suggestion is that you might
>>>>> therefore consider starting along this path, and then refining
>>>>> the
>>>>> support to make revocation more nuanced once you are confident
>>>>> that
>>>>> the
>>>>> coarser strategy is working as expected.
>>>> Dai’s implementation does all that, and takes the coarser
>>>> approach at
>>>> the moment. There are plans to explore the more nuanced behavior
>>>> (by
>>>> revoking only the conflicting lock instead of dropping the whole
>>>> lease) after this initial work is merged.
>>>>
>>>> The issue is there are certain pathological client behaviors
>>>> (whether
>>>> malicious or accidental) that can run the server out of
>>>> resources,
>>>> since it is holding onto lease state for a much longer time. We
>>>> are
>>>> simply trying to design a lease garbage collection scheme to meet
>>>> that challenge.
>>>>
>>>> I think limiting the number of courteous clients is a simple way
>>>> to
>>>> do this, but we could also shorten the courtesy lifetime as more
>>>> clients enter that state, to ensure that they don’t overrun the
>>>> server’s memory. Another approach might be to add a shrinker that
>>>> purges the oldest courteous clients when the server comes under
>>>> memory pressure.
>>>>
>>>>
>>> We already have a scanner that tries to release all client state
>>> after
>>> 1 lease period. Just extend that to do it after 10 lease periods.
>>> If a
>>> network partition hasn't recovered after 10 minutes, you probably
>>> have
>>> bigger problems.
>> Currently the courteous server allows 24hr for the network partition
>> to
>> heal before releasing all client state. That seems to be excessive
>> but
>> it was suggested for longer network partition conditions when
>> switch/routers
>> being repaired/upgraded.
>>
>>> You can limit the number of clients as well, but that leads into a
>>> rats
>>> nest of other issues that have nothing to do with courtesy locks
>>> and
>>> everything to do with the fact that any client can hold a lot of
>>> state.
>> The issue we currently have with courteous server and pynfs 4.0 tests
>> is the number of courteous 4.0 clients the server has to expire when
>> a
>> share reservation conflict occurs when servicing the OPEN. Each
>> client
>> owns only few state in this case so we think the server spent most
>> time
>> for deleting client's record in /var/lib/nfs. This is why we plan to
>> limit the number of courteous clients for now. As a side effect, it
>> might
>> also help to reduce resource consumption too.
> Then kick off a thread or work item to do that asynchronously in the
> background, and return NFS4ERR_DELAY to the clients that were trying to
> grab locks in the meantime.

Thanks Trond, I think this is a reasonable approach. The behavior would
be similar to a delegation recall during the OPEN.

My plan is:

1. If the number of conflicting clients is less than 100 (a number that
should cover realistic usage), release all their state synchronously in
the OPEN call and return NFS4_OK to the NFS client. Most conflicts
should be handled by this case.

2. If the number of conflicting clients is more than 100, release the
state of the first 100 clients as in (1), trigger the laundromat thread
to release the state of the remaining conflicting clients, and return
NFS4ERR_DELAY to the NFS client. This should be a rare condition (sketched below).
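
A minimal illustration of that split, with hypothetical helpers
(expire_courtesy_client, defer_to_laundromat) standing in for the real nfsd
code:

#define COURTESY_SYNC_EXPIRE_MAX 100	/* proposed cutoff, not a tuned value */

/* Illustrative sketch only; helper names are hypothetical. */
static __be32 resolve_courtesy_conflicts(struct list_head *conflicts)
{
	struct nfs4_client *clp, *tmp;
	unsigned int expired = 0;
	bool deferred = false;

	list_for_each_entry_safe(clp, tmp, conflicts, cl_lru) {
		if (expired < COURTESY_SYNC_EXPIRE_MAX) {
			expire_courtesy_client(clp);	/* synchronous */
			expired++;
		} else {
			defer_to_laundromat(clp);	/* background expiry */
			deferred = true;
		}
	}

	return deferred ? nfserr_jukebox : nfs_ok;	/* DELAY vs. OK */
}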

-Dai

>
> The above process is hardly just confined to NFSv4.0 clients. If there
> is a network partition, then the exact same record deleting needs to be
> applied to all NFSv4.1 and NFSv4.2 clients that hold locks and are
> unable to renew their leases, so you might as well make it work for
> everyone.
>

2021-12-01 14:19:24

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, Nov 30, 2021 at 07:52:10PM -0800, [email protected] wrote:
> On 11/30/21 5:37 AM, Trond Myklebust wrote:
> >Then kick off a thread or work item to do that asynchronously in the
> >background, and return NFS4ERR_DELAY to the clients that were trying to
> >grab locks in the meantime.
>
> Thanks Trond, I think this is a reasonable approach. The behavior would
> be similar to a delegation recall during the OPEN.
>
> My plan is:
>
> 1. If the number of conflict clients is less than 100 (some numbers that
> cover realistic usage) then release all their state synchronously in
> the OPEN call, and returns NFS4_OK to the NFS client. Most of conflicts
> should be handled by this case.
>
> 2. If the number of conflict clients is more than 100 then release the
> state of the 1st 100 clients as in (1) and trigger the laundromat thread
> to release state of the rest of the conflict clients, and return
> NFS4ERR_DELAY to the NFS client. This should be a rare condition.

Honestly, conflict with a courtesy client is itself not going to be that
common, so personally I'd start simple and handle everything with the
asynchronous approach.

--b.

2021-12-01 14:39:09

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Tue, Nov 30, 2021 at 07:50:13PM -0800, [email protected] wrote:
>
> On 11/30/21 7:32 AM, Bruce Fields wrote:
> >On Mon, Nov 29, 2021 at 11:13:34PM -0800, [email protected] wrote:
> >>Just to be clear, the problem of pynfs with 4.0 is that the server takes
> >>~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms
> >>per client. This causes the test to time out in waiting for RPC reply of
> >>the OPEN that triggers the conflicts.
> >>
> >>I don't know exactly where the time spent in the process of expiring a
> >>client. But as Bruce mentioned, it could be related to the time to access
> >>/var/lib/nfs to remove the client's persistent record.
> >Could you try something like
> >
> > strace -r -$(pidof) -oTRACE

Oops, I mean $(pidof nfsdcld). But, your system isn't using that:

>
> Strace does not have any info that shows where the server spent time when
> expiring client state. The client record is removed by nfsd4_umh_cltrack_remove
> doing upcall to user space helper /sbin/nfsdcltrack to do the job. I used
> the low-tech debug tool, printk, to measure the time spent by
> nfsd4_client_record_remove. Here is a sample of the output, START and END
> are in milliseconds:
>
> Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d418] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0]
> Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d459] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0]
> Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d461] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0]
> Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d48e] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0]
> Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d49c] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0]
> Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d4c5] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0]
>
> The average time spent to remove the client record is about ~50ms, matches
> with the time reported by pynfs test. This confirms what Bruce suspected
> earlier.

OK, good to know. It'd be interesting to dig into where nfsdcltrack is
spending its time, which we could do by replacing it with a wrapper that
runs the real nfsdcltrack under strace.

Though maybe it'd be better to do this on a system using nfsdcld, since
that's what we're transitioning to.

--b.

2021-12-01 14:51:21

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

Do you have a public git tree with your latest patches?

--b.

2021-12-01 17:43:17

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote:
> OK, good to know. It'd be interesting to dig into where nfsdcltrack is
> spending its time, which we could do by replacing it with a wrapper that
> runs the real nfsdcltrack under strace.
>
> Though maybe it'd be better to do this on a system using nfsdcld, since
> that's what we're transitioning to.

Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of
an sqlite-journal file. On my setup, each of those is taking a few
milliseconds. I wonder if it can do better.

--b.

2021-12-01 18:03:41

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Wed, Dec 01, 2021 at 12:42:05PM -0500, Bruce Fields wrote:
> On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote:
> > OK, good to know. It'd be interesting to dig into where nfsdcltrack is
> > spending its time, which we could do by replacing it with a wrapper that
> > runs the real nfsdcltrack under strace.
> >
> > Though maybe it'd be better to do this on a system using nfsdcld, since
> > that's what we're transitioning to.
>
> Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of
> an sqlite-journal file. On my setup, each of those is taking a few
> milliseconds. I wonder if it can do better.

If I understand the sqlite documentation correctly, I *think* that if we
use journal_mode WAL with synchronous FULL, we should get the assurances
nfsd needs with one sync per transaction.

--b.

2021-12-01 18:47:46

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server


On 12/1/21 6:51 AM, Bruce Fields wrote:
> Do you have a public git tree with your latest patches?

No, I don't, but I can push it to Chuck's public tree. I need to prepare the patch.

-Dai

>
> --b.

2021-12-01 19:26:30

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Wed, Dec 01, 2021 at 10:47:28AM -0800, [email protected] wrote:
>
> On 12/1/21 6:51 AM, Bruce Fields wrote:
> >Do you have a public git tree with your latest patches?
>
> No, I don't, but I can push it to Chuck's public tree. I need to prepare the patch.

OK, it's not a big deal.

--b.

2021-12-01 19:50:55

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Wed, Dec 01, 2021 at 01:03:39PM -0500, Bruce Fields wrote:
> On Wed, Dec 01, 2021 at 12:42:05PM -0500, Bruce Fields wrote:
> > On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote:
> > > OK, good to know. It'd be interesting to dig into where nfsdcltrack is
> > > spending its time, which we could do by replacing it with a wrapper that
> > > runs the real nfsdcltrack under strace.
> > >
> > > Though maybe it'd be better to do this on a system using nfsdcld, since
> > > that's what we're transitioning to.
> >
> > Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of
> > an sqlite-journal file. On my setup, each of those is taking a few
> > milliseconds. I wonder if it can do better.
>
> If I understand the sqlite documentation correctly, I *think* that if we
> use journal_mode WAL with synchronous FULL, we should get the assurances
> nfsd needs with one sync per transaction.

So I *think* that would mean just doing something like (untested, don't have
much idea what I'm doing):

diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
index 03016fb95823..b30f2614497b 100644
--- a/utils/nfsdcld/sqlite.c
+++ b/utils/nfsdcld/sqlite.c
@@ -826,6 +826,13 @@ sqlite_prepare_dbh(const char *topdir)
goto out_close;
}

+ ret = sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
+ if (ret)
+ goto out_close;
+ ret = sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL);
+ if (ret)
+ goto out_close;
+
ret = sqlite_query_schema_version();
switch (ret) {
case CLD_SQLITE_LATEST_SCHEMA_VERSION:

I also wonder how expensive the extra overhead of starting up nfsdcltrack
each time may be.

--b.

2021-12-02 17:53:19

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server



> On Dec 1, 2021, at 9:51 AM, Bruce Fields <[email protected]> wrote:
>
> Do you have a public git tree with your latest patches?
>
> --b.

Dai's patches have been pushed to the nfsd-courteous-server topic branch at

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

I can fold them into my for-next branch if we agree they are ready for
broader test exposure.


--
Chuck Lever




2021-12-03 21:22:02

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

On Wed, Dec 01, 2021 at 02:50:50PM -0500, Bruce Fields wrote:
> On Wed, Dec 01, 2021 at 01:03:39PM -0500, Bruce Fields wrote:
> > On Wed, Dec 01, 2021 at 12:42:05PM -0500, Bruce Fields wrote:
> > > On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote:
> > > > OK, good to know. It'd be interesting to dig into where nfsdcltrack is
> > > > spending its time, which we could do by replacing it with a wrapper that
> > > > runs the real nfsdcltrack under strace.
> > > >
> > > > Though maybe it'd be better to do this on a system using nfsdcld, since
> > > > that's what we're transitioning to.
> > >
> > > Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of
> > > an sqlite-journal file. On my setup, each of those is taking a few
> > > milliseconds. I wonder if it can do better.
> >
> > If I understand the sqlite documentation correctly, I *think* that if we
> > use journal_mode WAL with synchronous FULL, we should get the assurances
> > nfsd needs with one sync per transaction.
>
> So I *think* that would mean just doing something like (untested, don't have
> much idea what I'm doing):

OK, tried that out on my test VM, and: yes, the resulting strace was
much simpler (and, in particular, had only one fdatasync per upcall
instead of 3), and total time to expire 1000 courtesy clients was 6.5
seconds instead of 15.9. So, I'll clean up that patch and pass it along
to Steve D.

This is all a bit of a derail, I know, but I suspect this will be a
bottleneck in other cases too, like when a lot of clients are reclaiming
after reboot.

We do need nfsdcld to sync to disk before returning to the kernel, so
this probably can't be further optimized without doing something more
complicated to allow some kind of parallelism and batching.

So if you have a ton of clients you'll just need /var/lib/nfs to be on
low-latency storage.

--b.

>
> diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
> index 03016fb95823..b30f2614497b 100644
> --- a/utils/nfsdcld/sqlite.c
> +++ b/utils/nfsdcld/sqlite.c
> @@ -826,6 +826,13 @@ sqlite_prepare_dbh(const char *topdir)
> goto out_close;
> }
>
> + ret = sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
> + if (ret)
> + goto out_close;
> + ret = sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL);
> + if (ret)
> + goto out_close;
> +
> ret = sqlite_query_schema_version();
> switch (ret) {
> case CLD_SQLITE_LATEST_SCHEMA_VERSION:
>
> I also wonder how expensive the extra overhead of starting up nfsdcltrack
> each time may be.
>
> --b.

2021-12-03 21:55:34

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH] nfsdcld: use WAL journal for faster commits

From: "J. Bruce Fields" <[email protected]>

Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
SQLite documentation, WAL mode should also be safe, and I can confirm
from an strace that it results in only one fdatasync each.

This may be a bottleneck e.g. when lots of clients are being created or
expired at once (e.g. on reboot).

Not bothering with error checking, as this is just an optimization and
nfsdcld will still function without. (Might be better to log something
on failure, though.)

Signed-off-by: J. Bruce Fields <[email protected]>
---
utils/nfsdcld/sqlite.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
index 03016fb95823..b248eeffa204 100644
--- a/utils/nfsdcld/sqlite.c
+++ b/utils/nfsdcld/sqlite.c
@@ -826,6 +826,9 @@ sqlite_prepare_dbh(const char *topdir)
goto out_close;
}

+ sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
+ sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL);
+
ret = sqlite_query_schema_version();
switch (ret) {
case CLD_SQLITE_LATEST_SCHEMA_VERSION:
--
2.33.1


2021-12-03 22:07:26

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] nfsdcld: use WAL journal for faster commits



> On Dec 3, 2021, at 4:55 PM, Bruce Fields <[email protected]> wrote:
>
> From: "J. Bruce Fields" <[email protected]>
>
> Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
> SQLite documentation, WAL mode should also be safe, and I can confirm
> from an strace that it results in only one fdatasync each.
>
> This may be a bottleneck e.g. when lots of clients are being created or
> expired at once (e.g. on reboot).
>
> Not bothering with error checking, as this is just an optimization and
> nfsdcld will still function without. (Might be better to log something
> on failure, though.)

I'm in full philosophical agreement for performance improvements
in this area. There are some caveats for WAL:

- It requires SQLite v3.7.0 (2010). configure.ac may need to be
updated.

- All processes using the DB file have to be on the same system.
Not sure if this matters for some DR scenarios that nfs-utils
is supposed to support.

- WAL mode is persistent; you could set this at database creation
time and it should be sticky.

- Documentation says "synchronous = FULL is the most commonly
used synchronous setting when not in WAL mode." Why are both
PRAGMAs necessary here?

I agree that nfsdcld functionality is not affected if the first
PRAGMA fails.
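
A minimal sketch of what the "log something on failure" idea from the commit
message might look like, assuming plain stderr logging (the real nfsdcld would
presumably use its own logging facility) and a made-up database path; the
sqlite3_open()/sqlite3_exec()/sqlite3_free() calls themselves are standard
SQLite API.

#include <stdio.h>
#include <sqlite3.h>

/* Enable WAL, but report (rather than silently ignore) a failure. Not
 * fatal: nfsdcld still functions with the default rollback journal. */
static void enable_wal(sqlite3 *dbh)
{
	char *err = NULL;

	if (sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;",
			 NULL, NULL, &err) != SQLITE_OK) {
		fprintf(stderr, "unable to enable WAL journal: %s\n",
			err ? err : "unknown error");
		sqlite3_free(err);
	}
}

int main(void)
{
	sqlite3 *dbh;

	if (sqlite3_open("clients.db", &dbh) != SQLITE_OK)	/* path is made up */
		return 1;
	enable_wal(dbh);
	sqlite3_close(dbh);
	return 0;
}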


> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> utils/nfsdcld/sqlite.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
> index 03016fb95823..b248eeffa204 100644
> --- a/utils/nfsdcld/sqlite.c
> +++ b/utils/nfsdcld/sqlite.c
> @@ -826,6 +826,9 @@ sqlite_prepare_dbh(const char *topdir)
> goto out_close;
> }
>
> + sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
> + sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL);
> +
> ret = sqlite_query_schema_version();
> switch (ret) {
> case CLD_SQLITE_LATEST_SCHEMA_VERSION:
> --
> 2.33.1
>

--
Chuck Lever




2021-12-03 22:39:23

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] nfsdcld: use WAL journal for faster commits

On Fri, Dec 03, 2021 at 10:07:19PM +0000, Chuck Lever III wrote:
>
>
> > On Dec 3, 2021, at 4:55 PM, Bruce Fields <[email protected]> wrote:
> >
> > From: "J. Bruce Fields" <[email protected]>
> >
> > Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
> > SQLite documentation, WAL mode should also be safe, and I can confirm
> > from an strace that it results in only one fdatasync each.
> >
> > This may be a bottleneck e.g. when lots of clients are being created or
> > expired at once (e.g. on reboot).
> >
> > Not bothering with error checking, as this is just an optimization and
> > nfsdcld will still function without. (Might be better to log something
> > on failure, though.)
>
> I'm in full philosophical agreement for performance improvements
> in this area. There are some caveats for WAL:
>
> - It requires SQLite v3.7.0 (2010). configure.ac may need to be
> updated.

Makes sense. But I dug around a bit, and how to do this is a total
mystery to me....

> - All processes using the DB file have to be on the same system.
> Not sure if this matters for some DR scenarios that nfs-utils
> is supposed to support.

I don't think we support that.

> - WAL mode is persistent; you could set this at database creation
> time and it should be sticky.

I wanted to upgrade existing databases too, and figured there's no harm
in calling it on each startup.

> - Documentation says "synchronous = FULL is the most commonly
> used synchronous setting when not in WAL mode." Why are both
> PRAGMAs necessary here?

Maybe they're not; I think I did see that FULL is actually the default
but I can't find that in the documentation right now.

--b.
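
One way to settle the question without digging through the documentation is
to ask the database itself. A small standalone check (the test.db path is
made up) that prints the journal_mode and synchronous values SQLite actually
reports after the WAL pragma:

#include <stdio.h>
#include <sqlite3.h>

/* sqlite3_exec() callback: print the first column of each result row. */
static int print_row(void *tag, int ncols, char **vals, char **names)
{
	(void)ncols; (void)names;
	printf("%s = %s\n", (const char *)tag, vals[0] ? vals[0] : "(null)");
	return 0;
}

int main(void)
{
	sqlite3 *dbh;

	if (sqlite3_open("test.db", &dbh) != SQLITE_OK)
		return 1;
	/* "PRAGMA journal_mode = WAL" returns the mode actually in effect. */
	sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", print_row,
		     "journal_mode", NULL);
	/* 0 = OFF, 1 = NORMAL, 2 = FULL, 3 = EXTRA */
	sqlite3_exec(dbh, "PRAGMA synchronous;", print_row,
		     "synchronous", NULL);
	sqlite3_close(dbh);
	return 0;
}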

>
> I agree that nfsdcld functionality is not affected if the first
> PRAGMA fails.
>
>
> > Signed-off-by: J. Bruce Fields <[email protected]>
> > ---
> > utils/nfsdcld/sqlite.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
> > index 03016fb95823..b248eeffa204 100644
> > --- a/utils/nfsdcld/sqlite.c
> > +++ b/utils/nfsdcld/sqlite.c
> > @@ -826,6 +826,9 @@ sqlite_prepare_dbh(const char *topdir)
> > goto out_close;
> > }
> >
> > + sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
> > + sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL);
> > +
> > ret = sqlite_query_schema_version();
> > switch (ret) {
> > case CLD_SQLITE_LATEST_SCHEMA_VERSION:
> > --
> > 2.33.1
> >
>
> --
> Chuck Lever
>
>

2021-12-04 00:35:35

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] nfsdcld: use WAL journal for faster commits


> On Dec 3, 2021, at 5:39 PM, Bruce Fields <[email protected]> wrote:
>
> On Fri, Dec 03, 2021 at 10:07:19PM +0000, Chuck Lever III wrote:
>>
>>
>>>> On Dec 3, 2021, at 4:55 PM, Bruce Fields <[email protected]> wrote:
>>>
>>> From: "J. Bruce Fields" <[email protected]>
>>>
>>> Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
>>> SQLite documentation, WAL mode should also be safe, and I can confirm
>>> from an strace that it results in only one fdatasync each.
>>>
>>> This may be a bottleneck e.g. when lots of clients are being created or
>>> expired at once (e.g. on reboot).
>>>
>>> Not bothering with error checking, as this is just an optimization and
>>> nfsdcld will still function without. (Might be better to log something
>>> on failure, though.)
>>
>> I'm in full philosophical agreement for performance improvements
>> in this area. There are some caveats for WAL:
>>
>> - It requires SQLite v3.7.0 (2010). configure.ac may need to be
>> updated.
>
> Makes sense. But I dug around a bit, and how to do this is a total
> mystery to me....

aclocal/libsqlite3.m4 has an SQLITE_VERSION_NUMBER check.
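
For reference, the m4 test in question compiles and runs essentially the
following program; the only change Bruce's take 2 below makes is raising the
version threshold from 3003000 to 3007000, the first release with WAL support.

#include <sqlite3.h>

/*
 * Standalone rendering of the check in aclocal/libsqlite3.m4: the header's
 * compile-time SQLITE_VERSION_NUMBER must match the library found at run
 * time, and the version must be at least 3.7.0.  A non-zero exit status
 * means "mismatch or too old".
 */
int main(void)
{
	int vers = sqlite3_libversion_number();

	return vers != SQLITE_VERSION_NUMBER || vers < 3007000;
}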


>> - All processes using the DB file have to be on the same system.
>> Not sure if this matters for some DR scenarios that nfs-utils
>> is supposed to support.
>
> I don't think we support that.
>
>> - WAL mode is persistent; you could set this at database creation
>> time and it should be sticky.
>
> I wanted to upgrade existing databases too, and figured there's no harm
> in calling it on each startup.

Ah. Makes sense!


>> - Documentation says "synchronous = FULL is the most commonly
>> used synchronous setting when not in WAL mode." Why are both
>> PRAGMAs necessary here?
>
> Maybe they're not; I think I did see that FULL is actually the default
> but I can't find that in the documentation right now.
>
> --b.
>
>>
>> I agree that nfsdcld functionality is not affected if the first
>> PRAGMA fails.
>>
>>
>>> Signed-off-by: J. Bruce Fields <[email protected]>
>>> ---
>>> utils/nfsdcld/sqlite.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
>>> index 03016fb95823..b248eeffa204 100644
>>> --- a/utils/nfsdcld/sqlite.c
>>> +++ b/utils/nfsdcld/sqlite.c
>>> @@ -826,6 +826,9 @@ sqlite_prepare_dbh(const char *topdir)
>>> goto out_close;
>>> }
>>>
>>> + sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
>>> + sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL);
>>> +
>>> ret = sqlite_query_schema_version();
>>> switch (ret) {
>>> case CLD_SQLITE_LATEST_SCHEMA_VERSION:
>>> --
>>> 2.33.1
>>>
>>
>> --
>> Chuck Lever
>>
>>

2021-12-04 01:24:05

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] nfsdcld: use WAL journal for faster commits

On Sat, Dec 04, 2021 at 12:35:11AM +0000, Chuck Lever III wrote:
>
> > On Dec 3, 2021, at 5:39 PM, Bruce Fields <[email protected]> wrote:
> >
> > On Fri, Dec 03, 2021 at 10:07:19PM +0000, Chuck Lever III wrote:
> >>
> >>
> >>>> On Dec 3, 2021, at 4:55 PM, Bruce Fields <[email protected]> wrote:
> >>>
> >>> From: "J. Bruce Fields" <[email protected]>
> >>>
> >>> Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
> >>> SQLite documentation, WAL mode should also be safe, and I can confirm
> >>> from an strace that it results in only one fdatasync each.
> >>>
> >>> This may be a bottleneck e.g. when lots of clients are being created or
> >>> expired at once (e.g. on reboot).
> >>>
> >>> Not bothering with error checking, as this is just an optimization and
> >>> nfsdcld will still function without. (Might be better to log something
> >>> on failure, though.)
> >>
> >> I'm in full philosophical agreement for performance improvements
> >> in this area. There are some caveats for WAL:
> >>
> >> - It requires SQLite v3.7.0 (2010). configure.ac may need to be
> >> updated.
> >
> > Makes sense. But I dug around a bit, and how to do this is a total
> > mystery to me....
>
> aclocal/libsqlite3.m4 has an SQLITE_VERSION_NUMBER check.

I looked, and got stuck trying to figure out why the #%^ it's comparing to
SQLITE_VERSION_NUMBER. OK, I see now, it's just a sanity check.
Here's take 2.

--b.

From c22d3a2d8576273934773fefac933d4f8e04776b Mon Sep 17 00:00:00 2001
From: "J. Bruce Fields" <[email protected]>
Date: Fri, 3 Dec 2021 10:27:53 -0500
Subject: [PATCH] nfsdcld: use WAL journal for faster commits

Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
SQLite documentation, WAL mode should also be safe, and I can confirm
from an strace that it results in only one fdatasync each.

This may be a bottleneck e.g. when lots of clients are being created or
expired at once (e.g. on reboot).

Not bothering with error checking, as this is just an optimization and
nfsdcld will still function without. (Might be better to log something
on failure, though.)

Signed-off-by: J. Bruce Fields <[email protected]>
---
aclocal/libsqlite3.m4 | 2 +-
utils/nfsdcld/sqlite.c | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/aclocal/libsqlite3.m4 b/aclocal/libsqlite3.m4
index 8c38993cbba8..c3beb4d56f0c 100644
--- a/aclocal/libsqlite3.m4
+++ b/aclocal/libsqlite3.m4
@@ -22,7 +22,7 @@ AC_DEFUN([AC_SQLITE3_VERS], [
int vers = sqlite3_libversion_number();

return vers != SQLITE_VERSION_NUMBER ||
- vers < 3003000;
+ vers < 3007000;
}
], [libsqlite3_cv_is_recent=yes], [libsqlite3_cv_is_recent=no],
[libsqlite3_cv_is_recent=unknown])
diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
index 03016fb95823..eabb0daa95f5 100644
--- a/utils/nfsdcld/sqlite.c
+++ b/utils/nfsdcld/sqlite.c
@@ -826,6 +826,8 @@ sqlite_prepare_dbh(const char *topdir)
goto out_close;
}

+ sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
+
ret = sqlite_query_schema_version();
switch (ret) {
case CLD_SQLITE_LATEST_SCHEMA_VERSION:
--
2.33.1


2021-12-06 15:51:26

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] nfsdcld: use WAL journal for faster commits


> On Dec 3, 2021, at 8:24 PM, Bruce Fields <[email protected]> wrote:
>
> On Sat, Dec 04, 2021 at 12:35:11AM +0000, Chuck Lever III wrote:
>>
>>> On Dec 3, 2021, at 5:39 PM, Bruce Fields <[email protected]> wrote:
>>>
>>> On Fri, Dec 03, 2021 at 10:07:19PM +0000, Chuck Lever III wrote:
>>>>
>>>>
>>>>>> On Dec 3, 2021, at 4:55 PM, Bruce Fields <[email protected]> wrote:
>>>>>
>>>>> From: "J. Bruce Fields" <[email protected]>
>>>>>
>>>>> Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
>>>>> SQLite documentation, WAL mode should also be safe, and I can confirm
>>>>> from an strace that it results in only one fdatasync each.
>>>>>
>>>>> This may be a bottleneck e.g. when lots of clients are being created or
>>>>> expired at once (e.g. on reboot).
>>>>>
>>>>> Not bothering with error checking, as this is just an optimization and
>>>>> nfsdcld will still function without. (Might be better to log something
>>>>> on failure, though.)
>>>>
>>>> I'm in full philosophical agreement for performance improvements
>>>> in this area. There are some caveats for WAL:
>>>>
>>>> - It requires SQLite v3.7.0 (2010). configure.ac may need to be
>>>> updated.
>>>
>>> Makes sense. But I dug around a bit, and how to do this is a total
>>> mystery to me....
>>
>> aclocal/libsqlite3.m4 has an SQLITE_VERSION_NUMBER check.
>
> I looked got stuck trying to figure out why the #%^ it's comparing to
> SQLITE_VERSION_NUMBER. OK, I see now, it's just some sanity check.
> Here's take 2.
>
> --b.
>
> From c22d3a2d8576273934773fefac933d4f8e04776b Mon Sep 17 00:00:00 2001
> From: "J. Bruce Fields" <[email protected]>
> Date: Fri, 3 Dec 2021 10:27:53 -0500
> Subject: [PATCH] nfsdcld: use WAL journal for faster commits
>
> Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
> SQLite documentation, WAL mode should also be safe, and I can confirm
> from an strace that it results in only one fdatasync each.
>
> This may be a bottleneck e.g. when lots of clients are being created or
> expired at once (e.g. on reboot).
>
> Not bothering with error checking, as this is just an optimization and
> nfsdcld will still function without. (Might be better to log something
> on failure, though.)
>
> Signed-off-by: J. Bruce Fields <[email protected]>

Nice.

Reviewed-by: Chuck Lever <[email protected]>


> ---
> aclocal/libsqlite3.m4 | 2 +-
> utils/nfsdcld/sqlite.c | 2 ++
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/aclocal/libsqlite3.m4 b/aclocal/libsqlite3.m4
> index 8c38993cbba8..c3beb4d56f0c 100644
> --- a/aclocal/libsqlite3.m4
> +++ b/aclocal/libsqlite3.m4
> @@ -22,7 +22,7 @@ AC_DEFUN([AC_SQLITE3_VERS], [
> int vers = sqlite3_libversion_number();
>
> return vers != SQLITE_VERSION_NUMBER ||
> - vers < 3003000;
> + vers < 3007000;
> }
> ], [libsqlite3_cv_is_recent=yes], [libsqlite3_cv_is_recent=no],
> [libsqlite3_cv_is_recent=unknown])
> diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
> index 03016fb95823..eabb0daa95f5 100644
> --- a/utils/nfsdcld/sqlite.c
> +++ b/utils/nfsdcld/sqlite.c
> @@ -826,6 +826,8 @@ sqlite_prepare_dbh(const char *topdir)
> goto out_close;
> }
>
> + sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
> +
> ret = sqlite_query_schema_version();
> switch (ret) {
> case CLD_SQLITE_LATEST_SCHEMA_VERSION:
> --
> 2.33.1

--
Chuck Lever



2022-01-04 22:24:47

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH] nfsdcld: use WAL journal for faster commits

From: "J. Bruce Fields" <[email protected]>

Currently nfsdcld is doing three fdatasyncs for each upcall. Based on
SQLite documentation, WAL mode should also be safe, and I can confirm
from an strace that it results in only one fdatasync each.

This may be a bottleneck e.g. when lots of clients are being created or
expired at once (e.g. on reboot).

Not bothering with error checking, as this is just an optimization and
nfsdcld will still function without. (Might be better to log something
on failure, though.)

Reviewed-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
aclocal/libsqlite3.m4 | 2 +-
utils/nfsdcld/sqlite.c | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

Resending to make sure SteveD saw this.... --b.

diff --git a/aclocal/libsqlite3.m4 b/aclocal/libsqlite3.m4
index 8c38993cbba8..c3beb4d56f0c 100644
--- a/aclocal/libsqlite3.m4
+++ b/aclocal/libsqlite3.m4
@@ -22,7 +22,7 @@ AC_DEFUN([AC_SQLITE3_VERS], [
int vers = sqlite3_libversion_number();

return vers != SQLITE_VERSION_NUMBER ||
- vers < 3003000;
+ vers < 3007000;
}
], [libsqlite3_cv_is_recent=yes], [libsqlite3_cv_is_recent=no],
[libsqlite3_cv_is_recent=unknown])
diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
index 03016fb95823..eabb0daa95f5 100644
--- a/utils/nfsdcld/sqlite.c
+++ b/utils/nfsdcld/sqlite.c
@@ -826,6 +826,8 @@ sqlite_prepare_dbh(const char *topdir)
goto out_close;
}

+ sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
+
ret = sqlite_query_schema_version();
switch (ret) {
case CLD_SQLITE_LATEST_SCHEMA_VERSION:
--
2.33.1