2023-02-13 21:13:52

by Jeff Layton

Subject: [PATCH 0/3] nfsd: write verifier fixes and optimization

While looking at the recent problems with the fsync during nfsd_file
cleanup, it occurred to me that we could greatly simplify and improve
the server's write verifier handling. I also noticed an existing bug
which is fixed in patch #1.

Instead of trying to check for errors via fsync and resetting the write
verifier when we get one, we can just fold the current value of the
inode's errseq_t into the hashed verifier that is generated at startup
time.

Testing this new scheme has been a real challenge. On all local
filesystems, once a writeback error is recorded, further attempts to
write to the inode return -EIO (and some filesystems even flip to r/o),
so you never see the new verifier.

Trond, you originally added the code to make it reset the verifier on a
writeback error. Do you have a good way to test that? Did you guys use
NFSv3 reexport for testing this somehow?

Jeff Layton (3):
nfsd: copy the whole verifier in nfsd_copy_write_verifier
errseq: add a new errseq_fetch helper
nfsd: simplify write verifier handling

fs/nfsd/filecache.c | 22 +------------------
fs/nfsd/netns.h | 4 ----
fs/nfsd/nfs4proc.c | 17 ++++++---------
fs/nfsd/nfsctl.c | 1 -
fs/nfsd/nfssvc.c | 48 +++++++++++++-----------------------------
fs/nfsd/trace.h | 28 ------------------------
fs/nfsd/vfs.c | 28 +++++-------------------
fs/nfsd/vfs.h | 1 +
include/linux/errseq.h | 1 +
lib/errseq.c | 33 ++++++++++++++++++++++++++++-
10 files changed, 62 insertions(+), 121 deletions(-)

--
2.39.1



2023-02-13 21:13:52

by Jeff Layton

Subject: [PATCH 1/3] nfsd: copy the whole verifier in nfsd_copy_write_verifier

Currently, we're only memcpy'ing the first __be32. Ensure we copy into
both words.

Fixes: 91d2e9b56cf5 ("NFSD: Clean up the nfsd_net::nfssvc_boot field")
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index fe5e4f73bb98..3a38ab304b02 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -363,7 +363,7 @@ void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)

do {
read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
- memcpy(verf, nn->writeverf, sizeof(*verf));
+ memcpy(verf, nn->writeverf, sizeof(*verf) * 2);
} while (need_seqretry(&nn->writeverf_lock, seq));
done_seqretry(&nn->writeverf_lock, seq);
}
--
2.39.1


2023-02-13 21:13:54

by Jeff Layton

Subject: [PATCH 2/3] errseq: add a new errseq_fetch helper

In a later patch, nfsd is going to need to fetch the current errseq_t
value for its write verifiers. Ordinarily, we'd use errseq_sample, but
in the event that the value hasn't been SEEN, we don't want to return a
0. Resurrect the old errseq_sample routine (before Willy fixed it) and
rechristen it as errseq_fetch.

Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/errseq.h | 1 +
lib/errseq.c | 33 ++++++++++++++++++++++++++++++++-
2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index fc2777770768..13a731236c9b 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -9,6 +9,7 @@ typedef u32 errseq_t;

errseq_t errseq_set(errseq_t *eseq, int err);
errseq_t errseq_sample(errseq_t *eseq);
+errseq_t errseq_fetch(errseq_t *eseq);
int errseq_check(errseq_t *eseq, errseq_t since);
int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
#endif
diff --git a/lib/errseq.c b/lib/errseq.c
index 93e9b94358dc..f243b7dc36f5 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -109,7 +109,7 @@ errseq_t errseq_set(errseq_t *eseq, int err)
EXPORT_SYMBOL(errseq_set);

/**
- * errseq_sample() - Grab current errseq_t value.
+ * errseq_sample() - Grab current errseq_t value (or 0 if it's unseen)
* @eseq: Pointer to errseq_t to be sampled.
*
* This function allows callers to initialise their errseq_t variable.
@@ -131,6 +131,37 @@ errseq_t errseq_sample(errseq_t *eseq)
}
EXPORT_SYMBOL(errseq_sample);

+/**
+ * errseq_fetch() - Grab current errseq_t value
+ * @eseq: Pointer to errseq_t to be sampled.
+ *
+ * This function grabs the current errseq_t value, marks it as SEEN,
+ * and returns it. This differs from a "sample" in that we
+ * grab the actual value even if it has not been seen before (instead of
+ * returning 0 in that case).
+ *
+ * Context: Any context.
+ * Return: The current errseq value.
+ */
+errseq_t errseq_fetch(errseq_t *eseq)
+{
+ errseq_t old = READ_ONCE(*eseq);
+ errseq_t new = old;
+
+ /*
+ * For the common case of no errors ever having been set, we can skip
+ * marking the SEEN bit. Once an error has been set, the value will
+ * never go back to zero.
+ */
+ if (old != 0) {
+ new |= ERRSEQ_SEEN;
+ if (old != new)
+ cmpxchg(eseq, old, new);
+ }
+ return new;
+}
+EXPORT_SYMBOL(errseq_fetch);
+
/**
* errseq_check() - Has an error occurred since a particular sample point?
* @eseq: Pointer to errseq_t value to be checked.
--
2.39.1


2023-02-13 21:13:56

by Jeff Layton

Subject: [PATCH 3/3] nfsd: simplify write verifier handling

The write verifier exists to tell the client when the server may have
forgotten some unstable writes. The typical way that this happens is if
the server crashes, but we've also extended nfsd to change it when there
are writeback errors as well.

The way it works today though, we call something like vfs_fsync (e.g.
for a COMMIT call) and if we get back an error, we'll reset the write
verifier.

This is non-optimal for a couple of reasons:

1/ There could be significant delay between an error being
recorded and the reset. It would be ideal if the write verifier were to
change as soon as the error was recorded.

2/ It's a bit of a waste, in that if we get a writeback error on a
single inode, we'll end up resetting the write verifier for everything,
even on inodes that may be fine (e.g. on a completely separate fs).

The protocol doesn't require that we use the same verifier for all
inodes. The only requirement is that the verifier change if the server
may have forgotten some unstable writes.

Instead of resetting the per-net write verifier on errors, we can just
fetch the current errseq_t for the inode, and fold that value into the
verifier on a write. If an error is reported, then that value will
change and the verifier will also naturally change without nfsd having
to take any explicit steps.

Make nfsd only set the per-net verifier at startup time. When we need a
verifier for a reply, fetch the current errseq_t value for the mapping
and xor it into the per-net verifier.

Cc: Trond Myklebust <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/filecache.c | 22 +--------------------
fs/nfsd/netns.h | 4 ----
fs/nfsd/nfs4proc.c | 17 +++++++---------
fs/nfsd/nfsctl.c | 1 -
fs/nfsd/nfssvc.c | 48 ++++++++++++++-------------------------------
fs/nfsd/trace.h | 28 --------------------------
fs/nfsd/vfs.c | 28 +++++---------------------
fs/nfsd/vfs.h | 1 +
8 files changed, 29 insertions(+), 120 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 3b9a10378c83..1ca9ad0aabcd 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -233,23 +233,6 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
return nf;
}

-/**
- * nfsd_file_check_write_error - check for writeback errors on a file
- * @nf: nfsd_file to check for writeback errors
- *
- * Check whether a nfsd_file has an unseen error. Reset the write
- * verifier if so.
- */
-static void
-nfsd_file_check_write_error(struct nfsd_file *nf)
-{
- struct file *file = nf->nf_file;
-
- if ((file->f_mode & FMODE_WRITE) &&
- filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err)))
- nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
-}
-
static void
nfsd_file_hash_remove(struct nfsd_file *nf)
{
@@ -281,10 +264,8 @@ nfsd_file_free(struct nfsd_file *nf)
nfsd_file_unhash(nf);
if (nf->nf_mark)
nfsd_file_mark_put(nf->nf_mark);
- if (nf->nf_file) {
- nfsd_file_check_write_error(nf);
+ if (nf->nf_file)
filp_close(nf->nf_file, NULL);
- }

/*
* If this item is still linked via nf_lru, that's a bug.
@@ -1038,7 +1019,6 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
out:
if (status == nfs_ok) {
this_cpu_inc(nfsd_file_acquisitions);
- nfsd_file_check_write_error(nf);
*pnf = nf;
}
put_cred(cred);
diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index 51a4b7885cae..95d7adb73b24 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -109,7 +109,6 @@ struct nfsd_net {
bool nfsd_net_up;
bool lockd_up;

- seqlock_t writeverf_lock;
unsigned char writeverf[8];

/*
@@ -204,7 +203,4 @@ struct nfsd_net {
extern void nfsd_netns_free_versions(struct nfsd_net *nn);

extern unsigned int nfsd_net_id;
-
-void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn);
-void nfsd_reset_write_verifier(struct nfsd_net *nn);
#endif /* __NFSD_NETNS_H__ */
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 5ae670807449..d63648d79f17 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -717,15 +717,6 @@ nfsd4_access(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
&access->ac_supported);
}

-static void gen_boot_verifier(nfs4_verifier *verifier, struct net *net)
-{
- __be32 *verf = (__be32 *)verifier->data;
-
- BUILD_BUG_ON(2*sizeof(*verf) != sizeof(verifier->data));
-
- nfsd_copy_write_verifier(verf, net_generic(net, nfsd_net_id));
-}
-
static __be32
nfsd4_commit(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
union nfsd4_op_u *u)
@@ -1586,11 +1577,17 @@ static const struct nfsd4_callback_ops nfsd4_cb_offload_ops = {

static void nfsd4_init_copy_res(struct nfsd4_copy *copy, bool sync)
{
+ __be32 *verf;
+ nfs4_verifier *verifier = &copy->cp_res.wr_verifier;
+
+ BUILD_BUG_ON(2*sizeof(*verf) != sizeof(verifier->data));
+
copy->cp_res.wr_stable_how =
test_bit(NFSD4_COPY_F_COMMITTED, &copy->cp_flags) ?
NFS_FILE_SYNC : NFS_UNSTABLE;
nfsd4_copy_set_sync(copy, sync);
- gen_boot_verifier(&copy->cp_res.wr_verifier, copy->cp_clp->net);
+ verf = (__be32 *)verifier->data;
+ nfsd_set_write_verifier(verf, copy->nf_dst);
}

static ssize_t _nfsd_copy_file_range(struct nfsd4_copy *copy,
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 04474b8ccf0a..28a73577beaa 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1501,7 +1501,6 @@ static __net_init int nfsd_init_net(struct net *net)
nn->nfsd4_minorversions = NULL;
nfsd4_init_leases_net(nn);
get_random_bytes(&nn->siphash_key, sizeof(nn->siphash_key));
- seqlock_init(&nn->writeverf_lock);

return 0;

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 3a38ab304b02..8b09bfc1fae9 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -350,25 +350,27 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
}

/**
- * nfsd_copy_write_verifier - Atomically copy a write verifier
+ * nfsd_set_write_verifier - set the write verifier for a call
* @verf: buffer in which to receive the verifier cookie
- * @nn: NFS net namespace
+ * @nf: nfsd_file being operated on for write
*
- * This function provides a wait-free mechanism for copying the
- * namespace's write verifier without tearing it.
+ * Grab the (static) write verifier for the nfsd_net, and then fold
+ * the current errseq_t value into it for the inode to create a
+ * write verifier.
*/
-void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
+void nfsd_set_write_verifier(__be32 verf[2], struct nfsd_file *nf)
{
- int seq = 0;
+ struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
+ errseq_t eseq = errseq_fetch(&nf->nf_file->f_mapping->wb_err);

- do {
- read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
- memcpy(verf, nn->writeverf, sizeof(*verf) * 2);
- } while (need_seqretry(&nn->writeverf_lock, seq));
- done_seqretry(&nn->writeverf_lock, seq);
+ /* copy in the per-net write verifier */
+ memcpy(verf, nn->writeverf, sizeof(*verf) * 2);
+
+ /* fold the errseq into first word */
+ verf[0] ^= (__force __be32)eseq;
}

-static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
+static void nfsd_init_write_verifier(struct nfsd_net *nn)
{
struct timespec64 now;
u64 verf;
@@ -382,26 +384,6 @@ static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
memcpy(nn->writeverf, &verf, sizeof(nn->writeverf));
}

-/**
- * nfsd_reset_write_verifier - Generate a new write verifier
- * @nn: NFS net namespace
- *
- * This function updates the ->writeverf field of @nn. This field
- * contains an opaque cookie that, according to Section 18.32.3 of
- * RFC 8881, "the client can use to determine whether a server has
- * changed instance state (e.g., server restart) between a call to
- * WRITE and a subsequent call to either WRITE or COMMIT. This
- * cookie MUST be unchanged during a single instance of the NFSv4.1
- * server and MUST be unique between instances of the NFSv4.1
- * server."
- */
-void nfsd_reset_write_verifier(struct nfsd_net *nn)
-{
- write_seqlock(&nn->writeverf_lock);
- nfsd_reset_write_verifier_locked(nn);
- write_sequnlock(&nn->writeverf_lock);
-}
-
static int nfsd_startup_net(struct net *net, const struct cred *cred)
{
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
@@ -687,7 +669,7 @@ int nfsd_create_serv(struct net *net)
register_inet6addr_notifier(&nfsd_inet6addr_notifier);
#endif
}
- nfsd_reset_write_verifier(nn);
+ nfsd_init_write_verifier(nn);
return 0;
}

diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 4183819ea082..93d76fe33514 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -744,34 +744,6 @@ DEFINE_EVENT(nfsd_net_class, nfsd_##name, \
DEFINE_NET_EVENT(grace_start);
DEFINE_NET_EVENT(grace_complete);

-TRACE_EVENT(nfsd_writeverf_reset,
- TP_PROTO(
- const struct nfsd_net *nn,
- const struct svc_rqst *rqstp,
- int error
- ),
- TP_ARGS(nn, rqstp, error),
- TP_STRUCT__entry(
- __field(unsigned long long, boot_time)
- __field(u32, xid)
- __field(int, error)
- __array(unsigned char, verifier, NFS4_VERIFIER_SIZE)
- ),
- TP_fast_assign(
- __entry->boot_time = nn->boot_time;
- __entry->xid = be32_to_cpu(rqstp->rq_xid);
- __entry->error = error;
-
- /* avoid seqlock inside TP_fast_assign */
- memcpy(__entry->verifier, nn->writeverf,
- NFS4_VERIFIER_SIZE);
- ),
- TP_printk("boot_time=%16llx xid=0x%08x error=%d new verifier=0x%s",
- __entry->boot_time, __entry->xid, __entry->error,
- __print_hex_str(__entry->verifier, NFS4_VERIFIER_SIZE)
- )
-);
-
TRACE_EVENT(nfsd_clid_cred_mismatch,
TP_PROTO(
const struct nfs4_client *clp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 21d5209f6e04..6e8ba8357735 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -628,17 +628,12 @@ __be32 nfsd4_clone_file_range(struct svc_rqst *rqstp,
if (!status)
status = commit_inode_metadata(file_inode(src));
if (status < 0) {
- struct nfsd_net *nn = net_generic(nf_dst->nf_net,
- nfsd_net_id);
-
trace_nfsd_clone_file_range_err(rqstp,
&nfsd4_get_cstate(rqstp)->save_fh,
src_pos,
&nfsd4_get_cstate(rqstp)->current_fh,
dst_pos,
count, status);
- nfsd_reset_write_verifier(nn);
- trace_nfsd_writeverf_reset(nn, rqstp, status);
ret = nfserrno(status);
}
}
@@ -1058,7 +1053,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
unsigned long *cnt, int stable,
__be32 *verf)
{
- struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
struct file *file = nf->nf_file;
struct super_block *sb = file_inode(file)->i_sb;
struct svc_export *exp;
@@ -1103,13 +1097,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
since = READ_ONCE(file->f_wb_err);
if (verf)
- nfsd_copy_write_verifier(verf, nn);
+ nfsd_set_write_verifier(verf, nf);
host_err = vfs_iter_write(file, &iter, &pos, flags);
- if (host_err < 0) {
- nfsd_reset_write_verifier(nn);
- trace_nfsd_writeverf_reset(nn, rqstp, host_err);
+ if (host_err < 0)
goto out_nfserr;
- }
*cnt = host_err;
nfsd_stats_io_write_add(exp, *cnt);
fsnotify_modify(file);
@@ -1117,13 +1108,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
if (host_err < 0)
goto out_nfserr;

- if (stable && use_wgather) {
+ if (stable && use_wgather)
host_err = wait_for_concurrent_writes(file);
- if (host_err < 0) {
- nfsd_reset_write_verifier(nn);
- trace_nfsd_writeverf_reset(nn, rqstp, host_err);
- }
- }

out_nfserr:
if (host_err >= 0) {
@@ -1223,7 +1209,6 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
__be32 err = nfs_ok;
u64 maxbytes;
loff_t start, end;
- struct nfsd_net *nn;

/*
* Convert the client-provided (offset, count) range to a
@@ -1240,7 +1225,6 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
end = offset + count - 1;
}

- nn = net_generic(nf->nf_net, nfsd_net_id);
if (EX_ISSYNC(fhp->fh_export)) {
errseq_t since = READ_ONCE(nf->nf_file->f_wb_err);
int err2;
@@ -1248,7 +1232,7 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
err2 = vfs_fsync_range(nf->nf_file, start, end, 0);
switch (err2) {
case 0:
- nfsd_copy_write_verifier(verf, nn);
+ nfsd_set_write_verifier(verf, nf);
err2 = filemap_check_wb_err(nf->nf_file->f_mapping,
since);
err = nfserrno(err2);
@@ -1257,12 +1241,10 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
err = nfserr_notsupp;
break;
default:
- nfsd_reset_write_verifier(nn);
- trace_nfsd_writeverf_reset(nn, rqstp, err2);
err = nfserrno(err2);
}
} else
- nfsd_copy_write_verifier(verf, nn);
+ nfsd_set_write_verifier(verf, nf);

return err;
}
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index dbdfef7ae85b..e69f304407ae 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -176,4 +176,5 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
AT_STATX_SYNC_AS_STAT));
}

+void nfsd_set_write_verifier(__be32 verf[2], struct nfsd_file *nf);
#endif /* LINUX_NFSD_VFS_H */
--
2.39.1


2023-02-14 00:49:33

by Rick Macklem

Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <[email protected]> wrote:
>
>
> The write verifier exists to tell the client when the server may have
> forgotten some unstable writes. The typical way that this happens is if
> the server crashes, but we've also extended nfsd to change it when there
> are writeback errors as well.
>
> The way it works today though, we call something like vfs_fsync (e.g.
> for a COMMIT call) and if we get back an error, we'll reset the write
> verifier.
>
> This is non-optimal for a couple of reasons:
>
> 1/ There could be significant delay between an error being
> recorded and the reset. It would be ideal if the write verifier were to
> change as soon as the error was recorded.
>
> 2/ It's a bit of a waste, in that if we get a writeback error on a
> single inode, we'll end up resetting the write verifier for everything,
> even on inodes that may be fine (e.g. on a completely separate fs).
>
Here's the snippet from RFC8881:

    The final portion of the result is the field writeverf. This field
    is the write verifier and is a cookie that the client can use to
    determine whether a server has changed instance state (e.g., server
    restart) between a call to WRITE and a subsequent call to either
    WRITE or COMMIT. This cookie MUST be unchanged during a single
    instance of the NFSv4.1 server and MUST be unique between instances
    of the NFSv4.1 server. If the cookie changes, then the client MUST
    assume that any data written with an UNSTABLE4 value for committed
    and an old writeverf in the reply has been lost and will need to be
    recovered.

I've always interpreted the writeverf as "per-server" and not "per-file".
Although I'll admit the above does not make that crystal clear, it does make
it clear that the writeverf applies to a "server instance" and not a file or
file system on the server.

The FreeBSD client assumes it is "per-server" and re-writes all uncommitted
writes for the server, not just the ones for the file (or file system) that
the writeverf was replied with. (I vaguely recall Solaris does the same?)

At the very least, I think you should run this past the IETF working group
([email protected]) to see what they say w.r.t. the writeverf being "per-file" vs
"per-server".

rick


2023-02-14 03:28:54

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <[email protected]>
> wrote:
> >
> > CAUTION: This email originated from outside of the University of
> > Guelph. Do not click links or open attachments unless you recognize
> > the sender and know the content is safe. If in doubt, forward
> > suspicious emails to [email protected]
> >
> >
> > The write verifier exists to tell the client when the server may
> > have
> > forgotten some unstable writes. The typical way that this happens
> > is if
> > the server crashes, but we've also extended nfsd to change it when
> > there
> > are writeback errors as well.
> >
> > The way it works today though, we call something like vfs_fsync
> > (e.g.
> > for a COMMIT call) and if we get back an error, we'll reset the
> > write
> > verifier.
> >
> > This is non-optimal for a couple of reasons:
> >
> > 1/ There could be significant delay between an error being
> > recorded and the reset. It would be ideal if the write verifier
> > were to
> > change as soon as the error was recorded.
> >
> > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > single inode, we'll end up resetting the write verifier for
> > everything,
> > even on inodes that may be fine (e.g. on a completely separate fs).
> >
> Here's the snippet from RFC8881:
>    The final portion of the result is the field writeverf.  This
> field
>    is the write verifier and is a cookie that the client can use to
>    determine whether a server has changed instance state (e.g.,
> server
>    restart) between a call to WRITE and a subsequent call to either
>    WRITE or COMMIT.  This cookie MUST be unchanged during a single
>    instance of the NFSv4.1 server and MUST be unique between
> instances
>    of the NFSv4.1 server.  If the cookie changes, then the client
> MUST
>    assume that any data written with an UNSTABLE4 value for committed
>    and an old writeverf in the reply has been lost and will need to
> be
>    recovered.
>
> I've always interpreted the writeverf as "per-server" and not  "per-
> file".
> Although I'll admit the above does not make that crystal clear, it
> does make
> it clear that the writeverf applies to a "server instance" and not a
> file or
> file system on the server.
>
> The FreeBSD client assumes it is "per-server" and re-writes all
> uncommitted
> writes for the server, not just ones for the file (or file system)
> the
> writeverf is
> replied with.  (I vaguely recall Solaris does the same?)
>
> At the very least, I think you should run this past the IETF working
> group
> ([email protected]) to see what they say w.r.t. the writeverf being
> "per-file" vs
> "per-server".
>

As I recall, we've already had this discussion on the IETF NFSv4
working group mailing list:
https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/


That's why I kept it a global in the first place.

Now note that RFC8881 does also clarify in Section 18.3.3 that:


The server must vary the value of the write
verifier at each server event or instantiation that may lead to a
loss of uncommitted data. Most commonly this occurs when the server
is restarted; however, other events at the server may result in
uncommitted data loss as well.

So I feel it is quite OK to use the verifier the way we do now in order
to signify that a fatal write error has occurred and that clients must
resend any data that was uncommitted.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]



2023-02-14 13:55:24

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <[email protected]>
> > wrote:
> > >
> > > The write verifier exists to tell the client when the server may
> > > have
> > > forgotten some unstable writes. The typical way that this happens
> > > is if
> > > the server crashes, but we've also extended nfsd to change it when
> > > there
> > > are writeback errors as well.
> > >
> > > The way it works today though, we call something like vfs_fsync
> > > (e.g.
> > > for a COMMIT call) and if we get back an error, we'll reset the
> > > write
> > > verifier.
> > >
> > > This is non-optimal for a couple of reasons:
> > >
> > > 1/ There could be significant delay between an error being
> > > recorded and the reset. It would be ideal if the write verifier
> > > were to
> > > change as soon as the error was recorded.
> > >
> > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > single inode, we'll end up resetting the write verifier for
> > > everything,
> > > even on inodes that may be fine (e.g. on a completely separate fs).
> > >
> > Here's the snippet from RFC8881:
> >    The final portion of the result is the field writeverf.  This
> > field
> >    is the write verifier and is a cookie that the client can use to
> >    determine whether a server has changed instance state (e.g.,
> > server
> >    restart) between a call to WRITE and a subsequent call to either
> >    WRITE or COMMIT.  This cookie MUST be unchanged during a single
> >    instance of the NFSv4.1 server and MUST be unique between
> > instances
> >    of the NFSv4.1 server.  If the cookie changes, then the client
> > MUST
> >    assume that any data written with an UNSTABLE4 value for committed
> >    and an old writeverf in the reply has been lost and will need to
> > be
> >    recovered.
> >
> > I've always interpreted the writeverf as "per-server" and not "per-
> > file".
> > Although I'll admit the above does not make that crystal clear, it
> > does make
> > it clear that the writeverf applies to a "server instance" and not a
> > file or
> > file system on the server.
> >
> > The FreeBSD client assumes it is "per-server" and re-writes all
> > uncommitted
> > writes for the server, not just ones for the file (or file system)
> > the
> > writeverf is
> > replied with.  (I vaguely recall Solaris does the same?)
> >
> > At the very least, I think you should run this past the IETF working
> > group
> > ([email protected]) to see what they say w.r.t. the writeverf being
> > "per-file" vs
> > "per-server".
> >
>
> As I recall, we've already had this discussion on the IETF NFSv4
> working group mailing list:
> https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
>
>
> That's why I kept it a global in the first place.
>
> Now note that RFC8881 does also clarify in Section 18.3.3 that:
>
>
> The server must vary the value of the write
> verifier at each server event or instantiation that may lead to a
> loss of uncommitted data. Most commonly this occurs when the server
> is restarted; however, other events at the server may result in
> uncommitted data loss as well.
>
> So I feel it is quite OK to use the verifier the way we do now in order
> to signify that a fatal write error has occurred and that clients must
> resend any data that was uncommitted.
>

Thanks, I missed that discussion. I think you guys have convinced me
that we have to keep this per-server. I won't bother starting a new
thread on it.

It's a pity. It would have been a lot more elegant as a per-inode thing!

Chuck, I think that means we'll just want to keep patch #1 in this
series?

Thanks,
--
Jeff Layton <[email protected]>

2023-02-14 14:59:12

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling



> On Feb 14, 2023, at 8:53 AM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
>> On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
>>> On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <[email protected]>
>>> wrote:
>>>>
>>>> The write verifier exists to tell the client when the server may
>>>> have
>>>> forgotten some unstable writes. The typical way that this happens
>>>> is if
>>>> the server crashes, but we've also extended nfsd to change it when
>>>> there
>>>> are writeback errors as well.
>>>>
>>>> The way it works today though, we call something like vfs_fsync
>>>> (e.g.
>>>> for a COMMIT call) and if we get back an error, we'll reset the
>>>> write
>>>> verifier.
>>>>
>>>> This is non-optimal for a couple of reasons:
>>>>
>>>> 1/ There could be significant delay between an error being
>>>> recorded and the reset. It would be ideal if the write verifier
>>>> were to
>>>> change as soon as the error was recorded.
>>>>
>>>> 2/ It's a bit of a waste, in that if we get a writeback error on a
>>>> single inode, we'll end up resetting the write verifier for
>>>> everything,
>>>> even on inodes that may be fine (e.g. on a completely separate fs).
>>>>
>>> Here's the snippet from RFC8881:
>>> The final portion of the result is the field writeverf. This
>>> field
>>> is the write verifier and is a cookie that the client can use to
>>> determine whether a server has changed instance state (e.g.,
>>> server
>>> restart) between a call to WRITE and a subsequent call to either
>>> WRITE or COMMIT. This cookie MUST be unchanged during a single
>>> instance of the NFSv4.1 server and MUST be unique between
>>> instances
>>> of the NFSv4.1 server. If the cookie changes, then the client
>>> MUST
>>> assume that any data written with an UNSTABLE4 value for committed
>>> and an old writeverf in the reply has been lost and will need to
>>> be
>>> recovered.
>>>
>>> I've always interpreted the writeverf as "per-server" and not "per-
>>> file".
>>> Although I'll admit the above does not make that crystal clear, it
>>> does make
>>> it clear that the writeverf applies to a "server instance" and not a
>>> file or
>>> file system on the server.
>>>
>>> The FreeBSD client assumes it is "per-server" and re-writes all
>>> uncommitted
>>> writes for the server, not just ones for the file (or file system)
>>> the
>>> writeverf is
>>> replied with. (I vaguely recall Solaris does the same?)
>>>
>>> At the very least, I think you should run this past the IETF working
>>> group
>>> ([email protected]) to see what they say w.r.t. the writeverf being
>>> "per-file" vs
>>> "per-server".
>>>
>>
>> As I recall, we've already had this discussion on the IETF NFSv4
>> working group mailing list:
>> https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
>>
>>
>> That's why I kept it a global in the first place.
>>
>> Now note that RFC8881 does also clarify in Section 18.3.3 that:
>>
>>
>> The server must vary the value of the write
>> verifier at each server event or instantiation that may lead to a
>> loss of uncommitted data. Most commonly this occurs when the server
>> is restarted; however, other events at the server may result in
>> uncommitted data loss as well.
>>
>> So I feel it is quite OK to use the verifier the way we do now in order
>> to signify that a fatal write error has occurred and that clients must
>> resend any data that was uncommitted.
>>
>
> Thanks, I missed that discussion. I think you guys have convinced me
> that we have to keep this per-server. I won't bother starting a new
> thread on it.
>
> It's a pity. It would have been a lot more elegant as a per-inode thing!
>
> Chuck, I think that means we'll just want to keep patch #1 in this
> series?

Regarding patch 1/3:

"sizeof(verf)" works as well as "sizeof(*verf) * 2" and is a little
more clear to boot. You can redrive a v2 of your patch or I can make
one. Up to you.


--
Chuck Lever




2023-02-14 15:02:03

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Tue, 2023-02-14 at 14:58 +0000, Chuck Lever III wrote:
>
> > On Feb 14, 2023, at 8:53 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <[email protected]>
> > > > wrote:
> > > > >
> > > > > The write verifier exists to tell the client when the server may
> > > > > have
> > > > > forgotten some unstable writes. The typical way that this happens
> > > > > is if
> > > > > the server crashes, but we've also extended nfsd to change it when
> > > > > there
> > > > > are writeback errors as well.
> > > > >
> > > > > The way it works today though, we call something like vfs_fsync
> > > > > (e.g.
> > > > > for a COMMIT call) and if we get back an error, we'll reset the
> > > > > write
> > > > > verifier.
> > > > >
> > > > > This is non-optimal for a couple of reasons:
> > > > >
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write verifier
> > > > > were to
> > > > > change as soon as the error was recorded.
> > > > >
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > > > single inode, we'll end up resetting the write verifier for
> > > > > everything,
> > > > > even on inodes that may be fine (e.g. on a completely separate fs).
> > > > >
> > > > Here's the snippet from RFC8881:
> > > > The final portion of the result is the field writeverf. This
> > > > field
> > > > is the write verifier and is a cookie that the client can use to
> > > > determine whether a server has changed instance state (e.g.,
> > > > server
> > > > restart) between a call to WRITE and a subsequent call to either
> > > > WRITE or COMMIT. This cookie MUST be unchanged during a single
> > > > instance of the NFSv4.1 server and MUST be unique between
> > > > instances
> > > > of the NFSv4.1 server. If the cookie changes, then the client
> > > > MUST
> > > > assume that any data written with an UNSTABLE4 value for committed
> > > > and an old writeverf in the reply has been lost and will need to
> > > > be
> > > > recovered.
> > > >
> > > > I've always interpreted the writeverf as "per-server" and not "per-
> > > > file".
> > > > Although I'll admit the above does not make that crystal clear, it
> > > > does make
> > > > it clear that the writeverf applies to a "server instance" and not a
> > > > file or
> > > > file system on the server.
> > > >
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted
> > > > writes for the server, not just ones for the file (or file system)
> > > > the
> > > > writeverf is
> > > > replied with. (I vaguely recall Solaris does the same?)
> > > >
> > > > At the very least, I think you should run this past the IETF working
> > > > group
> > > > ([email protected]) to see what they say w.r.t. the writeverf being
> > > > "per-file" vs
> > > > "per-server".
> > > >
> > >
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > >
> > >
> > > That's why I kept it a global in the first place.
> > >
> > > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> > >
> > >
> > > The server must vary the value of the write
> > > verifier at each server event or instantiation that may lead to a
> > > loss of uncommitted data. Most commonly this occurs when the server
> > > is restarted; however, other events at the server may result in
> > > uncommitted data loss as well.
> > >
> > > So I feel it is quite OK to use the verifier the way we do now in order
> > > to signify that a fatal write error has occurred and that clients must
> > > resend any data that was uncommitted.
> > >
> >
> > Thanks, I missed that discussion. I think you guys have convinced me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> >
> > It's a pity. It would have been a lot more elegant as a per-inode thing!
> >
> > Chuck, I think that means we'll just want to keep patch #1 in this
> > series?
>
> Regarding patch 1/3:
>
> "sizeof(verf)" works as well as "sizeof(*verf) * 2" and is a little
> more clear to boot. You can redrive a v2 of your patch or I can make
> one. Up to you.
>

That sounds fine to me. Go for it!
--
Jeff Layton <[email protected]>

2023-02-14 22:57:54

by Rick Macklem

[permalink] [raw]
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <[email protected]> wrote:
>
> On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <[email protected]>
> > > wrote:
> > > >
> > > > The write verifier exists to tell the client when the server may
> > > > have
> > > > forgotten some unstable writes. The typical way that this happens
> > > > is if
> > > > the server crashes, but we've also extended nfsd to change it when
> > > > there
> > > > are writeback errors as well.
> > > >
> > > > The way it works today though, we call something like vfs_fsync
> > > > (e.g.
> > > > for a COMMIT call) and if we get back an error, we'll reset the
> > > > write
> > > > verifier.
> > > >
> > > > This is non-optimal for a couple of reasons:
> > > >
> > > > 1/ There could be significant delay between an error being
> > > > recorded and the reset. It would be ideal if the write verifier
> > > > were to
> > > > change as soon as the error was recorded.
> > > >
> > > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > > single inode, we'll end up resetting the write verifier for
> > > > everything,
> > > > even on inodes that may be fine (e.g. on a completely separate fs).
> > > >
> > > Here's the snippet from RFC8881:
> > > The final portion of the result is the field writeverf. This
> > > field
> > > is the write verifier and is a cookie that the client can use to
> > > determine whether a server has changed instance state (e.g.,
> > > server
> > > restart) between a call to WRITE and a subsequent call to either
> > > WRITE or COMMIT. This cookie MUST be unchanged during a single
> > > instance of the NFSv4.1 server and MUST be unique between
> > > instances
> > > of the NFSv4.1 server. If the cookie changes, then the client
> > > MUST
> > > assume that any data written with an UNSTABLE4 value for committed
> > > and an old writeverf in the reply has been lost and will need to
> > > be
> > > recovered.
> > >
> > > I've always interpreted the writeverf as "per-server" and not "per-
> > > file".
> > > Although I'll admit the above does not make that crystal clear, it
> > > does make
> > > it clear that the writeverf applies to a "server instance" and not a
> > > file or
> > > file system on the server.
> > >
> > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > uncommitted
> > > writes for the server, not just ones for the file (or file system)
> > > the
> > > writeverf is
> > > replied with. (I vaguely recall Solaris does the same?)
> > >
> > > At the very least, I think you should run this past the IETF working
> > > group
> > > ([email protected]) to see what they say w.r.t. the writeverf being
> > > "per-file" vs
> > > "per-server".
> > >
> >
> > As I recall, we've already had this discussion on the IETF NFSv4
> > working group mailing list:
> > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> >
> >
> > That's why I kept it a global in the first place.
> >
> > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> >
> >
> > The server must vary the value of the write
> > verifier at each server event or instantiation that may lead to a
> > loss of uncommitted data. Most commonly this occurs when the server
> > is restarted; however, other events at the server may result in
> > uncommitted data loss as well.
> >
> > So I feel it is quite OK to use the verifier the way we do now in order
> > to signify that a fatal write error has occurred and that clients must
> > resend any data that was uncommitted.
> >
>
> Thanks, I missed that discussion. I think you guys have convinced me
> that we have to keep this per-server. I won't bother starting a new
> thread on it.
>
> It's a pity. It would have been a lot more elegant as a per-inode thing!
>
If you think it is worth the effort, you could propose an extension to
4.2. Something like Write_plus, Commit_plus operations.

rick

> Chuck, I think that means we'll just want to keep patch #1 in this
> series?
>
> Thanks,
> --
> Jeff Layton <[email protected]>

2023-02-14 23:16:22

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Tue, 2023-02-14 at 14:57 -0800, Rick Macklem wrote:
> On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <[email protected]>
> > > > wrote:
> > > > >
> > > > > The write verifier exists to tell the client when the server may
> > > > > have
> > > > > forgotten some unstable writes. The typical way that this happens
> > > > > is if
> > > > > the server crashes, but we've also extended nfsd to change it when
> > > > > there
> > > > > are writeback errors as well.
> > > > >
> > > > > The way it works today though, we call something like vfs_fsync
> > > > > (e.g.
> > > > > for a COMMIT call) and if we get back an error, we'll reset the
> > > > > write
> > > > > verifier.
> > > > >
> > > > > This is non-optimal for a couple of reasons:
> > > > >
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write verifier
> > > > > were to
> > > > > change as soon as the error was recorded.
> > > > >
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > > > single inode, we'll end up resetting the write verifier for
> > > > > everything,
> > > > > even on inodes that may be fine (e.g. on a completely separate fs).
> > > > >
> > > > Here's the snippet from RFC8881:
> > > > The final portion of the result is the field writeverf. This
> > > > field
> > > > is the write verifier and is a cookie that the client can use to
> > > > determine whether a server has changed instance state (e.g.,
> > > > server
> > > > restart) between a call to WRITE and a subsequent call to either
> > > > WRITE or COMMIT. This cookie MUST be unchanged during a single
> > > > instance of the NFSv4.1 server and MUST be unique between
> > > > instances
> > > > of the NFSv4.1 server. If the cookie changes, then the client
> > > > MUST
> > > > assume that any data written with an UNSTABLE4 value for committed
> > > > and an old writeverf in the reply has been lost and will need to
> > > > be
> > > > recovered.
> > > >
> > > > I've always interpreted the writeverf as "per-server" and not "per-
> > > > file".
> > > > Although I'll admit the above does not make that crystal clear, it
> > > > does make
> > > > it clear that the writeverf applies to a "server instance" and not a
> > > > file or
> > > > file system on the server.
> > > >
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted
> > > > writes for the server, not just ones for the file (or file system)
> > > > the
> > > > writeverf is
> > > > replied with. (I vaguely recall Solaris does the same?)
> > > >
> > > > At the very least, I think you should run this past the IETF working
> > > > group
> > > > ([email protected]) to see what they say w.r.t. the writeverf being
> > > > "per-file" vs
> > > > "per-server".
> > > >
> > >
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > >
> > >
> > > That's why I kept it a global in the first place.
> > >
> > > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> > >
> > >
> > > The server must vary the value of the write
> > > verifier at each server event or instantiation that may lead to a
> > > loss of uncommitted data. Most commonly this occurs when the server
> > > is restarted; however, other events at the server may result in
> > > uncommitted data loss as well.
> > >
> > > So I feel it is quite OK to use the verifier the way we do now in order
> > > to signify that a fatal write error has occurred and that clients must
> > > resend any data that was uncommitted.
> > >
> >
> > Thanks, I missed that discussion. I think you guys have convinced me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> >
> > It's a pity. It would have been a lot more elegant as a per-inode thing!
> >
> If you think it is worth the effort, you could propose an extension to
> 4.2. Something like Write_plus, Commit_plus operations.
>

I considered that, but I don't think it really helps. We'd have to bump
the verifier on a per-server basis anyway to maintain backward
compatibility. I think we're stuck unless we want to make a break with
the past.

>
> > Chuck, I think that means we'll just want to keep patch #1 in this
> > series?
> >
> > Thanks,
> > --
> > Jeff Layton <[email protected]>

--
Jeff Layton <[email protected]>

2023-02-14 23:35:15

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Tue, 2023-02-14 at 14:57 -0800, Rick Macklem wrote:
> On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <[email protected]>
> wrote:
> >
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton
> > > > <[email protected]>
> > > > wrote:
> > > > >
> > > > > The write verifier exists to tell the client when the server
> > > > > may
> > > > > have
> > > > > forgotten some unstable writes. The typical way that this
> > > > > happens
> > > > > is if
> > > > > the server crashes, but we've also extended nfsd to change it
> > > > > when
> > > > > there
> > > > > are writeback errors as well.
> > > > >
> > > > > The way it works today though, we call something like
> > > > > vfs_fsync
> > > > > (e.g.
> > > > > for a COMMIT call) and if we get back an error, we'll reset
> > > > > the
> > > > > write
> > > > > verifier.
> > > > >
> > > > > This is non-optimal for a couple of reasons:
> > > > >
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write
> > > > > verifier
> > > > > were to
> > > > > change as soon as the error was recorded.
> > > > >
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error
> > > > > on a
> > > > > single inode, we'll end up resetting the write verifier for
> > > > > everything,
> > > > > even on inodes that may be fine (e.g. on a completely
> > > > > separate fs).
> > > > >
> > > > Here's the snippet from RFC8881:
> > > >    The final portion of the result is the field writeverf. 
> > > > This
> > > > field
> > > >    is the write verifier and is a cookie that the client can
> > > > use to
> > > >    determine whether a server has changed instance state (e.g.,
> > > > server
> > > >    restart) between a call to WRITE and a subsequent call to
> > > > either
> > > >    WRITE or COMMIT.  This cookie MUST be unchanged during a
> > > > single
> > > >    instance of the NFSv4.1 server and MUST be unique between
> > > > instances
> > > >    of the NFSv4.1 server.  If the cookie changes, then the
> > > > client
> > > > MUST
> > > >    assume that any data written with an UNSTABLE4 value for
> > > > committed
> > > >    and an old writeverf in the reply has been lost and will
> > > > need to
> > > > be
> > > >    recovered.
> > > >
> > > > I've always interpreted the writeverf as "per-server" and not 
> > > > "per-
> > > > file".
> > > > Although I'll admit the above does not make that crystal clear,
> > > > it
> > > > does make
> > > > it clear that the writeverf applies to a "server instance" and
> > > > not a
> > > > file or
> > > > file system on the server.
> > > >
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted
> > > > writes for the server, not just ones for the file (or file
> > > > system)
> > > > the
> > > > writeverf is
> > > > replied with.  (I vaguely recall Solaris does the same?)
> > > >
> > > > At the very least, I think you should run this past the IETF
> > > > working
> > > > group
> > > > ([email protected]) to see what they say w.r.t. the writeverf
> > > > being
> > > > "per-file" vs
> > > > "per-server".
> > > >
> > >
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > >
> > >
> > > That's why I kept it a global in the first place.
> > >
> > > Now note that RFC 8881 does also clarify in Section 18.32.3 that:
> > >
> > >    The server must vary the value of the write verifier at each
> > >    server event or instantiation that may lead to a loss of
> > >    uncommitted data.  Most commonly this occurs when the server is
> > >    restarted; however, other events at the server may result in
> > >    uncommitted data loss as well.
> > >
> > > So I feel it is quite OK to use the verifier the way we do now in
> > > order to signify that a fatal write error has occurred and that
> > > clients must resend any data that was uncommitted.
> > >
> >
> > Thanks, I missed that discussion. I think you guys have convinced
> > me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> >
> > It's a pity. It would have been a lot more elegant as a per-inode
> > thing!
> >
> If you think it is worth the effort, you could propose an extension to
> 4.2. Something like Write_plus, Commit_plus operations.
>
> rick
>
OK. Apparently some further clarification is required here.

The main reason for needing to bump the boot verifier on an error is
NFSv3. Unlike NFSv4, the NFSv3 clients are stateless, and so error
propagation using Jeff's "errseq" mechanism is inherently flawed
because it is ultimately caching the I/O errors in a file descriptor.
The fact that the file cache garbage collector can close these NFSv3
file descriptors at any time, without any possibility of coordination
with the clients, causing loss of the "errseq" cached data, means that
we must have an alternative mechanism to propagate error state for
unstable writes. Hence the use of the boot verifier.

NFSv4 does not have these limitations. The clients are required to use
stateful OPEN/CLOSE/DELEGRETURN/... operations in order to signify to
the NFSv4 server when file descriptors need to be kept open, and hence
"errseq" data needs to be preserved. The only cases where that
assumption is violated are that of a network partition and server
reboot, where it is obvious to all parties that information has been
lost.
Note: The Linux NFSv4 client does not currently assume that information
might be lost during network partition, so we might want to look into
fixing that. However it does obviously recover safely during a server
reboot thanks to the boot verifier mechanism.

IOW: there should be no need for a change in NFSv4 semantics in order
to address this issue. The problem is, as usual, mostly limited to
NFSv3.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]