Hi everyone,
I'm working with several Red Hat derived systems and have noticed an
issue with ENOSPC and NFS that I'm looking for some guidance on.
First let me describe the testing setup, and then I'll share my
results from an EL7 based system (kernel 3.10.0-1160.90.1.el7), an EL8
based system (kernel 4.18.0-425.19.2.el8_7), an EL8 based system
patched with commit e6005436f6cc9ed13288f936903f0151e5543485 (kernel
4.18.0-425.19.2.el8_7 plus that commit), and finally an EL8 based
system but with an upstream 6.1 kernel.
Assume I have a 20M quota on my current working directory which is an
NFS export from one of the major enterprise vendors.
The testing looks like the following:
# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc
On an EL7 based system, running the above works just as shown. I.e.
you create file1, then use all the quota with file2, attempt to write
to file1 which fails with ENOSPC (as expected), remove file2 (which
frees up the quota), and then attempt to write to file1 again which
succeeds.
However, on a stock EL8 based system, I instead get the following
surprising behavior:
# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
I.e. Even after freeing the space by removing file2, writing to file1
continues to fail with ENOSPC forever (I've only shown 2 iterations
above) [1]. No amount of waiting will cause it to go away. But, we've
found that running sync(1) on the file will fix it (the sync itself
will complain with ENOSPC, but then subsequent tee invocatinos
succeed).
I thought that perhaps the issue was the fact that kernel
4.18.0-425.19.2.el8_7 was missing commit
e6005436f6cc9ed13288f936903f0151e5543485 (which adds some ENOSPC
handling to `nfs_file_write'), so we patched the kernel with that
patch and tested again. It's worth saying that with this patch, the
behavior of our 4.18 kernel and the 6.1 kernel are consistent when
running this test, but I feel like there might still be a bug here.
What I get now is:
# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# tee -a file1 <<< abc
abc
I.e. The first attempt to write to the file after freeing the quota
fails with ENOSPC, but subsequent attempts succeed. Note that this is
different from the original behavior on our EL7 based system as shown
above where as soon as the quota is freed up, there are no more ENOSPC
errors.
I'm no expert, but below I'm including some digging I did in case it's
helpful for understanding the situation more fully without needing to
reproduce yourselves. If it's not helpful and just distracting,
apologies in advance!
From strace'ing and systemtap'ing I noticed that the first call to
`tee' (after the quota is used up by file2) returns the ENOSPC in
response to close(2) (i.e. via `nfs_file_flush') and the second call
to `tee' (after the quota has been freed) returns the ENOSPC in
response to the write(2) (i.e. via `nfs_file_write' , and then clears
the error via the changes we introduced with commit
e6005436f6cc9ed13288f936903f0151e5543485).
Here is the output of [2], run while reproducing (the comments and
spacing are mine just to more easily be able to tell things apart):
# stap -vv --vp 10101 /tmp/t.stp
...
# This section is the first tee invocation when the quota is fully
consumed by file2
1686164924543967263: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").call
f_wb_err: wb_err: flags:0x0
1686164924543983309: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").return
f_wb_err: wb_err: flags:0x0
1686164924543986715: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").call
f_wb_err: wb_err: flags:0x0
1686164924545701586: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").return
f_wb_err: wb_err:ENOSPC flags:0x0
# This section is the second tee invocation when the quota has been freed
1686164933146193798: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").call
f_wb_err: wb_err:ENOSPC flags:0x0
1686164933147496167: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").return
f_wb_err: wb_err: flags:0x0
1686164933147623303: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").call
f_wb_err: wb_err: flags:0x0
1686164933147627755: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").return
f_wb_err: wb_err: flags:0x0
# This section is the third tee invocation that succeeds with no errors
1686164937358310484: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").call
f_wb_err: wb_err: flags:0x0
1686164937358323369: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").return
f_wb_err: wb_err: flags:0x0
1686164937358326649: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").call
f_wb_err: wb_err: flags:0x0
1686164937359877744: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").return
f_wb_err: wb_err: flags:0x0
I barely know what I'm looking at, but I wondered whether
`nfs_file_flush' should be resetting the error when it returns by
calling `file_check_and_advance_wb_err' instead of
`filemap_check_wb_err'. Given that it's reported the error to user
space, is that sufficient to clear the error? Or, does it need to keep
the error because there are dirty pages that haven't been written back
yet?
I guess what I'm asking is, is the above behavior we've observed on
EL8 patched with e6005436f6cc9ed13288f936903f0151e5543485 and
separately with 6.1 the expected behavior? Or should it behave like it
did in EL7?
Any insight or direction would be greatly appreciated!
Chris
[1]: It's not actually the write(2) that is giving the ENOSPC, it's
the close(2) and in fact the bytes successfully wind up in the file
even though the command appears to fail.
[2]: A systemtap script I threw together to look at fields I thought
might come into play:
function dump(file:long) {
printf("%d: %s %70s f_wb_err:%6s wb_err:%6s flags:%#x\n",
gettimeofday_ns(),
execname(),
pp(),
errno_str(@cast(file, "struct file")->f_wb_err),
errno_str(@cast(file, "struct file")->f_mapping->wb_err),
@cast(file, "struct file")->f_mapping->flags);
}
probe module("nfs").function("nfs_file_write").call {
if (!isinstr(execname(), "tee"))
next
dump(@cast($iocb, "struct kiocb")->ki_filp);
}
probe module("nfs").function("nfs_file_write").return {
if (!isinstr(execname(), "tee"))
next
dump(@cast(@entry($iocb), "struct kiocb")->ki_filp);
}
probe module("nfs").function("nfs_file_flush").call {
if (!isinstr(execname(), "tee"))
next
dump($file);
}
probe module("nfs").function("nfs_file_flush").return {
if (!isinstr(execname(), "tee"))
next
dump(@entry($file));
}
On Thu, 2023-06-08 at 13:05 -0400, Chris Perl wrote:
> Hi everyone,
>
> I'm working with several Red Hat derived systems and have noticed an
> issue with ENOSPC and NFS that I'm looking for some guidance on.
>
> First let me describe the testing setup, and then I'll share my
> results from an EL7 based system (kernel 3.10.0-1160.90.1.el7), an EL8
> based system (kernel 4.18.0-425.19.2.el8_7), an EL8 based system
> patched with commit e6005436f6cc9ed13288f936903f0151e5543485 (kernel
> 4.18.0-425.19.2.el8_7 plus that commit), and finally an EL8 based
> system but with an upstream 6.1 kernel.
>
> Assume I have a 20M quota on my current working directory which is an
> NFS export from one of the major enterprise vendors.
>
> The testing looks like the following:
>
> # rm -f file1
> # touch file1
> # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> 20+0 records in
> 20+0 records out
> 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> # tee -a file1 <<< abc
> abc
> tee: file1: No space left on device
> # rm -f file2
> # tee -a file1 <<< abc
> abc
>
> On an EL7 based system, running the above works just as shown. I.e.
> you create file1, then use all the quota with file2, attempt to write
> to file1 which fails with ENOSPC (as expected), remove file2 (which
> frees up the quota), and then attempt to write to file1 again which
> succeeds.
>
> However, on a stock EL8 based system, I instead get the following
> surprising behavior:
>
> # rm -f file1
> # touch file1
> # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> 20+0 records in
> 20+0 records out
> 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> # tee -a file1 <<< abc
> abc
> tee: file1: No space left on device
> # rm -f file2
> # tee -a file1 <<< abc
> abc
> tee: file1: No space left on device
> # tee -a file1 <<< abc
> abc
> tee: file1: No space left on device
>
> I.e. Even after freeing the space by removing file2, writing to file1
> continues to fail with ENOSPC forever (I've only shown 2 iterations
> above) [1]. No amount of waiting will cause it to go away. But, we've
> found that running sync(1) on the file will fix it (the sync itself
> will complain with ENOSPC, but then subsequent tee invocatinos
> succeed).
>
> I thought that perhaps the issue was the fact that kernel
> 4.18.0-425.19.2.el8_7 was missing commit
> e6005436f6cc9ed13288f936903f0151e5543485 (which adds some ENOSPC
> handling to `nfs_file_write'), so we patched the kernel with that
> patch and tested again. It's worth saying that with this patch, the
> behavior of our 4.18 kernel and the 6.1 kernel are consistent when
> running this test, but I feel like there might still be a bug here.
>
> What I get now is:
>
> # rm -f file1
> # touch file1
> # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> 20+0 records in
> 20+0 records out
> 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> # tee -a file1 <<< abc
> abc
> tee: file1: No space left on device
> # rm -f file2
> # tee -a file1 <<< abc
> abc
> tee: file1: No space left on device
> # tee -a file1 <<< abc
> abc
>
> I.e. The first attempt to write to the file after freeing the quota
> fails with ENOSPC, but subsequent attempts succeed. Note that this is
> different from the original behavior on our EL7 based system as shown
> above where as soon as the quota is freed up, there are no more ENOSPC
> errors.
>
> I'm no expert, but below I'm including some digging I did in case it's
> helpful for understanding the situation more fully without needing to
> reproduce yourselves. If it's not helpful and just distracting,
> apologies in advance!
>
> From strace'ing and systemtap'ing I noticed that the first call to
> `tee' (after the quota is used up by file2) returns the ENOSPC in
> response to close(2) (i.e. via `nfs_file_flush') and the second call
That is (unfortunately) expected behavior. I've argued (mostly
unsuccessfully) for years that we shouldn't return writeback errors in
the close() codepath.
No program should rely on looking for those. The only "legit" error on
close() is -EBADF.
> to `tee' (after the quota has been freed) returns the ENOSPC in
> response to the write(2) (i.e. via `nfs_file_write' , and then clears
> the error via the changes we introduced with commit
> e6005436f6cc9ed13288f936903f0151e5543485).
>
Looking at nfs_file_write, it's already tracking errors itself during
the write. Does this patch fix that? Note that I've not tested this --
YMMV!
----------------------8------------------------
[RFC PATCH] nfs: ignore the error from generic_write_sync
In the write codepath, we're only interested in writeback errors that
occur after the point where the write has started. It's possible though
that there were previous errors stored in the mapping before the write
ever began, in which case generic_write_sync will return error.
We already track errors over the part we're interested in, so we can
safely discard errors from generic_write_sync.
Reported-by: Chris Perl <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/file.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index f0edf5a36237..3ca1ffb1245e 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -673,10 +673,14 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
iocb->ki_pos - written,
iocb->ki_pos - 1);
}
- result = generic_write_sync(iocb, written);
- if (result < 0)
- return result;
+ /*
+ * For a write, we're only interested in errors that occur
+ * after the point where we sample the wb_error. Ignore
+ * errors from generic_write_sync, which may have occurred
+ * before that point.
+ */
+ generic_write_sync(iocb, written);
out:
/* Return error values */
error = filemap_check_wb_err(file->f_mapping, since);
--
2.40.1
On Thu, Jun 8, 2023 at 4:50 PM Jeff Layton <[email protected]> wrote:
>
> On Thu, 2023-06-08 at 13:05 -0400, Chris Perl wrote:
> > Hi everyone,
> >
> > I'm working with several Red Hat derived systems and have noticed an
> > issue with ENOSPC and NFS that I'm looking for some guidance on.
> >
> > First let me describe the testing setup, and then I'll share my
> > results from an EL7 based system (kernel 3.10.0-1160.90.1.el7), an EL8
> > based system (kernel 4.18.0-425.19.2.el8_7), an EL8 based system
> > patched with commit e6005436f6cc9ed13288f936903f0151e5543485 (kernel
> > 4.18.0-425.19.2.el8_7 plus that commit), and finally an EL8 based
> > system but with an upstream 6.1 kernel.
> >
> > Assume I have a 20M quota on my current working directory which is an
> > NFS export from one of the major enterprise vendors.
> >
> > The testing looks like the following:
> >
> > # rm -f file1
> > # touch file1
> > # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> > 20+0 records in
> > 20+0 records out
> > 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> > # tee -a file1 <<< abc
> > abc
> > tee: file1: No space left on device
> > # rm -f file2
> > # tee -a file1 <<< abc
> > abc
> >
> > On an EL7 based system, running the above works just as shown. I.e.
> > you create file1, then use all the quota with file2, attempt to write
> > to file1 which fails with ENOSPC (as expected), remove file2 (which
> > frees up the quota), and then attempt to write to file1 again which
> > succeeds.
> >
> > However, on a stock EL8 based system, I instead get the following
> > surprising behavior:
> >
> > # rm -f file1
> > # touch file1
> > # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> > 20+0 records in
> > 20+0 records out
> > 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> > # tee -a file1 <<< abc
> > abc
> > tee: file1: No space left on device
> > # rm -f file2
> > # tee -a file1 <<< abc
> > abc
> > tee: file1: No space left on device
> > # tee -a file1 <<< abc
> > abc
> > tee: file1: No space left on device
> >
> > I.e. Even after freeing the space by removing file2, writing to file1
> > continues to fail with ENOSPC forever (I've only shown 2 iterations
> > above) [1]. No amount of waiting will cause it to go away. But, we've
> > found that running sync(1) on the file will fix it (the sync itself
> > will complain with ENOSPC, but then subsequent tee invocatinos
> > succeed).
> >
> > I thought that perhaps the issue was the fact that kernel
> > 4.18.0-425.19.2.el8_7 was missing commit
> > e6005436f6cc9ed13288f936903f0151e5543485 (which adds some ENOSPC
> > handling to `nfs_file_write'), so we patched the kernel with that
> > patch and tested again. It's worth saying that with this patch, the
> > behavior of our 4.18 kernel and the 6.1 kernel are consistent when
> > running this test, but I feel like there might still be a bug here.
> >
> > What I get now is:
> >
> > # rm -f file1
> > # touch file1
> > # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> > 20+0 records in
> > 20+0 records out
> > 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> > # tee -a file1 <<< abc
> > abc
> > tee: file1: No space left on device
> > # rm -f file2
> > # tee -a file1 <<< abc
> > abc
> > tee: file1: No space left on device
> > # tee -a file1 <<< abc
> > abc
> >
> > I.e. The first attempt to write to the file after freeing the quota
> > fails with ENOSPC, but subsequent attempts succeed. Note that this is
> > different from the original behavior on our EL7 based system as shown
> > above where as soon as the quota is freed up, there are no more ENOSPC
> > errors.
> >
> > I'm no expert, but below I'm including some digging I did in case it's
> > helpful for understanding the situation more fully without needing to
> > reproduce yourselves. If it's not helpful and just distracting,
> > apologies in advance!
> >
> > From strace'ing and systemtap'ing I noticed that the first call to
> > `tee' (after the quota is used up by file2) returns the ENOSPC in
> > response to close(2) (i.e. via `nfs_file_flush') and the second call
>
> That is (unfortunately) expected behavior. I've argued (mostly
> unsuccessfully) for years that we shouldn't return writeback errors in
> the close() codepath.
>
> No program should rely on looking for those. The only "legit" error on
> close() is -EBADF.
>
> > to `tee' (after the quota has been freed) returns the ENOSPC in
> > response to the write(2) (i.e. via `nfs_file_write' , and then clears
> > the error via the changes we introduced with commit
> > e6005436f6cc9ed13288f936903f0151e5543485).
> >
>
> Looking at nfs_file_write, it's already tracking errors itself during
> the write. Does this patch fix that? Note that I've not tested this --
> YMMV!
Unfortunately that patch doesn't seem to help.
Since we applied commit e6005436f6cc9ed13288f936903f0151e5543485 and
it seemed to improve the situation (from an unbounded number of ENOSPC
errors to only one additional ENOSPC error), I believe that implies
the error we're seeing is coming from `filemap_check_wb_err', not
`generic_write_sync' in `nfs_file_write'.
I'll do some more tracing and see if I can narrow it down a bit more.
> ----------------------8------------------------
>
> [RFC PATCH] nfs: ignore the error from generic_write_sync
>
> In the write codepath, we're only interested in writeback errors that
> occur after the point where the write has started. It's possible though
> that there were previous errors stored in the mapping before the write
> ever began, in which case generic_write_sync will return error.
>
> We already track errors over the part we're interested in, so we can
> safely discard errors from generic_write_sync.
>
> Reported-by: Chris Perl <[email protected]>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfs/file.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index f0edf5a36237..3ca1ffb1245e 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -673,10 +673,14 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
> iocb->ki_pos - written,
> iocb->ki_pos - 1);
> }
> - result = generic_write_sync(iocb, written);
> - if (result < 0)
> - return result;
>
> + /*
> + * For a write, we're only interested in errors that occur
> + * after the point where we sample the wb_error. Ignore
> + * errors from generic_write_sync, which may have occurred
> + * before that point.
> + */
> + generic_write_sync(iocb, written);
> out:
> /* Return error values */
> error = filemap_check_wb_err(file->f_mapping, since);
> --
> 2.40.1
>
>
On Mon, 2023-06-12 at 10:27 -0400, Chris Perl wrote:
> On Thu, Jun 8, 2023 at 4:50 PM Jeff Layton <[email protected]> wrote:
> >
> > On Thu, 2023-06-08 at 13:05 -0400, Chris Perl wrote:
> > > Hi everyone,
> > >
> > > I'm working with several Red Hat derived systems and have noticed an
> > > issue with ENOSPC and NFS that I'm looking for some guidance on.
> > >
> > > First let me describe the testing setup, and then I'll share my
> > > results from an EL7 based system (kernel 3.10.0-1160.90.1.el7), an EL8
> > > based system (kernel 4.18.0-425.19.2.el8_7), an EL8 based system
> > > patched with commit e6005436f6cc9ed13288f936903f0151e5543485 (kernel
> > > 4.18.0-425.19.2.el8_7 plus that commit), and finally an EL8 based
> > > system but with an upstream 6.1 kernel.
> > >
> > > Assume I have a 20M quota on my current working directory which is an
> > > NFS export from one of the major enterprise vendors.
> > >
> > > The testing looks like the following:
> > >
> > > # rm -f file1
> > > # touch file1
> > > # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> > > 20+0 records in
> > > 20+0 records out
> > > 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> > > # tee -a file1 <<< abc
> > > abc
> > > tee: file1: No space left on device
> > > # rm -f file2
> > > # tee -a file1 <<< abc
> > > abc
> > >
> > > On an EL7 based system, running the above works just as shown. I.e.
> > > you create file1, then use all the quota with file2, attempt to write
> > > to file1 which fails with ENOSPC (as expected), remove file2 (which
> > > frees up the quota), and then attempt to write to file1 again which
> > > succeeds.
> > >
> > > However, on a stock EL8 based system, I instead get the following
> > > surprising behavior:
> > >
> > > # rm -f file1
> > > # touch file1
> > > # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> > > 20+0 records in
> > > 20+0 records out
> > > 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> > > # tee -a file1 <<< abc
> > > abc
> > > tee: file1: No space left on device
> > > # rm -f file2
> > > # tee -a file1 <<< abc
> > > abc
> > > tee: file1: No space left on device
> > > # tee -a file1 <<< abc
> > > abc
> > > tee: file1: No space left on device
> > >
> > > I.e. Even after freeing the space by removing file2, writing to file1
> > > continues to fail with ENOSPC forever (I've only shown 2 iterations
> > > above) [1]. No amount of waiting will cause it to go away. But, we've
> > > found that running sync(1) on the file will fix it (the sync itself
> > > will complain with ENOSPC, but then subsequent tee invocatinos
> > > succeed).
> > >
> > > I thought that perhaps the issue was the fact that kernel
> > > 4.18.0-425.19.2.el8_7 was missing commit
> > > e6005436f6cc9ed13288f936903f0151e5543485 (which adds some ENOSPC
> > > handling to `nfs_file_write'), so we patched the kernel with that
> > > patch and tested again. It's worth saying that with this patch, the
> > > behavior of our 4.18 kernel and the 6.1 kernel are consistent when
> > > running this test, but I feel like there might still be a bug here.
> > >
> > > What I get now is:
> > >
> > > # rm -f file1
> > > # touch file1
> > > # dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
> > > 20+0 records in
> > > 20+0 records out
> > > 20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
> > > # tee -a file1 <<< abc
> > > abc
> > > tee: file1: No space left on device
> > > # rm -f file2
> > > # tee -a file1 <<< abc
> > > abc
> > > tee: file1: No space left on device
> > > # tee -a file1 <<< abc
> > > abc
> > >
> > > I.e. The first attempt to write to the file after freeing the quota
> > > fails with ENOSPC, but subsequent attempts succeed. Note that this is
> > > different from the original behavior on our EL7 based system as shown
> > > above where as soon as the quota is freed up, there are no more ENOSPC
> > > errors.
> > >
> > > I'm no expert, but below I'm including some digging I did in case it's
> > > helpful for understanding the situation more fully without needing to
> > > reproduce yourselves. If it's not helpful and just distracting,
> > > apologies in advance!
> > >
> > > From strace'ing and systemtap'ing I noticed that the first call to
> > > `tee' (after the quota is used up by file2) returns the ENOSPC in
> > > response to close(2) (i.e. via `nfs_file_flush') and the second call
> >
> > That is (unfortunately) expected behavior. I've argued (mostly
> > unsuccessfully) for years that we shouldn't return writeback errors in
> > the close() codepath.
> >
> > No program should rely on looking for those. The only "legit" error on
> > close() is -EBADF.
> >
> > > to `tee' (after the quota has been freed) returns the ENOSPC in
> > > response to the write(2) (i.e. via `nfs_file_write' , and then clears
> > > the error via the changes we introduced with commit
> > > e6005436f6cc9ed13288f936903f0151e5543485).
> > >
> >
> > Looking at nfs_file_write, it's already tracking errors itself during
> > the write. Does this patch fix that? Note that I've not tested this --
> > YMMV!
>
> Unfortunately that patch doesn't seem to help.
>
> Since we applied commit e6005436f6cc9ed13288f936903f0151e5543485 and
> it seemed to improve the situation (from an unbounded number of ENOSPC
> errors to only one additional ENOSPC error), I believe that implies
> the error we're seeing is coming from `filemap_check_wb_err', not
> `generic_write_sync' in `nfs_file_write'.
>
> I'll do some more tracing and see if I can narrow it down a bit more.
Got it: I think I see what's happening. filemap_sample_wb_err just calls
errseq_sample, which does this:
errseq_t errseq_sample(errseq_t *eseq)
{
errseq_t old = READ_ONCE(*eseq);
/* If nobody has seen this error yet, then we can be the first. */
if (!(old & ERRSEQ_SEEN))
old = 0;
return old;
}
Because no one has seen that error yet (ERRSEQ_SEEN is clear), the write
ends up being the first to see it and it gets back a 0, even though the
error happened before the sample.
The above behavior is what we want for the sample that we do at open()
time, but not what's needed for this use-case. We need a new helper that
samples the value regardless of whether it has already been seen:
errseq_t errseq_peek(errseq_t *eseq)
{
return READ_ONCE(*eseq);
}
...but we'll also need to fix up errseq_check to handle differences
between the SEEN bit.
I'll see if I can spin up a patch for that. Stay tuned.
--
Jeff Layton <[email protected]>
On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
>
> Got it: I think I see what's happening. filemap_sample_wb_err just calls
> errseq_sample, which does this:
>
> errseq_t errseq_sample(errseq_t *eseq)
> {
> errseq_t old = READ_ONCE(*eseq);
>
> /* If nobody has seen this error yet, then we can be the first. */
> if (!(old & ERRSEQ_SEEN))
> old = 0;
> return old;
> }
>
> Because no one has seen that error yet (ERRSEQ_SEEN is clear), the write
> ends up being the first to see it and it gets back a 0, even though the
> error happened before the sample.
>
> The above behavior is what we want for the sample that we do at open()
> time, but not what's needed for this use-case. We need a new helper that
> samples the value regardless of whether it has already been seen:
>
> errseq_t errseq_peek(errseq_t *eseq)
> {
> return READ_ONCE(*eseq);
> }
>
> ...but we'll also need to fix up errseq_check to handle differences
> between the SEEN bit.
>
> I'll see if I can spin up a patch for that. Stay tuned.
This may not be fixable with the way that NFS is trying to use errseq_t.
The fundamental problem is that we need to mark the errseq_t in the
mapping as SEEN when we sample it, to ensure that a later error is
recorded and not ignored.
But...if the error hasn't been reported yet and we mark it SEEN here,
and then a later error doesn't occur, then a later open won't have its
errseq_t set to 0, and that unseen error could be lost.
It's a bit of a pity: as originally envisioned, the errseq_t mechanism
would provide for this sort of use case, but we added this patch not
long after the original code went in, and it changed those semantics:
b4678df184b3 errseq: Always report a writeback error once
I don't see a good way to do this using the current errseq_t mechanism,
given these competing needs. I'll keep thinking about it though. Maybe
we could add some sort of store and forward mechanism for fsync on NFS?
That could get rather complex though.
Cheers,
--
Jeff Layton <[email protected]>
On Mon, Jun 12, 2023 at 1:30 PM Jeff Layton <[email protected]> wrote:
>
> On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
>
> >
> > Got it: I think I see what's happening. filemap_sample_wb_err just calls
> > errseq_sample, which does this:
> >
> > errseq_t errseq_sample(errseq_t *eseq)
> > {
> > errseq_t old = READ_ONCE(*eseq);
> >
> > /* If nobody has seen this error yet, then we can be the first. */
> > if (!(old & ERRSEQ_SEEN))
> > old = 0;
> > return old;
> > }
> >
> > Because no one has seen that error yet (ERRSEQ_SEEN is clear), the write
> > ends up being the first to see it and it gets back a 0, even though the
> > error happened before the sample.
> >
> > The above behavior is what we want for the sample that we do at open()
> > time, but not what's needed for this use-case. We need a new helper that
> > samples the value regardless of whether it has already been seen:
> >
> > errseq_t errseq_peek(errseq_t *eseq)
> > {
> > return READ_ONCE(*eseq);
> > }
> >
> > ...but we'll also need to fix up errseq_check to handle differences
> > between the SEEN bit.
> >
> > I'll see if I can spin up a patch for that. Stay tuned.
>
> This may not be fixable with the way that NFS is trying to use errseq_t.
>
> The fundamental problem is that we need to mark the errseq_t in the
> mapping as SEEN when we sample it, to ensure that a later error is
> recorded and not ignored.
>
> But...if the error hasn't been reported yet and we mark it SEEN here,
> and then a later error doesn't occur, then a later open won't have its
> errseq_t set to 0, and that unseen error could be lost.
>
> It's a bit of a pity: as originally envisioned, the errseq_t mechanism
> would provide for this sort of use case, but we added this patch not
> long after the original code went in, and it changed those semantics:
>
> b4678df184b3 errseq: Always report a writeback error once
>
> I don't see a good way to do this using the current errseq_t mechanism,
> given these competing needs. I'll keep thinking about it though. Maybe
> we could add some sort of store and forward mechanism for fsync on NFS?
> That could get rather complex though.
Can/should it be marked SEEN when the initial close(2) from tee(1)
reports the error?
Part of the reason I had originally asked about `nfs_file_flush' (i.e.
what close(2) calls) using `file_check_and_advance_wb_err' instead of
`filemap_check_wb_err' was because I was drawn to comparing
`nfs_file_flush' against `nfs_file_fsync' as it seems like in the 3.10
based EL7 kernels, the former used to delegate to the latter (by way
of `vfs_fsync') and so they had consistent behavior, whereas now they
do not.
On Mon, 2023-06-12 at 13:30 -0400, Jeff Layton wrote:
> On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
>
> >
> > Got it: I think I see what's happening. filemap_sample_wb_err just
> > calls
> > errseq_sample, which does this:
> >
> > errseq_t errseq_sample(errseq_t
> > *eseq)
> > {
> >
> > errseq_t old =
> > READ_ONCE(*eseq);
> >
> >
> >
> > /* If nobody has seen this error yet, then we can be the
> > first. */
> > if (!(old &
> > ERRSEQ_SEEN))
> >
> > old =
> > 0;
> >
> > return
> > old;
> >
> > }
> >
> > Because no one has seen that error yet (ERRSEQ_SEEN is clear), the
> > write
> > ends up being the first to see it and it gets back a 0, even though
> > the
> > error happened before the sample.
> >
> > The above behavior is what we want for the sample that we do at
> > open()
> > time, but not what's needed for this use-case. We need a new helper
> > that
> > samples the value regardless of whether it has already been seen:
> >
> > errseq_t errseq_peek(errseq_t *eseq)
> > {
> > return READ_ONCE(*eseq);
> > }
> >
> > ...but we'll also need to fix up errseq_check to handle differences
> > between the SEEN bit.
> >
> > I'll see if I can spin up a patch for that. Stay tuned.
>
> This may not be fixable with the way that NFS is trying to use
> errseq_t.
>
> The fundamental problem is that we need to mark the errseq_t in the
> mapping as SEEN when we sample it, to ensure that a later error is
> recorded and not ignored.
>
> But...if the error hasn't been reported yet and we mark it SEEN here,
> and then a later error doesn't occur, then a later open won't have
> its
> errseq_t set to 0, and that unseen error could be lost.
>
> It's a bit of a pity: as originally envisioned, the errseq_t
> mechanism
> would provide for this sort of use case, but we added this patch not
> long after the original code went in, and it changed those semantics:
>
> b4678df184b3 errseq: Always report a writeback error once
>
> I don't see a good way to do this using the current errseq_t
> mechanism,
> given these competing needs. I'll keep thinking about it though.
> Maybe
> we could add some sort of store and forward mechanism for fsync on
> NFS?
> That could get rather complex though.
>
> Cheers,
Does RHEL-8 have commit 6c984083ec24, 064109db53ec, d95b26650e86,
e6005436f6cc, 9641d9bc9b75, and cea9ba7239dc applied?
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Mon, 2023-06-12 at 13:49 -0400, Chris Perl wrote:
> On Mon, Jun 12, 2023 at 1:30 PM Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
> >
> > >
> > > Got it: I think I see what's happening. filemap_sample_wb_err just calls
> > > errseq_sample, which does this:
> > >
> > > errseq_t errseq_sample(errseq_t *eseq)
> > > {
> > > errseq_t old = READ_ONCE(*eseq);
> > >
> > > /* If nobody has seen this error yet, then we can be the first. */
> > > if (!(old & ERRSEQ_SEEN))
> > > old = 0;
> > > return old;
> > > }
> > >
> > > Because no one has seen that error yet (ERRSEQ_SEEN is clear), the write
> > > ends up being the first to see it and it gets back a 0, even though the
> > > error happened before the sample.
> > >
> > > The above behavior is what we want for the sample that we do at open()
> > > time, but not what's needed for this use-case. We need a new helper that
> > > samples the value regardless of whether it has already been seen:
> > >
> > > errseq_t errseq_peek(errseq_t *eseq)
> > > {
> > > return READ_ONCE(*eseq);
> > > }
> > >
> > > ...but we'll also need to fix up errseq_check to handle differences
> > > between the SEEN bit.
> > >
> > > I'll see if I can spin up a patch for that. Stay tuned.
> >
> > This may not be fixable with the way that NFS is trying to use errseq_t.
> >
> > The fundamental problem is that we need to mark the errseq_t in the
> > mapping as SEEN when we sample it, to ensure that a later error is
> > recorded and not ignored.
> >
> > But...if the error hasn't been reported yet and we mark it SEEN here,
> > and then a later error doesn't occur, then a later open won't have its
> > errseq_t set to 0, and that unseen error could be lost.
> >
> > It's a bit of a pity: as originally envisioned, the errseq_t mechanism
> > would provide for this sort of use case, but we added this patch not
> > long after the original code went in, and it changed those semantics:
> >
> > b4678df184b3 errseq: Always report a writeback error once
> >
> > I don't see a good way to do this using the current errseq_t mechanism,
> > given these competing needs. I'll keep thinking about it though. Maybe
> > we could add some sort of store and forward mechanism for fsync on NFS?
> > That could get rather complex though.
>
> Can/should it be marked SEEN when the initial close(2) from tee(1)
> reports the error?
>
No. Most software doesn't check for errors on close(), and for good
reason: there's no requirement that any data be written back before
close() returns. A successful return is meaningless.
It turns out that NFSv4 (usually) writes back the data before a close
returns, but you don't want to rely on that.
> Part of the reason I had originally asked about `nfs_file_flush' (i.e.
> what close(2) calls) using `file_check_and_advance_wb_err' instead of
> `filemap_check_wb_err' was because I was drawn to comparing
> `nfs_file_flush' against `nfs_file_fsync' as it seems like in the 3.10
> based EL7 kernels, the former used to delegate to the latter (by way
> of `vfs_fsync') and so they had consistent behavior, whereas now they
> do not.
I think the problem is in some of the changes to write that have come
into play since then. They tried to use errseq_t to track errors over a
small window, but the underlying infrastructure is not quite suited for
that at the moment.
I think we can get there though by carving another flag bit out of the
counter in the errseq_t. I'm working on a patch for that now.
Cheers,
--
Jeff Layton <[email protected]>
On Mon, 2023-06-12 at 19:04 +0000, Trond Myklebust wrote:
> On Mon, 2023-06-12 at 13:30 -0400, Jeff Layton wrote:
> > On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
> >
> > >
> > > Got it: I think I see what's happening. filemap_sample_wb_err just
> > > calls
> > > errseq_sample, which does this:
> > >
> > > errseq_t errseq_sample(errseq_t
> > > *eseq)?????????????????????????????????????????????????????????????
> > > {??????????????????????????????????????????????????????????????????
> > > ????????????????????????????????
> > > ??????? errseq_t old =
> > > READ_ONCE(*eseq);??????????????????????????????????????????????????
> > > ?????????
> > > ???????????????????????????????????????????????????????????????????
> > > ????????????????????????????????
> > > ??????? /* If nobody has seen this error yet, then we can be the
> > > first. */?????????????????????????
> > > ??????? if (!(old &
> > > ERRSEQ_SEEN))??????????????????????????????????????????????????????
> > > ????????????
> > > ??????????????? old =
> > > 0;?????????????????????????????????????????????????????????????????
> > > ??????????
> > > ??????? return
> > > old;???????????????????????????????????????????????????????????????
> > > ?????????????????
> > > }?????????????????????????????????????????????????????????
> > >
> > > Because no one has seen that error yet (ERRSEQ_SEEN is clear), the
> > > write
> > > ends up being the first to see it and it gets back a 0, even though
> > > the
> > > error happened before the sample.
> > >
> > > The above behavior is what we want for the sample that we do at
> > > open()
> > > time, but not what's needed for this use-case. We need a new helper
> > > that
> > > samples the value regardless of whether it has already been seen:
> > >
> > > errseq_t errseq_peek(errseq_t *eseq)
> > > {
> > > ????????return READ_ONCE(*eseq);
> > > }
> > >
> > > ...but we'll also need to fix up errseq_check to handle differences
> > > between the SEEN bit.
> > >
> > > I'll see if I can spin up a patch for that. Stay tuned.
> >
> > This may not be fixable with the way that NFS is trying to use
> > errseq_t.
> >
> > The fundamental problem is that we need to mark the errseq_t in the
> > mapping as SEEN when we sample it, to ensure that a later error is
> > recorded and not ignored.
> >
> > But...if the error hasn't been reported yet and we mark it SEEN here,
> > and then a later error doesn't occur, then a later open won't have
> > its
> > errseq_t set to 0, and that unseen error could be lost.
> >
> > It's a bit of a pity: as originally envisioned, the errseq_t
> > mechanism
> > would provide for this sort of use case, but we added this patch not
> > long after the original code went in, and it changed those semantics:
> >
> > ??? b4678df184b3 errseq: Always report a writeback error once
> >
> > I don't see a good way to do this using the current errseq_t
> > mechanism,
> > given these competing needs. I'll keep thinking about it though.
> > Maybe
> > we could add some sort of store and forward mechanism for fsync on
> > NFS?
> > That could get rather complex though.
> >
> > Cheers,
>
> Does RHEL-8 have commit 6c984083ec24, 064109db53ec, d95b26650e86,
> e6005436f6cc, 9641d9bc9b75, and cea9ba7239dc applied?
>
Ben is working on backporting those as we speak. Hopefully we can get
RHEL8's state closer to where upstream is.
I'm also working on a patch for upstream that should give Chris the
expected behavior in this test.
--
Jeff Layton <[email protected]>
On Mon, 2023-06-12 at 15:17 -0400, Jeff Layton wrote:
> On Mon, 2023-06-12 at 13:49 -0400, Chris Perl wrote:
> > On Mon, Jun 12, 2023 at 1:30 PM Jeff Layton <[email protected]>
> > wrote:
> > >
> > > On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
> > >
> > > >
> > > > Got it: I think I see what's happening. filemap_sample_wb_err
> > > > just calls
> > > > errseq_sample, which does this:
> > > >
> > > > errseq_t errseq_sample(errseq_t *eseq)
> > > > {
> > > > errseq_t old = READ_ONCE(*eseq);
> > > >
> > > > /* If nobody has seen this error yet, then we can be
> > > > the first. */
> > > > if (!(old & ERRSEQ_SEEN))
> > > > old = 0;
> > > > return old;
> > > > }
> > > >
> > > > Because no one has seen that error yet (ERRSEQ_SEEN is clear),
> > > > the write
> > > > ends up being the first to see it and it gets back a 0, even
> > > > though the
> > > > error happened before the sample.
> > > >
> > > > The above behavior is what we want for the sample that we do at
> > > > open()
> > > > time, but not what's needed for this use-case. We need a new
> > > > helper that
> > > > samples the value regardless of whether it has already been
> > > > seen:
> > > >
> > > > errseq_t errseq_peek(errseq_t *eseq)
> > > > {
> > > > return READ_ONCE(*eseq);
> > > > }
> > > >
> > > > ...but we'll also need to fix up errseq_check to handle
> > > > differences
> > > > between the SEEN bit.
> > > >
> > > > I'll see if I can spin up a patch for that. Stay tuned.
> > >
> > > This may not be fixable with the way that NFS is trying to use
> > > errseq_t.
> > >
> > > The fundamental problem is that we need to mark the errseq_t in
> > > the
> > > mapping as SEEN when we sample it, to ensure that a later error
> > > is
> > > recorded and not ignored.
> > >
> > > But...if the error hasn't been reported yet and we mark it SEEN
> > > here,
> > > and then a later error doesn't occur, then a later open won't
> > > have its
> > > errseq_t set to 0, and that unseen error could be lost.
> > >
> > > It's a bit of a pity: as originally envisioned, the errseq_t
> > > mechanism
> > > would provide for this sort of use case, but we added this patch
> > > not
> > > long after the original code went in, and it changed those
> > > semantics:
> > >
> > > b4678df184b3 errseq: Always report a writeback error once
> > >
> > > I don't see a good way to do this using the current errseq_t
> > > mechanism,
> > > given these competing needs. I'll keep thinking about it though.
> > > Maybe
> > > we could add some sort of store and forward mechanism for fsync
> > > on NFS?
> > > That could get rather complex though.
> >
> > Can/should it be marked SEEN when the initial close(2) from tee(1)
> > reports the error?
> >
>
> No. Most software doesn't check for errors on close(), and for good
> reason: there's no requirement that any data be written back before
> close() returns. A successful return is meaningless.
>
> It turns out that NFSv4 (usually) writes back the data before a close
> returns, but you don't want to rely on that.
>
> > Part of the reason I had originally asked about `nfs_file_flush'
> > (i.e.
> > what close(2) calls) using `file_check_and_advance_wb_err' instead
> > of
> > `filemap_check_wb_err' was because I was drawn to comparing
> > `nfs_file_flush' against `nfs_file_fsync' as it seems like in the
> > 3.10
> > based EL7 kernels, the former used to delegate to the latter (by
> > way
> > of `vfs_fsync') and so they had consistent behavior, whereas now
> > they
> > do not.
>
> I think the problem is in some of the changes to write that have come
> into play since then. They tried to use errseq_t to track errors over
> a
> small window, but the underlying infrastructure is not quite suited
> for
> that at the moment.
>
> I think we can get there though by carving another flag bit out of
> the
> counter in the errseq_t. I'm working on a patch for that now.
>
The current NFS client code tries to do its best to match the
description in the manpages for how errors are reported: we try to
report them exactly once, either in write() or fsync().
We do still return errors on close(), but that kind of opportunistic
error return makes sure to use filemap_check_wb_err() so that we don't
break the write() + fsync() documented semantics.
The issue of picking up errors using errseq_sample() before even any
I/O has been attempted has been raised before, but AFAIK, the current
behaviour does actually match the promises made in the manpages, and it
matches what can happen with other filesystems.
I don't want to special case the NFS client, because that just leads to
people getting confused as to whether or not it will work correctly
with applications such as postgresql.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On 12 Jun 2023, at 15:21, Jeff Layton wrote:
> On Mon, 2023-06-12 at 19:04 +0000, Trond Myklebust wrote:
>>
>> Does RHEL-8 have commit 6c984083ec24, 064109db53ec, d95b26650e86,
>> e6005436f6cc, 9641d9bc9b75, and cea9ba7239dc applied?
>>
>
> Ben is working on backporting those as we speak. Hopefully we can get
> RHEL8's state closer to where upstream is.
https://bugzilla.redhat.com/show_bug.cgi?id=2213644
We don't have eager write support in RHEL-8, so 064109db53ec is unneeded.
Ben
On Mon, 2023-06-12 at 19:53 +0000, Trond Myklebust wrote:
> On Mon, 2023-06-12 at 15:17 -0400, Jeff Layton wrote:
> > On Mon, 2023-06-12 at 13:49 -0400, Chris Perl wrote:
> > > On Mon, Jun 12, 2023 at 1:30 PM Jeff Layton <[email protected]>
> > > wrote:
> > > >
> > > > On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
> > > >
> > > > >
> > > > > Got it: I think I see what's happening. filemap_sample_wb_err
> > > > > just calls
> > > > > errseq_sample, which does this:
> > > > >
> > > > > errseq_t errseq_sample(errseq_t *eseq)
> > > > > {
> > > > > errseq_t old = READ_ONCE(*eseq);
> > > > >
> > > > > /* If nobody has seen this error yet, then we can be
> > > > > the first. */
> > > > > if (!(old & ERRSEQ_SEEN))
> > > > > old = 0;
> > > > > return old;
> > > > > }
> > > > >
> > > > > Because no one has seen that error yet (ERRSEQ_SEEN is clear),
> > > > > the write
> > > > > ends up being the first to see it and it gets back a 0, even
> > > > > though the
> > > > > error happened before the sample.
> > > > >
> > > > > The above behavior is what we want for the sample that we do at
> > > > > open()
> > > > > time, but not what's needed for this use-case. We need a new
> > > > > helper that
> > > > > samples the value regardless of whether it has already been
> > > > > seen:
> > > > >
> > > > > errseq_t errseq_peek(errseq_t *eseq)
> > > > > {
> > > > > return READ_ONCE(*eseq);
> > > > > }
> > > > >
> > > > > ...but we'll also need to fix up errseq_check to handle
> > > > > differences
> > > > > between the SEEN bit.
> > > > >
> > > > > I'll see if I can spin up a patch for that. Stay tuned.
> > > >
> > > > This may not be fixable with the way that NFS is trying to use
> > > > errseq_t.
> > > >
> > > > The fundamental problem is that we need to mark the errseq_t in
> > > > the
> > > > mapping as SEEN when we sample it, to ensure that a later error
> > > > is
> > > > recorded and not ignored.
> > > >
> > > > But...if the error hasn't been reported yet and we mark it SEEN
> > > > here,
> > > > and then a later error doesn't occur, then a later open won't
> > > > have its
> > > > errseq_t set to 0, and that unseen error could be lost.
> > > >
> > > > It's a bit of a pity: as originally envisioned, the errseq_t
> > > > mechanism
> > > > would provide for this sort of use case, but we added this patch
> > > > not
> > > > long after the original code went in, and it changed those
> > > > semantics:
> > > >
> > > > b4678df184b3 errseq: Always report a writeback error once
> > > >
> > > > I don't see a good way to do this using the current errseq_t
> > > > mechanism,
> > > > given these competing needs. I'll keep thinking about it though.
> > > > Maybe
> > > > we could add some sort of store and forward mechanism for fsync
> > > > on NFS?
> > > > That could get rather complex though.
> > >
> > > Can/should it be marked SEEN when the initial close(2) from tee(1)
> > > reports the error?
> > >
> >
> > No. Most software doesn't check for errors on close(), and for good
> > reason: there's no requirement that any data be written back before
> > close() returns. A successful return is meaningless.
> >
> > It turns out that NFSv4 (usually) writes back the data before a close
> > returns, but you don't want to rely on that.
> >
> > > Part of the reason I had originally asked about `nfs_file_flush'
> > > (i.e.
> > > what close(2) calls) using `file_check_and_advance_wb_err' instead
> > > of
> > > `filemap_check_wb_err' was because I was drawn to comparing
> > > `nfs_file_flush' against `nfs_file_fsync' as it seems like in the
> > > 3.10
> > > based EL7 kernels, the former used to delegate to the latter (by
> > > way
> > > of `vfs_fsync') and so they had consistent behavior, whereas now
> > > they
> > > do not.
> >
> > I think the problem is in some of the changes to write that have come
> > into play since then. They tried to use errseq_t to track errors over
> > a
> > small window, but the underlying infrastructure is not quite suited
> > for
> > that at the moment.
> >
> > I think we can get there though by carving another flag bit out of
> > the
> > counter in the errseq_t. I'm working on a patch for that now.
> >
>
> The current NFS client code tries to do its best to match the
> description in the manpages for how errors are reported: we try to
> report them exactly once, either in write() or fsync().
> We do still return errors on close(), but that kind of opportunistic
> error return makes sure to use filemap_check_wb_err() so that we don't
> break the write() + fsync() documented semantics.
>
> The issue of picking up errors using errseq_sample() before even any
> I/O has been attempted has been raised before, but AFAIK, the current
> behaviour does actually match the promises made in the manpages, and it
> matches what can happen with other filesystems.
> I don't want to special case the NFS client, because that just leads to
> people getting confused as to whether or not it will work correctly
> with applications such as postgresql.
>
The point here would be to bring NFS more into line with how other
filesystems behave. As Chris pointed out, other filesystems don't report
an error on a new write() just because there was an earlier, unseen
writeback error on the same inode.
I think we can achieve this by carving out another flag bit from the
errseq_t counter. I'm building and testing a patch now, and I'll post it
once I'm convinced it's sane.
Cheers,
--
Jeff Layton <[email protected]>
On Mon, 2023-06-12 at 16:20 -0400, Jeff Layton wrote:
> On Mon, 2023-06-12 at 19:53 +0000, Trond Myklebust wrote:
> > On Mon, 2023-06-12 at 15:17 -0400, Jeff Layton wrote:
> > > On Mon, 2023-06-12 at 13:49 -0400, Chris Perl wrote:
> > > > On Mon, Jun 12, 2023 at 1:30 PM Jeff Layton
> > > > <[email protected]>
> > > > wrote:
> > > > >
> > > > > On Mon, 2023-06-12 at 11:58 -0400, Jeff Layton wrote:
> > > > >
> > > > > >
> > > > > > Got it: I think I see what's happening.
> > > > > > filemap_sample_wb_err
> > > > > > just calls
> > > > > > errseq_sample, which does this:
> > > > > >
> > > > > > errseq_t errseq_sample(errseq_t *eseq)
> > > > > > {
> > > > > > errseq_t old = READ_ONCE(*eseq);
> > > > > >
> > > > > > /* If nobody has seen this error yet, then we can
> > > > > > be
> > > > > > the first. */
> > > > > > if (!(old & ERRSEQ_SEEN))
> > > > > > old = 0;
> > > > > > return old;
> > > > > > }
> > > > > >
> > > > > > Because no one has seen that error yet (ERRSEQ_SEEN is
> > > > > > clear),
> > > > > > the write
> > > > > > ends up being the first to see it and it gets back a 0,
> > > > > > even
> > > > > > though the
> > > > > > error happened before the sample.
> > > > > >
> > > > > > The above behavior is what we want for the sample that we
> > > > > > do at
> > > > > > open()
> > > > > > time, but not what's needed for this use-case. We need a
> > > > > > new
> > > > > > helper that
> > > > > > samples the value regardless of whether it has already been
> > > > > > seen:
> > > > > >
> > > > > > errseq_t errseq_peek(errseq_t *eseq)
> > > > > > {
> > > > > > return READ_ONCE(*eseq);
> > > > > > }
> > > > > >
> > > > > > ...but we'll also need to fix up errseq_check to handle
> > > > > > differences
> > > > > > between the SEEN bit.
> > > > > >
> > > > > > I'll see if I can spin up a patch for that. Stay tuned.
> > > > >
> > > > > This may not be fixable with the way that NFS is trying to
> > > > > use
> > > > > errseq_t.
> > > > >
> > > > > The fundamental problem is that we need to mark the errseq_t
> > > > > in
> > > > > the
> > > > > mapping as SEEN when we sample it, to ensure that a later
> > > > > error
> > > > > is
> > > > > recorded and not ignored.
> > > > >
> > > > > But...if the error hasn't been reported yet and we mark it
> > > > > SEEN
> > > > > here,
> > > > > and then a later error doesn't occur, then a later open won't
> > > > > have its
> > > > > errseq_t set to 0, and that unseen error could be lost.
> > > > >
> > > > > It's a bit of a pity: as originally envisioned, the errseq_t
> > > > > mechanism
> > > > > would provide for this sort of use case, but we added this
> > > > > patch
> > > > > not
> > > > > long after the original code went in, and it changed those
> > > > > semantics:
> > > > >
> > > > > b4678df184b3 errseq: Always report a writeback error once
> > > > >
> > > > > I don't see a good way to do this using the current errseq_t
> > > > > mechanism,
> > > > > given these competing needs. I'll keep thinking about it
> > > > > though.
> > > > > Maybe
> > > > > we could add some sort of store and forward mechanism for
> > > > > fsync
> > > > > on NFS?
> > > > > That could get rather complex though.
> > > >
> > > > Can/should it be marked SEEN when the initial close(2) from
> > > > tee(1)
> > > > reports the error?
> > > >
> > >
> > > No. Most software doesn't check for errors on close(), and for
> > > good
> > > reason: there's no requirement that any data be written back
> > > before
> > > close() returns. A successful return is meaningless.
> > >
> > > It turns out that NFSv4 (usually) writes back the data before a
> > > close
> > > returns, but you don't want to rely on that.
> > >
> > > > Part of the reason I had originally asked about
> > > > `nfs_file_flush'
> > > > (i.e.
> > > > what close(2) calls) using `file_check_and_advance_wb_err'
> > > > instead
> > > > of
> > > > `filemap_check_wb_err' was because I was drawn to comparing
> > > > `nfs_file_flush' against `nfs_file_fsync' as it seems like in
> > > > the
> > > > 3.10
> > > > based EL7 kernels, the former used to delegate to the latter
> > > > (by
> > > > way
> > > > of `vfs_fsync') and so they had consistent behavior, whereas
> > > > now
> > > > they
> > > > do not.
> > >
> > > I think the problem is in some of the changes to write that have
> > > come
> > > into play since then. They tried to use errseq_t to track errors
> > > over
> > > a
> > > small window, but the underlying infrastructure is not quite
> > > suited
> > > for
> > > that at the moment.
> > >
> > > I think we can get there though by carving another flag bit out
> > > of
> > > the
> > > counter in the errseq_t. I'm working on a patch for that now.
> > >
> >
> > The current NFS client code tries to do its best to match the
> > description in the manpages for how errors are reported: we try to
> > report them exactly once, either in write() or fsync().
> > We do still return errors on close(), but that kind of
> > opportunistic
> > error return makes sure to use filemap_check_wb_err() so that we
> > don't
> > break the write() + fsync() documented semantics.
> >
> > The issue of picking up errors using errseq_sample() before even
> > any
> > I/O has been attempted has been raised before, but AFAIK, the
> > current
> > behaviour does actually match the promises made in the manpages,
> > and it
> > matches what can happen with other filesystems.
> > I don't want to special case the NFS client, because that just
> > leads to
> > people getting confused as to whether or not it will work correctly
> > with applications such as postgresql.
> >
>
> The point here would be to bring NFS more into line with how other
> filesystems behave. As Chris pointed out, other filesystems don't
> report
> an error on a new write() just because there was an earlier, unseen
> writeback error on the same inode.
That's not quite true.
generic_write_sync() will return whatever vfs_fsync_range() returns, so
anything involving O_DSYNC or O_SYNC or the newer pwritev2() RFW_DSYNC
/ RFW_SYNC flags has the potential to behave just like NFS.
The difference is that NFS can do this with ordinary buffered writes
too.
>
> I think we can achieve this by carving out another flag bit from the
> errseq_t counter. I'm building and testing a patch now, and I'll post
> it
> once I'm convinced it's sane.
>
> Cheers,
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Hi,
On Tue, 13 Jun 2023, at 04:20, Jeff Layton wrote:
> The point here would be to bring NFS more into line with how other
> filesystems behave. As Chris pointed out, other filesystems don't report
> an error on a new write() just because there was an earlier, unseen
> writeback error on the same inode.
>
> I think we can achieve this by carving out another flag bit from the
> errseq_t counter. I'm building and testing a patch now, and I'll post it
> once I'm convinced it's sane.
Just wondering if anything has happened regarding this issue. I saw
"[RFC PATCH] errseq_t: split the ERRSEQ_SEEN flag into two" on the list
but that didn't seem to get any attention.
The current behaviour is really quite surprising because if you have the
following sequence:
1. quota hit or remote disk runs out of space
2. write() returns 0
3. close() [1]
4. space freed
5. write() returns ENOSPC
and then read the file, you'll see the contents from the write in (5)
and *not* the write in (2), even though the write in (5) is the one that
returned an error.
[1]: this returns ENOSPC too, but it seems like we're assuming applications
don't check the result of close()
--
Hao Wei