The errseq_t infrastructure assumes that errors which occurred before
the file descriptor was opened are of no interest to the application.
This turns out to be a regression for some applications, notably Postgres.
Before errseq_t, a writeback error would be reported exactly once (as
long as the inode remained in memory), so Postgres could open a file,
call fsync() and find out whether there had been a writeback error on
that file from another process.
This patch restores that behaviour by reporting errors to file descriptors
which are opened after the error occurred, but before it was reported
to any file descriptor.
Cc: [email protected]
Fixes: 5660e13d2fd6 ("fs: new infrastructure for writeback error handling and reporting")
Signed-off-by: Matthew Wilcox <[email protected]>
diff --git a/lib/errseq.c b/lib/errseq.c
index df782418b333..093f1fba4ee0 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
errseq_t errseq_sample(errseq_t *eseq)
{
errseq_t old = READ_ONCE(*eseq);
- errseq_t new = old;
- /*
- * For the common case of no errors ever having been set, we can skip
- * marking the SEEN bit. Once an error has been set, the value will
- * never go back to zero.
- */
- if (old != 0) {
- new |= ERRSEQ_SEEN;
- if (old != new)
- cmpxchg(eseq, old, new);
- }
- return new;
+ /* If nobody has seen this error yet, then we can be the first. */
+ if (!(old & ERRSEQ_SEEN))
+ old = 0;
+ return old;
}
EXPORT_SYMBOL(errseq_sample);
Hi,
On 2018-04-23 13:42:08 -0700, Matthew Wilcox wrote:
> The errseq_t infrastructure assumes that errors which occurred before
> the file descriptor was opened are of no interest to the application.
> This turns out to be a regression for some applications, notably Postgres.
>
> Before errseq_t, a writeback error would be reported exactly once (as
> long as the inode remained in memory), so Postgres could open a file,
> call fsync() and find out whether there had been a writeback error on
> that file from another process.
>
> This patch restores that behaviour by reporting errors to file descriptors
> which are opened after the error occurred, but before it was reported
> to any file descriptor.
>
> Cc: [email protected]
> Fixes: 5660e13d2fd6 ("fs: new infrastructure for writeback error handling and reporting")
> Signed-off-by: Matthew Wilcox <[email protected]>
>
> diff --git a/lib/errseq.c b/lib/errseq.c
> index df782418b333..093f1fba4ee0 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
> errseq_t errseq_sample(errseq_t *eseq)
> {
There's a comment above this:
*
* This function allows callers to sample an errseq_t value, marking it as
* "seen" if required.
> errseq_t old = READ_ONCE(*eseq);
> - errseq_t new = old;
>
> - /*
> - * For the common case of no errors ever having been set, we can skip
> - * marking the SEEN bit. Once an error has been set, the value will
> - * never go back to zero.
> - */
> - if (old != 0) {
> - new |= ERRSEQ_SEEN;
> - if (old != new)
> - cmpxchg(eseq, old, new);
> - }
> - return new;
Which seems not to be true anymore after this hunk.
> + /* If nobody has seen this error yet, then we can be the first. */
> + if (!(old & ERRSEQ_SEEN))
> + old = 0;
> + return old;
> }
> EXPORT_SYMBOL(errseq_sample);
I've never really looked at this code in any depth before, but won't
this potentially lead to the same error being reported on multiple FDs?
Imagine two fds (potentially in different processes) getting the 0
returned by errseq_sample() because it's not ERRSEQ_SEEN. Afaict
file_check_and_advance_wb_err() will return an error that's always
unlike 0 in that case, and thus the error will returned on both fds?
I'm personally perfectly fine with that, but it's not necessarily what's
described as desired in your email?.
Greetings,
Andres Freund
On Mon, Apr 23, 2018 at 01:57:30PM -0700, Andres Freund wrote:
> On 2018-04-23 13:42:08 -0700, Matthew Wilcox wrote:
> > @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
> > errseq_t errseq_sample(errseq_t *eseq)
> > {
>
> There's a comment above this:
> *
> * This function allows callers to sample an errseq_t value, marking it as
> * "seen" if required.
Oh, good catch. I'll fix that. Thanks!
> I've never really looked at this code in any depth before, but won't
> this potentially lead to the same error being reported on multiple FDs?
> Imagine two fds (potentially in different processes) getting the 0
> returned by errseq_sample() because it's not ERRSEQ_SEEN. Afaict
> file_check_and_advance_wb_err() will return an error that's always
> unlike 0 in that case, and thus the error will returned on both fds?
>
> I'm personally perfectly fine with that, but it's not necessarily what's
> described as desired in your email?.
That's what I was trying to say with this paragraph:
> > This patch restores that behaviour by reporting errors to file descriptors
> > which are opened after the error occurred, but before it was reported
> > to any file descriptor.
Maybe it was a little unclear? I'd appreciate a suggestion on making
it clearer.
I think this behaviour is perfectly justifiable. Writeback errors occur
asynchronously to open. Userspace can't tell the difference between:
kernel: writeback
p1: open
p2: open
p1: fsync
p2: fsync
and
p1: open
p2: open
kernel: writeback
p1: fsync
p2: fsync
Since both p1 and p2 would get the writeback error in the second case,
they can both get the writeback error in the first place.
p2 can't rely on this, after all we could have the sequence:
p1: open
p1: fsync
p2: open
p2: fsync
and p2 will not see the error, but it wouldn't have seen the error
before the errseq_t infrastructure went in. And maybe p1 handled the
error three weeks ago; we don't want p2 to see the error.
On 2018-04-23 14:43:48 -0700, Matthew Wilcox wrote:
> On Mon, Apr 23, 2018 at 01:57:30PM -0700, Andres Freund wrote:
> > I've never really looked at this code in any depth before, but won't
> > this potentially lead to the same error being reported on multiple FDs?
> > Imagine two fds (potentially in different processes) getting the 0
> > returned by errseq_sample() because it's not ERRSEQ_SEEN. Afaict
> > file_check_and_advance_wb_err() will return an error that's always
> > unlike 0 in that case, and thus the error will returned on both fds?
> >
> > I'm personally perfectly fine with that, but it's not necessarily what's
> > described as desired in your email?.
>
> That's what I was trying to say with this paragraph:
>
> > > This patch restores that behaviour by reporting errors to file descriptors
> > > which are opened after the error occurred, but before it was reported
> > > to any file descriptor.
>
> Maybe it was a little unclear? I'd appreciate a suggestion on making
> it clearer.
I think I was thinking of the following paragraph from your commit
message:
> Before errseq_t, a writeback error would be reported exactly once (as
> long as the inode remained in memory), so Postgres could open a file,
> call fsync() and find out whether there had been a writeback error on
> that file from another process.
Note the "exactly once", making "that behaviour" in the following
paragraph potentially refer to exactly once:
> This patch restores that behaviour by reporting errors to file descriptors
> which are opened after the error occurred, but before it was reported
> to any file descriptor.
Just adding a sentence here saying something like "This means that the
error might be reported to more fds than before." or such would address
that potential ambiguity?
> I think this behaviour is perfectly justifiable.
I agree.
Greetings,
Andres Freund
On Mon, Apr 23, 2018 at 02:43:48PM -0700, Matthew Wilcox wrote:
> On Mon, Apr 23, 2018 at 01:57:30PM -0700, Andres Freund wrote:
> > On 2018-04-23 13:42:08 -0700, Matthew Wilcox wrote:
> > > @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
> > > errseq_t errseq_sample(errseq_t *eseq)
> > > {
> >
> > There's a comment above this:
> > *
> > * This function allows callers to sample an errseq_t value, marking it as
> > * "seen" if required.
>
> Oh, good catch. I'll fix that. Thanks!
How does this look?
@@ -111,27 +111,22 @@ EXPORT_SYMBOL(errseq_set);
* errseq_sample() - Grab current errseq_t value.
* @eseq: Pointer to errseq_t to be sampled.
*
- * This function allows callers to sample an errseq_t value, marking it as
- * "seen" if required.
+ * This function allows callers to initialise their errseq_t variable.
+ * If the error has been "seen", new callers will not see an old error.
+ * If there is an unseen error in @eseq, the caller of this function will
+ * see it the next time it checks for an error.
*
+ * Context: Any context.
* Return: The current errseq value.
*/
errseq_t errseq_sample(errseq_t *eseq)
On 2018-04-23 14:51:50 -0700, Matthew Wilcox wrote:
> How does this look?
>
> @@ -111,27 +111,22 @@ EXPORT_SYMBOL(errseq_set);
> * errseq_sample() - Grab current errseq_t value.
> * @eseq: Pointer to errseq_t to be sampled.
> *
> - * This function allows callers to sample an errseq_t value, marking it as
> - * "seen" if required.
> + * This function allows callers to initialise their errseq_t variable.
> + * If the error has been "seen", new callers will not see an old error.
> + * If there is an unseen error in @eseq, the caller of this function will
> + * see it the next time it checks for an error.
> *
> + * Context: Any context.
> * Return: The current errseq value.
> */
> errseq_t errseq_sample(errseq_t *eseq)
LGTM.
On Mon, 2018-04-23 at 13:42 -0700, Matthew Wilcox wrote:
> The errseq_t infrastructure assumes that errors which occurred before
> the file descriptor was opened are of no interest to the application.
> This turns out to be a regression for some applications, notably Postgres.
>
> Before errseq_t, a writeback error would be reported exactly once (as
> long as the inode remained in memory), so Postgres could open a file,
> call fsync() and find out whether there had been a writeback error on
> that file from another process.
>
> This patch restores that behaviour by reporting errors to file descriptors
> which are opened after the error occurred, but before it was reported
> to any file descriptor.
>
> Cc: [email protected]
> Fixes: 5660e13d2fd6 ("fs: new infrastructure for writeback error handling and reporting")
> Signed-off-by: Matthew Wilcox <[email protected]>
>
> diff --git a/lib/errseq.c b/lib/errseq.c
> index df782418b333..093f1fba4ee0 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
> errseq_t errseq_sample(errseq_t *eseq)
> {
> errseq_t old = READ_ONCE(*eseq);
> - errseq_t new = old;
>
> - /*
> - * For the common case of no errors ever having been set, we can skip
> - * marking the SEEN bit. Once an error has been set, the value will
> - * never go back to zero.
> - */
> - if (old != 0) {
> - new |= ERRSEQ_SEEN;
> - if (old != new)
> - cmpxchg(eseq, old, new);
> - }
> - return new;
> + /* If nobody has seen this error yet, then we can be the first. */
> + if (!(old & ERRSEQ_SEEN))
> + old = 0;
> + return old;
> }
> EXPORT_SYMBOL(errseq_sample);
>
Patch looks good to me, modulo the comment fix that Andres pointed out.
Reviewed-by: Jeff Layton <[email protected]>