by NeilBrown

Subject: Re: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

On Wed, Apr 12 2017, Jeff Layton wrote:

> On Wed, 2017-04-12 at 07:38 -0700, Matthew Wilcox wrote:
>> On Wed, Apr 12, 2017 at 09:01:34AM -0400, Jeff Layton wrote:
>> > On Wed, 2017-04-12 at 08:06 -0400, Jeff Layton wrote:
>> > > Not sure what to do here just yet.
>> > >
>> > > Signed-off-by: Jeff Layton <[email protected]>
>> > > ---
>> > > mm/page-writeback.c | 6 ++++++
>> > > 1 file changed, 6 insertions(+)
>> > >
>> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> > > index de0dbf12e2c1..3ac8399dc984 100644
>> > > --- a/mm/page-writeback.c
>> > > +++ b/mm/page-writeback.c
>> > > @@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
>> > >  		ret = mapping->a_ops->writepage(page, &wbc);
>> > >  		if (ret == 0) {
>> > >  			wait_on_page_writeback(page);
>> > > +			/*
>> > > +			 * FIXME: is this racy? What guarantees that PG_error
>> > > +			 * will still be set once we get around to checking it?
>> > > +			 * What if writeback fails, but then a read is issued
>> > > +			 * before we check this, and that calls ClearPageError?
>> > > +			 */
>> > >  			if (PageError(page))
>> > >  				ret = -EIO;
>> > >  		}
>> >
>> > Ahh, we are always under the page lock here, and this is generally used
>> > for writing out directory pages anyway. I'm fine with dropping this
>> > patch unless someone else sees a problem here.
>>
>> ->writepage drops the page lock. We're still holding a refcount on this
>> page, but that's not going to prevent read being called. But maybe the
>> filesystem won't call read on a page that's marked as PageError?
>
> Hard to be sure there. I really wonder if that check is needed at all,
> the more I look at it. After all, we are calling writepage with
> WB_SYNC_ALL so we should get an error there.

WB_SYNC_ALL doesn't cause writepage to wait. It might cause it to ask
for REQ_SYNC, so the write request gets priority in the block layer.
WB_SYNC_ALL does cause writepages (with an 's') to wait.
(At least, that is how I read the code).

>
> Is it also possible these pages could be written back before that point
> (due to memory pressure or something) and that fail?

Probably, in which case clear_page_dirty_for_io() will fail and
write_one_page() will just unlock the page.

>
> Maybe we should just have a call to filemap_check_errors on exiting
> this function?

I'm leaning in that direction.

>
> With the wb_err_t based stuff, we could change it to sample the
> wb_err early, and then use that to see if an error has occurred since
> then. Maybe we should even allow callers to pass a wb_err_t in here, so
> we can report errors that have occurred since a known point?

That feels to me like over-engineering. We would need to
unconditionally call writepage() for that to work.

We seem to be agreed that write errors for buffered writes are reported
per-address-space. To get per-page errors you have to use direct IO.
Let's focus on that policy and make it work.

Thanks,
NeilBrown

2017-04-12 21:56:17

by NeilBrown

Subject: Re: [PATCH v2 07/17] fs: new infrastructure for writeback error handling and reporting

On Wed, Apr 12 2017, Jeff Layton wrote:

> +void __filemap_set_wb_error(struct address_space *mapping, int err)

I was really hoping that this would be

void __set_wb_error(wb_err_t *wb_err, int err)

so that nfs_context_set_write_error could become

static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
{
	__set_wb_error(&ctx->wb_err, error);
}

and filemap_set_wb_error() would be:

static inline void filemap_set_wb_error(struct address_space *mapping, int err)
{
	/* Optimize for the common case of no error */
	if (unlikely(err))
		__set_wb_error(&mapping->f_wb_err, err);
}

Similarly we would have

wb_err_t sample_wb_error(wb_err_t *wb_err)
{
	...
}

and

wb_err_t filemap_sample_wb_error(struct address_space *mapping)
{
	return sample_wb_error(&mapping->f_wb_err);
}

so nfs_file_fsync_commit() could have

	ret = sample_wb_error(&ctx->wb_err);

in place of

	ret = xchg(&ctx->error, 0);

	int filemap_report_wb_error(struct file *file)

would become

	int filemap_report_wb_error(struct file *file, wb_err_t *err)

or something.

The address space is just one (obvious) place where the wb error can be
stored. The filesystem might have a different place with finer
granularity (nfs already does).
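
The layering proposed above can be sketched in userspace C. This is a
minimal illustration under stated assumptions, not the patch's code: the
bit layout (WB_ERR_ERRNO_MASK, WB_ERR_CTR_INC), check_wb_error(), and the
stripped-down structs are hypothetical, set_wb_error() stands in for the
proposed __set_wb_error(), and there is no atomicity or SEEN-bit handling.
The only point is that the helpers take a wb_err_t pointer, so a mapping
and an nfs_open_context can each hold their own value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; the layout below is an assumption,
 * not the encoding used by the patch set. */
typedef uint32_t wb_err_t;

#define WB_ERR_ERRNO_MASK 0x0fffu	/* assumed: low 12 bits hold the errno */
#define WB_ERR_CTR_INC    0x1000u	/* assumed: remaining bits count errors */

struct address_space    { wb_err_t f_wb_err; };	/* per-mapping storage */
struct nfs_open_context { wb_err_t wb_err; };	/* per-open storage */

/* Generic setter, standing in for the proposed __set_wb_error(). */
static void set_wb_error(wb_err_t *wb_err, int err)
{
	assert(err < 0 && -err <= (int)WB_ERR_ERRNO_MASK);

	/* Bump the counter so the value changes even if the errno repeats. */
	wb_err_t counter = (*wb_err & ~WB_ERR_ERRNO_MASK) + WB_ERR_CTR_INC;
	*wb_err = counter | (wb_err_t)(-err);
}

/* Generic sampler: a snapshot to compare against later. */
static wb_err_t sample_wb_error(const wb_err_t *wb_err)
{
	return *wb_err;
}

/* Hypothetical helper: negative errno if anything was recorded since
 * 'since' was sampled, else 0. */
static int check_wb_error(const wb_err_t *wb_err, wb_err_t since)
{
	if (*wb_err == since)
		return 0;
	return -(int)(*wb_err & WB_ERR_ERRNO_MASK);
}

/* Thin wrappers: the address_space is just one place to keep a wb_err_t. */
static void filemap_set_wb_error(struct address_space *mapping, int err)
{
	if (err)	/* common case is no error: nothing to do */
		set_wb_error(&mapping->f_wb_err, err);
}

static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
{
	set_wb_error(&ctx->wb_err, error);
}
```

With this shape, an error recorded on one open context is not visible
through a second context, and the filemap_* wrappers stay one-liners.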

> +wb_err_t filemap_sample_wb_error(struct address_space *mapping)
> +{
> +	wb_err_t old = READ_ONCE(mapping->wb_err);
> +	wb_err_t new = old;
> +
> +	/*
> +	 * For the common case of no errors ever having been set, we can skip
> +	 * marking the SEEN bit. Once an error has been set, the value will
> +	 * never go back to zero.
> +	 */
> +	if (old != 0) {
> +		new |= WB_ERR_SEEN;
> +		if (old != new)
> +			cmpxchg(&mapping->wb_err, old, new);
> +	}
> +	return new;
> +}

I do like how the use of cmpxchg works out here - no looping!
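
One reading of why a single cmpxchg is enough: if the exchange fails,
either another sampler has already set the SEEN bit (so the goal is met),
or a new error was recorded in the meantime (so the stored value now
differs from the sample, and a later comparison will notice). A userspace
sketch with C11 atomics follows. The value chosen for WB_ERR_SEEN and the
bare _Atomic variable in place of mapping->wb_err are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t wb_err_t;

/* Assumed flag bit; the real value of WB_ERR_SEEN is not shown in the
 * quoted hunk. */
#define WB_ERR_SEEN 0x1u

static wb_err_t sample_wb_error(_Atomic wb_err_t *wb_err)
{
	assert(wb_err);

	wb_err_t old = atomic_load(wb_err);
	wb_err_t marked = old;

	/* Zero means no error has ever been set: nothing to mark. */
	if (old != 0) {
		marked |= WB_ERR_SEEN;
		if (old != marked) {
			/*
			 * One attempt only. On failure the stored value has
			 * already moved on, so retrying would gain nothing.
			 */
			atomic_compare_exchange_strong(wb_err, &old, marked);
		}
	}
	return marked;
}
```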

Thanks
NeilBrown

2017-04-12 22:15:15

On Wed, Apr 12 2017, Jeff Layton wrote:

> On Thu, 2017-04-13 at 07:55 +1000, NeilBrown wrote:
>> On Wed, Apr 12 2017, Jeff Layton wrote:
>>
>>
>> > +void __filemap_set_wb_error(struct address_space *mapping, int err)
>>
>> I was really hoping that this would be
>>
>> void __set_wb_error(wb_err_t *wb_err, int err)
>>
>> so that nfs_context_set_write_error could become
>>
>> static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
>> {
>> __set_wb_error(&ctx->wb_err, error);
>> }
>>
>> and filemap_set_wb_error() would be:
>>
>> static inline void filemap_set_wb_error(struct address_space *mapping, int err)
>> {
>> /* Optimize for the common case of no error */
>> if (unlikely(err))
>> __set_wb_error(&mapping->f_wb_err, err);
>> }
>>
>> Similarly we would have
>> wb_err_t sample_wb_error(wb_err_t *wb_err)
>> {
>> ...
>> }
>>
>> and
>>
>> wb_err_t filemap_sample_wb_error(struct address_space *mapping)
>> {
>> return sample_wb_error(&mapping->f_wb_err);
>> }
>>
>> so nfs_file_fsync_commit() could have
>> ret = sample_wb_error(&ctx->wb_err);
>> in place of
>> ret = xchg(&ctx->error, 0);
>>
>> int filemap_report_wb_error(struct file *file)
>>
>> would become
>>
>> int filemap_report_wb_error(struct file *file, wb_err_t *err)
>>
>> or something.
>>
>> The address space is just one (obvious) place where the wb error can be
>> stored. The filesystem might have a different place with finer
>> granularity (nfs already does).
>>
>>
>
> I think it'd be much simpler to adapt NFS over to use the new
> infrastructure (I have a draft patch for that already). You'd lose the
> ability to track a different error for each nfs_open_context, but I'm
> not sure how valuable that is anyway. I'll need to think about that
> one...

From a technical perspective, it might be "simpler" but I contest "much
simpler". I think it would be easy to put one wb_err_t per
nfs_open_context, if the former were designed well (which itself would
be easy).

From a political perspective, I doubt it would be simple. NFS is the
way it is for a reason, and convincing an author that their reason is
not valid tends to be harder than most technical issues.
(looking to history...
the 'error' field was added to the nfs_open_context in
Commit: 6caf69feb23a ("NFSv2/v3/v4: Place NFS nfs_page shared data into a single structure that hangs off filp->private_data. As a side effect, this also cleans up the NFSv4 private file state info.")

in 2.6.12. Prior to that file->f_error was used.
Prior to commit 9ffb8c3a1955 ("Import 2.2.3pre1") (which has no comment)
errors were ... interesting. Look for nfs_check_error in
commit d9c0ffee4db7 ("Import 2.1.128") and notice the use of current->pid!!
All commits from the history.git tree.
)

It is quite possible for an NFS server to return different errors to
different users. It might be odd, but it is possible. Should an error
that affects one user pollute all other users?

Thanks,
NeilBrown

2017-04-17 22:56:25

On Fri, Apr 21 2017, Jeff Layton wrote:

> On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
>> On Wed, Apr 12 2017, Jeff Layton wrote:
>>
>> > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
>> > >
>> > > I suspect that the filemap_check_wb_error() will need to be moved
>> > > into some parent of the current call site, which is essentially what you
>> > > suggest below. It would be nice if we could do that first, rather than
>> > > having the current rather odd code. But maybe this way is an easier
>> > > transition. It isn't obviously wrong, it just isn't obviously right
>> > > either.
>> > >
>> >
>> > Yeah. It's just such a daunting task to have to change so much of the
>> > existing code. I'm looking for ways to make this simpler.
>> >
>> > I think it probably is reasonable for filemap_write_and_wait* to just
>> > sample it as early as possible in those functions. filemap_fdatawait is
>> > the real questionable one, as you may have already had some writebacks
>> > complete with errors.
>> >
>> > In any case, my thinking was that the old code is not obviously correct
>> > either, so while this shortens the "error capture window" on these
>> > calls, it seems like a reasonable place to start improving things.
>>
>> I agree. It wouldn't hurt to add a note to this effect in the patch
>> comment so that people understand that the code isn't seen to be
>> "correct" but only "no worse" with clear direction on what sort of
>> improvement might be appropriate.
>>
>
> I've got a cleaned-up set that is getting close to ready for
> reposting. Before I do though, I think there is another option here
> that's worth discussing.
>
> We could store a second wb_err_t (aka errseq_t in the new set) in the
> mapping that would basically act as a "cursor" for these cases.
> filemap_check_errors would need to do something like
> filemap_report_wb_error, but it would swap the value into the mapping's
> cursor instead of dealing with the one in struct file.
>
> I don't really like adding yet another field here, but the struct
> address_space definition has this:
>
> __attribute__((aligned(sizeof(long))));
>
> Adding the wb_err field means that we end up growing the struct by 8
> bytes on x86_64 anyway. Adding another 4 bytes would just consume the
> pad, so it wouldn't cost anything there. YMMV on other arches of
> course.
>
> That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
> flags, but is probably close enough not to matter.
>
> So...this would let us limp along for even longer with the model of
> reporting since last check. I'm not sure that's a good thing though. A
> long term goal here is to have kernel code that's dealing with
> writeback be more deliberate about the point from which it's checking
> errors, and this doesn't help promote that.
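
The padding arithmetic in the quoted message can be checked with a toy
struct. This is only a sketch: the three structs below are hypothetical
stand-ins, not the real struct address_space, the field names are made up,
and the 8-byte/0-byte result assumes an LP64 target such as x86_64. The
aligned attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t wb_err_t;
static_assert(sizeof(wb_err_t) == 4, "sketch assumes a 4-byte wb_err_t");

#define LONG_ALIGNED __attribute__((aligned(sizeof(long))))

/* Toy layout that ends on a long-sized boundary, before any new field. */
struct as_before {
	unsigned long flags;
	void *private_data;
} LONG_ALIGNED;

/* One 4-byte field added: the size is rounded up to the alignment. */
struct as_one_field {
	unsigned long flags;
	void *private_data;
	wb_err_t wb_err;
} LONG_ALIGNED;

/* A second 4-byte field (hypothetical cursor) lands in the tail padding. */
struct as_two_fields {
	unsigned long flags;
	void *private_data;
	wb_err_t wb_err;
	wb_err_t wb_cursor;
} LONG_ALIGNED;

/* Bytes added to the struct by the first and by the second field. */
static size_t cost_of_first_field(void)
{
	return sizeof(struct as_one_field) - sizeof(struct as_before);
}

static size_t cost_of_second_field(void)
{
	return sizeof(struct as_two_fields) - sizeof(struct as_one_field);
}
```

On a 32-bit target each field would instead cost its own 4 bytes, which
matches the "YMMV on other arches" caveat.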

I think this question needs some input from filesystem developers who
might be affected by the answer.

My preference is to not add this field. I think we would eventually
want to remove it again, and it is easier to ensure it doesn't stay
forever if it is never added.
The version without this field isn't (I think) too bad, but maybe it is
bad enough to motivate fs developers to create a better solution in each
individual case.

If some filesystem developer says they don't like that sort of social
engineering, or objects for any other reason, I will bow to the superior
stake they hold.

NeilBrown

2017-04-24 11:50:36

by Jeff Layton

Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Mon, 2017-04-24 at 08:38 +1000, NeilBrown wrote:
> On Fri, Apr 21 2017, Jeff Layton wrote:
>
> > On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
> > > On Wed, Apr 12 2017, Jeff Layton wrote:
> > >
> > > > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> > > > >
> > > > > I suspect that the filemap_check_wb_error() will need to be moved
> > > > > into some parent of the current call site, which is essentially what you
> > > > > suggest below. It would be nice if we could do that first, rather than
> > > > > having the current rather odd code. But maybe this way is an easier
> > > > > transition. It isn't obviously wrong, it just isn't obviously right
> > > > > either.
> > > > >
> > > >
> > > > Yeah. It's just such a daunting task to have to change so much of the
> > > > existing code. I'm looking for ways to make this simpler.
> > > >
> > > > I think it probably is reasonable for filemap_write_and_wait* to just
> > > > sample it as early as possible in those functions. filemap_fdatawait is
> > > > the real questionable one, as you may have already had some writebacks
> > > > complete with errors.
> > > >
> > > > In any case, my thinking was that the old code is not obviously correct
> > > > either, so while this shortens the "error capture window" on these
> > > > calls, it seems like a reasonable place to start improving things.
> > >
> > > I agree. It wouldn't hurt to add a note to this effect in the patch
> > > comment so that people understand that the code isn't seen to be
> > > "correct" but only "no worse" with clear direction on what sort of
> > > improvement might be appropriate.
> > >
> >
> > I've got a cleaned-up set that is getting close to ready for
> > reposting. Before I do though, I think there is another option here
> > that's worth discussing.
> >
> > We could store a second wb_err_t (aka errseq_t in the new set) in the
> > > mapping that would basically act as a "cursor" for these cases.
> > filemap_check_errors would need to do something like
> > filemap_report_wb_error, but it would swap the value into the mapping's
> > cursor instead of dealing with the one in struct file.
> >
> > I don't really like adding yet another field here, but the struct
> > address_space definition has this:
> >
> > __attribute__((aligned(sizeof(long))));
> >
> > Adding the wb_err field means that we end up growing the struct by 8
> > bytes on x86_64 anyway. Adding another 4 bytes would just consume the
> > pad, so it wouldn't cost anything there. YMMV on other arches of
> > course.
> >
> > That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
> > flags, but is probably close enough not to matter.
> >
> > So...this would let us limp along for even longer with the model of
> > reporting since last check. I'm not sure that's a good thing though. A
> > long term goal here is to have kernel code that's dealing with
> > writeback be more deliberate about the point from which it's checking
> > errors, and this doesn't help promote that.
>
> I think this question needs some input from filesystem developers who
> might be affected by the answer.
>
> My preference is to not add this field. I think we would eventually
> want to remove it again, and it is easier to ensure it doesn't stay
> forever if it is never added.
> The version without this field isn't (I think) too bad, but maybe it is
> bad enough to motivate fs developers to create a better solution in each
> individual case.
>
> If some filesystem developer says they don't like that sort of social
> engineering, or objects for any other reason, I will bow to the superior
> stake they hold.
>
>

That's pretty much my view too. I just figured I needed to throw the
option out there in the interest of full disclosure.

I think keeping a per-mapping cursor like this does make sense in some
situations though. For instance, there does seem to be quite a bit of
local fs journaling code that goes through the pagecache. For those, I
could see keeping the cursor in some sort of per-journal structure, and
doing a check-and-advance against that in appropriate places.

This is an option we can bring up for folks who do want to continue to
use a similar error tracking model in these situations though.
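
A check-and-advance against a privately held cursor might look roughly
like the sketch below. Everything here is an assumption made for
illustration: struct mapping and struct journal are hypothetical stand-ins,
the bit layout is invented, record_wb_error() and check_and_advance() are
made-up names, and the code is single-threaded with no atomics.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t wb_err_t;

#define WB_ERR_ERRNO_MASK 0x0fffu	/* assumed: low 12 bits hold the errno */
#define WB_ERR_CTR_INC    0x1000u	/* assumed: remaining bits count errors */

struct mapping { wb_err_t wb_err; };	/* stand-in for the address_space */
struct journal { wb_err_t wb_cursor; };	/* hypothetical per-journal cursor */

/* Writeback side: record an error and bump the counter. */
static void record_wb_error(struct mapping *m, int err)
{
	assert(err < 0 && -err <= (int)WB_ERR_ERRNO_MASK);
	m->wb_err = ((m->wb_err & ~WB_ERR_ERRNO_MASK) + WB_ERR_CTR_INC)
		    | (wb_err_t)(-err);
}

/*
 * Report an error only if one was recorded after the cursor was last
 * advanced, then move the cursor up so it is reported once.
 */
static int check_and_advance(const struct mapping *m, wb_err_t *cursor)
{
	wb_err_t cur = m->wb_err;

	if (cur == *cursor)
		return 0;
	*cursor = cur;
	return -(int)(cur & WB_ERR_ERRNO_MASK);
}
```

Because the cursor lives in the journal structure, the journal code sees
each error once without consuming it for anyone holding a different cursor.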
--
Jeff Layton <[email protected]>