2006-12-05 06:52:53

by Nick Piggin

Subject: Status of buffered write path (deadlock fixes)

Hi,

I'd like to try to state where we are WRT the buffered write patches,
and ask for comments. Sorry for the wide cc list, but this is an
important issue which hasn't had enough review.

Well the next -mm will include everything we've done so far. I won't
repost patches unless someone would like to comment on a specific one.

I think the core generic_file_buffered_write is fairly robust, after
fixing the efault and zerolength iov problems picked up in testing
(thanks, very helpful!).

So now I *believe* we have an approach that solves the deadlock and
doesn't expose transient or stale data, transient zeroes, or anything
like that.

Error handling is getting close, but there may be cases that nobody
has picked up, and I've noticed a couple which I'll explain below.

I think we do the right thing WRT pagecache error handling: a
!uptodate page remains !uptodate, an uptodate page can handle the
write being done in several parts. Comments in the patches attempt
to explain how this works. I think it is pretty straightforward.

But WRT block allocation in the case of errors, it needs more review.

Block allocation:
- prepare_write can allocate blocks
- prepare_write doesn't need to initialize the pagecache on top of
these blocks where it is within the range specified in prepare_write
(because the copy_from_user will initialise it correctly)
- In the case of a !uptodate page, unless the page is brought uptodate
(ie the copy_from_user completely succeeds) and marked dirty, then
a read that sneaks in after we unlock the page (to retry the write)
will try to bring it uptodate by pulling in the uninitialised blocks.
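
To make that concrete, the per-page step looks roughly like this (a
pseudocode-level sketch of my reading of the patched path, not code lifted
from the patches; the zero_new_buffers() call is only illustrative of the
"zero new buffers" trick and its exact signature is an assumption):

/*
 * One iteration of the buffered write loop (sketch).  The page is locked
 * across all of this, which is why the usercopy can come up short: the
 * source user memory may need a fault we cannot take here.
 */
status = a_ops->prepare_write(file, page, offset, offset + bytes);
if (unlikely(status))
        goto error;             /* blocks may already have been instantiated */

copied = filemap_copy_from_user(page, offset, buf, bytes);
flush_dcache_page(page);

if (unlikely(copied != bytes)) {
        /*
         * Short copy: new buffers under the uncopied tail of the range
         * would otherwise contain junk, so zero them before anyone can
         * read them (illustrative helper).
         */
        zero_new_buffers(page, offset + copied, offset + bytes);
}

status = a_ops->commit_write(file, page, offset, offset + copied);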

Problem 1:
I think that allocating blocks outside i_size is OK WRT uninitialised
data, because we update i_size only after a successful copy. However,
I don't think we trim these blocks off (eg. perhaps the "prepare_write
may have instantiated a few blocks" path should be the normal error
path for both the copy_from_user and the commit_write error cases as
well?)

We allocate blocks within holes, but these don't need to be trimmed: it
is enough to just zero out any new buffers. It might be nicer if we had
some kind of way to punch a hole, but it is a rare corner case.
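
For reference, the existing trim path in mainline generic_file_buffered_write()
is roughly the following (quoted from memory of 2.6.19-era code, with the
AOP_TRUNCATED_PAGE handling omitted); the suggestion above amounts to taking
the same vmtruncate() route after the copy_from_user and commit_write
failures as well:

status = a_ops->prepare_write(file, page, offset, offset + bytes);
if (unlikely(status)) {
        loff_t isize = i_size_read(inode);

        unlock_page(page);
        page_cache_release(page);
        /*
         * prepare_write() may have instantiated a few blocks
         * outside i_size.  Trim these off again.
         */
        if (pos + bytes > isize)
                vmtruncate(inode, isize);
        break;
}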

Problem 2:
nobh error handling[*]. We have just a single buffer that is used for
each block in the prepare_write path, so the "zero new buffers" trick
doesn't work.

I think one solution to this could be to allocate all buffers for the
page like normal, and then strip them off when commit_write succeeds?
This would allow the zero_new_buffers path to work properly.

[*] Actually I think there is a problem with the mainline nobh error
handling in that a whole page of blocks will get zeroed on failure,
even valid data that isn't being touched by the write.

Finally, filesystems. Only OGAWA Hirofumi and Mark Fasheh have given much
feedback so far. I've tried to grok ext2/3 and think they'll work OK, and
have at least *looked* at all the rest. However in the worst case, there
might be many subtle and different problems :( Filesystem developers need
to review this, please. I don't want to cc every filesystem dev list, but
if anybody thinks it would be helpful to forward this then please do.

Well, that's about where it's at. Block allocation problems 1 and 2
shouldn't be too hard to fix, but I would like confirmation / suggestions.

Thanks,
Nick

--
SUSE Labs, Novell Inc.



2006-12-07 19:55:39

by Mark Fasheh

Subject: Re: Status of buffered write path (deadlock fixes)

Hi Nick,

On Tue, Dec 05, 2006 at 05:52:02PM +1100, Nick Piggin wrote:
> Hi,
>
> I'd like to try to state where we are WRT the buffered write patches,
> and ask for comments. Sorry for the wide cc list, but this is an
> important issue which hasn't had enough review.

I pulled broken-out-2006-12-05-01-0.tar.gz from ftp.kernel.org and applied
the following patches to get a reasonable idea of what the final product
would look like:

revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch
revert-generic_file_buffered_write-deadlock-on-vectored-write.patch
generic_file_buffered_write-cleanup.patch
mm-only-mm-debug-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks-comment.patch
mm-fix-pagecache-write-deadlocks-xip.patch
mm-fix-pagecache-write-deadlocks-mm-pagecache-write-deadlocks-efault-fix.patch
mm-fix-pagecache-write-deadlocks-zerolength-fix.patch
mm-fix-pagecache-write-deadlocks-stale-holes-fix.patch
fs-prepare_write-fixes.patch

If this is incorrect, or I should apply further patches, please let me know.

Hopefully my feedback can be of use to you.


> Well the next -mm will include everything we've done so far. I won't
> repost patches unless someone would like to comment on a specific one.
>
> I think the core generic_file_buffered_write is fairly robust, after
> fixing the efault and zerolength iov problems picked up in testing
> (thanks, very helpful!).
>
> So now I *believe* we have an approach that solves the deadlock and
> doesn't expose transient or stale data, transient zeroes, or anything
> like that.

In generic_file_buffered_write() we now do:

status = a_ops->commit_write(file, page, offset,offset+copied);

Which tells the file system to commit only the amount of data that
filemap_copy_from_user() was able to pull in, despite our zeroing of
the newly allocated buffers in the copied != bytes case. Shouldn't we be
doing:

status = a_ops->commit_write(file, page, offset,offset+bytes);

instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
parts of the page which are properly allocated and zero'd but haven't been
copied into yet? I think that in the case of a crash just after the
transaction is closed in ->commit_write(), we might lose those guarantees,
exposing stale data on disk.


> Error handling is getting close, but there may be cases that nobody
> has picked up, and I've noticed a couple which I'll explain below.
>
> I think we do the right thing WRT pagecache error handling: a
> !uptodate page remains !uptodate, an uptodate page can handle the
> write being done in several parts. Comments in the patches attempt
> to explain how this works. I think it is pretty straightforward.
>
> But WRT block allocation in the case of errors, it needs more review.
>
> Block allocation:
> - prepare_write can allocate blocks
> - prepare_write doesn't need to initialize the pagecache on top of
> these blocks where it is within the range specified in prepare_write
> (because the copy_from_user will initialise it correctly)
> - In the case of a !uptodate page, unless the page is brought uptodate
> (ie the copy_from_user completely succeeds) and marked dirty, then
> a read that sneaks in after we unlock the page (to retry the write)
> will try to bring it uptodate by pulling in the uninitialised blocks.

For some reason, I'm not seeing where BH_New is being cleared in case with
no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
to clear the flag somewhere (perhaps in block_commit_write()?).


Ok, that's it for now. I have some thoughts regarding the asymmetry between
ranges passed to ->prepare_write() and ->commit_write(), but I'd like to
save those thoughts until I know whether my comments above uncovered real
issues :)

Thanks,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-12-08 03:29:28

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Mark Fasheh wrote:
> Hi Nick,
>
> On Tue, Dec 05, 2006 at 05:52:02PM +1100, Nick Piggin wrote:
>
>>Hi,
>>
>>I'd like to try to state where we are WRT the buffered write patches,
>>and ask for comments. Sorry for the wide cc list, but this is an
>>important issue which hasn't had enough review.
>
>
> I pulled broken-out-2006-12-05-01-0.tar.gz from ftp.kernel.org and applied
> the following patches to get a reasonable idea of what the final product
> would look like:
>
> revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch
> revert-generic_file_buffered_write-deadlock-on-vectored-write.patch
> generic_file_buffered_write-cleanup.patch
> mm-only-mm-debug-write-deadlocks.patch
> mm-fix-pagecache-write-deadlocks.patch
> mm-fix-pagecache-write-deadlocks-comment.patch
> mm-fix-pagecache-write-deadlocks-xip.patch
> mm-fix-pagecache-write-deadlocks-mm-pagecache-write-deadlocks-efault-fix.patch
> mm-fix-pagecache-write-deadlocks-zerolength-fix.patch
> mm-fix-pagecache-write-deadlocks-stale-holes-fix.patch
> fs-prepare_write-fixes.patch
>
> If this is incorrect, or I should apply further patches, please let me know.
>
> Hopefully my feedback can be of use to you.

That looks right (there are fixes for the "cont" buffer scheme, but they
probably don't bother you).

>>Well the next -mm will include everything we've done so far. I won't
>>repost patches unless someone would like to comment on a specific one.
>>
>>I think the core generic_file_buffered_write is fairly robust, after
>>fixing the efault and zerolength iov problems picked up in testing
>>(thanks, very helpful!).
>>
>>So now I *believe* we have an approach that solves the deadlock and
>>doesn't expose transient or stale data, transient zeroes, or anything
>>like that.
>
>
> In generic_file_buffered_write() we now do:
>
> status = a_ops->commit_write(file, page, offset,offset+copied);
>
> Which tells the file system to commit only the amount of data that
> filemap_copy_from_user() was able to pull in, despite our zeroing of
> the newly allocated buffers in the copied != bytes case. Shouldn't we be
> doing:
>
> status = a_ops->commit_write(file, page, offset,offset+bytes);
>
> instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
> parts of the page which are properly allocated and zero'd but haven't been
> copied into yet? I think that in the case of a crash just after the
> transaction is closed in ->commit_write(), we might lose those guarantees,
> exposing stale data on disk.

No, because we might be talking about buffers that haven't been newly
allocated, but are not uptodate (so the pagecache can contain junk).

We can't zero these guys and do the commit_write, because that exposes
transient zeroes to a concurrent reader.

What we *could* do, is to do the full length commit_write for uptodate
pages, even if the copy is short. But we still need to do a zero-length
commit_write if the page is not uptodate (this would reduce the number
of new cases that need to be considered).

Does a short commit_write cause a problem for filesystems? They can
still do any and all operations they would have with a full-length one.
But maybe it would be better to eliminate that case. OK.

How about a zero-length commit_write? In that case again, they should
be able to remain unchanged *except* that they are not to extend i_size
or mark the page uptodate.
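
Roughly, in code (a sketch of the idea only, not patch code):

if (likely(copied == bytes)) {
        status = a_ops->commit_write(file, page, offset, offset + bytes);
} else if (PageUptodate(page)) {
        /*
         * Existing page contents are valid, so committing the uncopied
         * tail exposes nothing; i_size must still only advance by the
         * bytes actually copied.
         */
        status = a_ops->commit_write(file, page, offset, offset + bytes);
} else {
        /*
         * Short copy into a !uptodate page: commit nothing, don't extend
         * i_size or mark the page uptodate, and retry the copy after
         * unlocking the page.
         */
        status = a_ops->commit_write(file, page, offset, offset);
}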

>>Error handling is getting close, but there may be cases that nobody
>>has picked up, and I've noticed a couple which I'll explain below.
>>
>>I think we do the right thing WRT pagecache error handling: a
>>!uptodate page remains !uptodate, an uptodate page can handle the
>>write being done in several parts. Comments in the patches attempt
>>to explain how this works. I think it is pretty straightforward.
>>
>>But WRT block allocation in the case of errors, it needs more review.
>>
>>Block allocation:
>>- prepare_write can allocate blocks
>>- prepare_write doesn't need to initialize the pagecache on top of
>> these blocks where it is within the range specified in prepare_write
>> (because the copy_from_user will initialise it correctly)
>>- In the case of a !uptodate page, unless the page is brought uptodate
>> (ie the copy_from_user completely succeeds) and marked dirty, then
>> a read that sneaks in after we unlock the page (to retry the write)
>> will try to bring it uptodate by pulling in the uninitialised blocks.
>
>
> For some reason, I'm not seeing where BH_New is being cleared in case with
> no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
> to clear the flag somewhere (perhaps in block_commit_write()?).

Hmm, it is a bit inconsistent. It seems to be anywhere from prepare_write
to block_write_full_page.

Where do filesystems need the bit? It would be nice to clear it in
commit_write if possible. Worst case we'll need a new bit.

> Ok, that's it for now. I have some thoughts regarding the asymmetry between
> ranges passed to ->prepare_write() and ->commit_write(), but I'd like to
> save those thoughts until I know whether my comments above uncovered real
> issues :)

Very helpful, thanks ;)

--
SUSE Labs, Novell Inc.

2006-12-08 23:49:12

by Mark Fasheh

Subject: Re: Status of buffered write path (deadlock fixes)

On Fri, Dec 08, 2006 at 02:28:10PM +1100, Nick Piggin wrote:
> >In generic_file_buffered_write() we now do:
> >
> > status = a_ops->commit_write(file, page, offset,offset+copied);
> >
> >Which tells the file system to commit only the amount of data that
> >filemap_copy_from_user() was able to pull in, despite our zeroing of
> >the newly allocated buffers in the copied != bytes case. Shouldn't we be
> >doing:
> >
> > status = a_ops->commit_write(file, page, offset,offset+bytes);
> >
> >instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
> >parts of the page which are properly allocated and zero'd but haven't been
> >copied into yet? I think that in the case of a crash just after the
> >transaction is closed in ->commit_write(), we might lose those guarantees,
> >exposing stale data on disk.
>
> No, because we might be talking about buffers that haven't been newly
> allocated, but are not uptodate (so the pagecache can contain junk).
>
> We can't zero these guys and do the commit_write, because that exposes
> transient zeroes to a concurrent reader.

Ahh ok - zeroing would populate with incorrect data and we can't write those
buffers because we'd write junk over good data.

And of course, the way it works right now will break ordered write mode.

Sigh.


> What we *could* do, is to do the full length commit_write for uptodate
> pages, even if the copy is short. But we still need to do a zero-length
> commit_write if the page is not uptodate (this would reduce the number
> of new cases that need to be considered).
>
> Does a short commit_write cause a problem for filesystems? They can
> still do any and all operations they would have with a full-length one.

If they've done allocation, yes. You're telling the file system to stop
early in the page, even though there may be BH_New buffers further on which
should be processed (for things like ordered data mode).

Hmm, I think we should just change functions like walk_page_buffers()
in fs/ext3/inode.c and fs/ocfs2/aops.c to look for BH_New buffers outside
the range specified (they walk the entire buffer list anyway). If it finds
one that's buffer_new() it passes it to the journal unconditionally. You'd
also have to revert the change you did in fs/ext3/inode.c to at least always
make the call to walk_page_buffers().
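
As a sketch of what I mean (hypothetical code modelled on ext3's
walk_page_buffers(); the function name, and the choice to skip the
partial-page bookkeeping for BH_New buffers, are just assumptions, not
tested code):

#include <linux/buffer_head.h>
#include <linux/jbd.h>

static int walk_page_buffers_bh_new(handle_t *handle, struct buffer_head *head,
                                    unsigned from, unsigned to, int *partial,
                                    int (*fn)(handle_t *handle,
                                              struct buffer_head *bh))
{
        struct buffer_head *bh;
        unsigned block_start, block_end;
        int err, ret = 0;

        for (bh = head, block_start = 0;
             ret == 0 && (bh != head || !block_start);
             block_start = block_end, bh = bh->b_this_page) {
                block_end = block_start + bh->b_size;
                /*
                 * Buffers overlapping [from, to) are handled as before;
                 * buffers outside that range are *also* passed on if
                 * ->prepare_write() newly allocated them (BH_New set).
                 */
                if (buffer_new(bh) ||
                    (block_start < to && block_end > from)) {
                        err = (*fn)(handle, bh);
                        if (!ret)
                                ret = err;
                } else if (partial && !buffer_uptodate(bh)) {
                        *partial = 1;
                }
        }
        return ret;
}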

I really don't like that we're hiding a detail of this interaction so deep
within the file system commit_write() callback. I suppose we can just do our
best to document it.


> But maybe it would be better to eliminate that case. OK.
> How about a zero-length commit_write? In that case again, they should
> be able to remain unchanged *except* that they are not to extend i_size
> or mark the page uptodate.

If we make the change I described above (looking for BH_New buffers outside
the range passed), then zero length or partial shouldn't matter, but zero
length instead of partial would be nicer imho just for the sake of reducing
the total number of cases down to the entire range or zero length.


> >For some reason, I'm not seeing where BH_New is being cleared in case with
> >no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
> >to clear the flag somewhere (perhaps in block_commit_write()?).
>
> Hmm, it is a bit inconsistent. It seems to be anywhere from prepare_write
> to block_write_full_page.
>
> Where do filesystems need the bit? It would be nice to clear it in
> commit_write if possible. Worst case we'll need a new bit.

->commit_write() would probably do fine. Currently, block_prepare_write()
uses it to know which buffers were newly allocated (the file system specific
get_block_t sets the bit after allocation). I think we could safely move
the clearing of that bit to block_commit_write(), thus still allowing us to
detect and zero those blocks in generic_file_buffered_write()
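
Concretely, something like this, modelled from memory on
__block_commit_write() in fs/buffer.c (treat the surrounding details as
approximate; the one functional addition is the clear_buffer_new()):

#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/mm.h>

static int __block_commit_write(struct inode *inode, struct page *page,
                                unsigned from, unsigned to)
{
        unsigned block_start, block_end;
        unsigned blocksize = 1 << inode->i_blkbits;
        int partial = 0;
        struct buffer_head *bh, *head;

        for (bh = head = page_buffers(page), block_start = 0;
             bh != head || !block_start;
             block_start = block_end, bh = bh->b_this_page) {
                block_end = block_start + blocksize;
                if (block_end <= from || block_start >= to) {
                        if (!buffer_uptodate(bh))
                                partial = 1;
                } else {
                        set_buffer_uptodate(bh);
                        mark_buffer_dirty(bh);
                        clear_buffer_new(bh);   /* proposed: BH_New ends here */
                }
        }

        /* A partial write may still have left the whole page uptodate. */
        if (!partial)
                SetPageUptodate(page);
        return 0;
}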

Thanks,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-12-11 09:12:06

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Mark Fasheh wrote:
> On Fri, Dec 08, 2006 at 02:28:10PM +1100, Nick Piggin wrote:
>
>>>In generic_file_buffered_write() we now do:
>>>
>>> status = a_ops->commit_write(file, page, offset,offset+copied);
>>>
>>>Which tells the file system to commit only the amount of data that
>>>filemap_copy_from_user() was able to pull in, despite our zeroing of
>>>the newly allocated buffers in the copied != bytes case. Shouldn't we be
>>>doing:
>>>
>>> status = a_ops->commit_write(file, page, offset,offset+bytes);
>>>
>>>instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
>>>parts of the page which are properly allocated and zero'd but haven't been
>>>copied into yet? I think that in the case of a crash just after the
>>>transaction is closed in ->commit_write(), we might lose those guarantees,
>>>exposing stale data on disk.
>>
>>No, because we might be talking about buffers that haven't been newly
>>allocated, but are not uptodate (so the pagecache can contain junk).
>>
>>We can't zero these guys and do the commit_write, because that exposes
>>transient zeroes to a concurrent reader.
>
>
> Ahh ok - zeroing would populate with incorrect data and we can't write those
> buffers because we'd write junk over good data.
>
> And of course, the way it works right now will break ordered write mode.
>
> Sigh.

Yes :P

>>What we *could* do, is to do the full length commit_write for uptodate
>>pages, even if the copy is short. But we still need to do a zero-length
>>commit_write if the page is not uptodate (this would reduce the number
>>of new cases that need to be considered).
>>
>>Does a short commit_write cause a problem for filesystems? They can
>>still do any and all operations they would have with a full-length one.
>
>
> If they've done allocation, yes. You're telling the file system to stop
> early in the page, even though there may be BH_New buffers further on which
> should be processed (for things like ordered data mode).

Well they must have done the allocation, right (or be prepared to do the
allocation at a later time), because from the point of view of the fs they
don't know or care whether the copy has succeeded, just so long as the
data is uptodate (ie. zeroed, in the case of a hole).

> Hmm, I think we should just change functions like walk_page_buffers()
> in fs/ext3/inode.c and fs/ocfs2/aops.c to look for BH_New buffers outside
> the range specified (they walk the entire buffer list anyway). If it finds
> one that's buffer_new() it passes it to the journal unconditionally. You'd
> also have to revert the change you did in fs/ext3/inode.c to at least always
> make the call to walk_page_buffers().

Would this satisfy non jbd filesystems, though? How about data journalling
in the case where there are some underlying buffers which are *not* BH_New?

> I really don't like that we're hiding a detail of this interaction so deep
> within the file system commit_write() callback. I suppose we can just do our
> best to document it.

Well, supposing we do the full-length commit in the case of an uptodate
page, then the *only* thing we have to worry about is a zero length commit
to a !uptodate page.

I guess that still has the same block allocation and journalling problems.

>>But maybe it would be better to eliminate that case. OK.
>>How about a zero-length commit_write? In that case again, they should
>>be able to remain unchanged *except* that they are not to extend i_size
>>or mark the page uptodate.
>
>
> If we make the change I described above (looking for BH_New buffers outside
> the range passed), then zero length or partial shouldn't matter, but zero
> length instead of partial would be nicer imho just for the sake of reducing
> the total number of cases down to the entire range or zero length.

We don't want to do zero length, because we might make the theoretical
livelock much easier to hit (eg. in the case of many small iovecs). But
yes we can restrict ourselves to zero-length or full-length.

>>>For some reason, I'm not seeing where BH_New is being cleared in case with
>>>no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
>>>to clear the flag somewhere (perhaps in block_commit_write()?).
>>
>>Hmm, it is a bit inconsistent. It seems to be anywhere from prepare_write
>>to block_write_full_page.
>>
>>Where do filesystems need the bit? It would be nice to clear it in
>>commit_write if possible. Worst case we'll need a new bit.
>
>
> ->commit_write() would probably do fine. Currently, block_prepare_write()
> uses it to know which buffers were newly allocated (the file system specific
> get_block_t sets the bit after allocation). I think we could safely move
> the clearing of that bit to block_commit_write(), thus still allowing us to
> detect and zero those blocks in generic_file_buffered_write()

OK, great, I'll make a few patches and see how they look. What did you
think of those other uninitialised buffer problems in my first email?

--
SUSE Labs, Novell Inc.

2006-12-11 14:21:47

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Nick Piggin wrote:
> Mark Fasheh wrote:

>> If we make the change I described above (looking for BH_New buffers
>> outside the range passed), then zero length or partial shouldn't matter,
>> but zero length instead of partial would be nicer imho just for the sake
>> of reducing the total number of cases down to the entire range or zero
>> length.
>
>
> We don't want to do zero length, because we might make the theoretical
> livelock much easier to hit (eg. in the case of many small iovecs). But
> yes we can restrict ourselves to zero-length or full-length.

On second thoughts, I think I'm wrong about that.

Consider the last page of a file, which is uptodate. A full length
commit, which extends the file, will expose transient zeroes if the
usercopy fails.

--
SUSE Labs, Novell Inc.

2006-12-11 15:53:19

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Nick Piggin wrote:
> Mark Fasheh wrote:

>> ->commit_write() would probably do fine. Currently, block_prepare_write()
>> uses it to know which buffers were newly allocated (the file system
>> specific get_block_t sets the bit after allocation). I think we could
>> safely move the clearing of that bit to block_commit_write(), thus still
>> allowing us to detect and zero those blocks in generic_file_buffered_write()
>
>
> OK, great, I'll make a few patches and see how they look. What did you
> think of those other uninitialised buffer problems in my first email?

Hmm, doesn't look like we can do this either because at least GFS2
uses BH_New for its own special things.

Also, I don't know if the trick of only walking over BH_New buffers
will work anyway, since we may still need to release resources on
other buffers as well. As you say, filesystems are simply not set up
to expect this, which is a problem.

Maybe it isn't realistic to change the API this way, no matter how
bad it is presently.

What if we tackle the problem a different way?

1. In the case of no page in the pagecache (or an otherwise
!uptodate page), if the operation is a full-page write then we
first copy all the user data *then* lock the page *then* insert it
into pagecache and go on to call into the filesystem.

2. In the case of a !uptodate page and a partial-page write, then
we can first bring the page uptodate, then continue (goto 3).

3. In the case of an uptodate page, we could perform a full-length
commit_write so long as we didn't expand i_size further than was
copied, and were sure to trim off blocks allocated past that
point.

This scheme IMO is not as "nice" as the partial commit patches,
but in practical terms it may be much more realistic.

--
SUSE Labs, Novell Inc.

2006-12-11 16:07:44

by Steven Whitehouse

Subject: Re: Status of buffered write path (deadlock fixes)

Hi,

On Tue, 2006-12-12 at 02:52 +1100, Nick Piggin wrote:
> Nick Piggin wrote:
> > Mark Fasheh wrote:
>
> >> ->commit_write() would probably do fine. Currently, block_prepare_write()
> >> uses it to know which buffers were newly allocated (the file system
> >> specific get_block_t sets the bit after allocation). I think we could
> >> safely move the clearing of that bit to block_commit_write(), thus still
> >> allowing us to detect and zero those blocks in generic_file_buffered_write()
> >
> >
> > OK, great, I'll make a few patches and see how they look. What did you
> > think of those other uninitialised buffer problems in my first email?
>
> Hmm, doesn't look like we can do this either because at least GFS2
> uses BH_New for its own special things.
>
What makes you say that? As far as I know we are not doing anything we
shouldn't with this flag, and if we are, then I'm quite happy to
consider fixing it up so that we don't,

Steve.


2006-12-11 16:43:58

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Steven Whitehouse wrote:

>>Hmm, doesn't look like we can do this either because at least GFS2
>>uses BH_New for its own special things.
>>
>
> What makes you say that? As far as I know we are not doing anything we
> shouldn't with this flag, and if we are, then I'm quite happy to
> consider fixing it up so that we don't,

Bad wording. Many other filesystems seem to only make use of buffer_new
between prepare and commit_write.

gfs2 seems to at least test it in a lot of places, so it is hard to know
whether we can change the current semantics or not. I didn't mean that
gfs2 is doing anything wrong.

So can we clear it in commit_write?

--
SUSE Labs, Novell Inc.

2006-12-11 17:13:55

by Steven Whitehouse

Subject: Re: Status of buffered write path (deadlock fixes)

Hi,

On Tue, 2006-12-12 at 03:39 +1100, Nick Piggin wrote:
> Steven Whitehouse wrote:
>
> >>Hmm, doesn't look like we can do this either because at least GFS2
> >>uses BH_New for its own special things.
> >>
> >
> > What makes you say that? As far as I know we are not doing anything we
> > shouldn't with this flag, and if we are, then I'm quite happy to
> > consider fixing it up so that we don't,
>
> Bad wording. Many other filesystems seem to only make use of buffer_new
> between prepare and commit_write.
>
> gfs2 seems to at least test it in a lot of places, so it is hard to know
> whether we can change the current semantics or not. I didn't mean that
> gfs2 is doing anything wrong.
>
> So can we clear it in commit_write?
>
Are you perhaps looking at an older copy of the GFS2 code? We set it (or
clear it) in bmap.c:gfs2_block_map() as appropriate. It also seems to be
cleared when we release buffers (which may or may not be actually
required. I suspect not, but it's harmless anyway). There is only one
test that I can find for it which is in bmap.c:gfs2_extent_map() where
its value is then later ignored in the only caller relevant to "normal"
files, which is ops_vm.c:alloc_page_backing(). So I think you should be
quite safe to clear it in commit_write().

Steve.


2006-12-11 18:17:36

by OGAWA Hirofumi

Subject: Re: Status of buffered write path (deadlock fixes)

Nick Piggin <[email protected]> writes:

> Finally, filesystems. Only OGAWA Hirofumi and Mark Fasheh have given much
> feedback so far. I've tried to grok ext2/3 and think they'll work OK, and
> have at least *looked* at all the rest. However in the worst case, there
> might be many subtle and different problems :( Filesystem developers need
> to review this, please. I don't want to cc every filesystem dev list, but
> if anybody thinks it would be helpful to forward this then please do.

BTW, there are still some from==to users.

fs/affs/file.c:affs_truncate
fs/hfs/extent.c:hfs_file_truncate
fs/hfsplus/extents.c:hfsplus_file_truncate
fs/reiserfs/ioctl.c:reiserfs_unpack

I'll see those this weekend.
--
OGAWA Hirofumi <[email protected]>

2006-12-12 22:31:24

by Mark Fasheh

Subject: Re: Status of buffered write path (deadlock fixes)

On Tue, Dec 12, 2006 at 02:52:26AM +1100, Nick Piggin wrote:
> Nick Piggin wrote:
> >Mark Fasheh wrote:
>
> >>->commit_write() would probably do fine. Currently, block_prepare_write()
> >>uses it to know which buffers were newly allocated (the file system
> >>specific get_block_t sets the bit after allocation). I think we could
> >>safely move the clearing of that bit to block_commit_write(), thus still
> >>allowing us to detect and zero those blocks in generic_file_buffered_write()
> >
> >
> >OK, great, I'll make a few patches and see how they look. What did you
> >think of those other uninitialised buffer problems in my first email?
>
> Hmm, doesn't look like we can do this either because at least GFS2
> uses BH_New for its own special things.
>
> Also, I don't know if the trick of only walking over BH_New buffers
> will work anyway, since we may still need to release resources on
> other buffers as well.

Oh, my idea was that only the range passed to ->commit() would be walked,
but any BH_New buffers (regardless of where they are in the page) would be
passed to the journal as well. So the logic would be:

for all the buffers in the page:
If the buffer is new, or it is within the range passed to commit, pass to
the journal.

Is there anything I'm still missing here?


> As you say, filesystems are simply not set up to expect this, which is a
> problem.
>
> Maybe it isn't realistic to change the API this way, no matter how
> bad it is presently.

We definitely agree. It's not intuitive that the range should change
between ->prepare_write() and ->commit_write() and IMHO, all the issues
we've found are good evidence that this particular approach will be
problematic.


> What if we tackle the problem a different way?
>
> 1. In the case of no page in the pagecache (or an otherwise
> !uptodate page), if the operation is a full-page write then we
> first copy all the user data *then* lock the page *then* insert it
> into pagecache and go on to call into the filesystem.

Silly question - what's preventing a reader from filling the !uptodate page with disk
data while the writer is copying the user buffer into it?


> 2. In the case of a !uptodate page and a partial-page write, then
> we can first bring the page uptodate, then continue (goto 3).
>
> 3. In the case of an uptodate page, we could perform a full-length
> commit_write so long as we didn't expand i_size further than was
> copied, and were sure to trim off blocks allocated past that
> point.
>
> This scheme IMO is not as "nice" as the partial commit patches,
> but in practical terms it may be much more realistic.

It seems more realistic in that it makes sure the write is properly setup
before calling into the file system. What do you think is not as nice about
it?
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-12-13 01:00:25

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Mark Fasheh wrote:
> On Tue, Dec 12, 2006 at 02:52:26AM +1100, Nick Piggin wrote:
>
>>Nick Piggin wrote:
>>
>>Hmm, doesn't look like we can do this either because at least GFS2
>>uses BH_New for its own special things.
>>
>>Also, I don't know if the trick of only walking over BH_New buffers
>>will work anyway, since we may still need to release resources on
>>other buffers as well.
>
>
> Oh, my idea was that only the range passed to ->commit() would be walked,
> but any BH_New buffers (regardless of where they are in the page) would be
> passed to the journal as well. So the logic would be:
>
> for all the buffers in the page:
> If the buffer is new, or it is within the range passed to commit, pass to
> the journal.
>
> Is there anything I'm still missing here?

If the buffer is not new, outside the commit range (and inside the prepare),
then it does not get passed to the journal. I'm not familiar with jbd, but
it seems this could cause a problem eg. in data journalling mode?

>>As you say, filesystems are simply not set up to expect this, which is a
>>problem.
>>
>>Maybe it isn't realistic to change the API this way, no matter how
>>bad it is presently.
>
>
> We definitely agree. It's not intuitive that the range should change
> between ->prepare_write() and ->commit_write() and IMHO, all the issues
> we've found are good evidence that this particular approach will be
> problematic.

Yes.

>>What if we tackle the problem a different way?
>>
>>1. In the case of no page in the pagecache (or an otherwise
>>!uptodate page), if the operation is a full-page write then we
>>first copy all the user data *then* lock the page *then* insert it
>>into pagecache and go on to call into the filesystem.
>
>
> Silly question - what's preventing a reader from filling the !uptodate page with disk
> data while the writer is copying the user buffer into it?

Not silly -- I guess that is the main sticking point. Luckily *most*
!uptodate pages will be ones that we have newly allocated so will
not be in pagecache yet.

If it is in pagecache, we could do one of a number of things: either
remove it or try to bring it uptodate ourselves. I'm not yet sure if
either of these actions will cause other problems, though :P

If both of those are really going to cause problems, then we could
solve this in a more brute force way (assuming that !uptodate, locked
pages, in pagecache at this point are very rare -- AFAIKS these will
only be caused by IO errors?). We could allocate another, temporary
page and copy the contents into there first, then into the target
page after the prepare_write.
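
As a rough illustration of that brute-force path (the helper name and
calling convention are made up, and it is untested; the point is just the
ordering of the copies against the page lock):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/string.h>
#include <asm/uaccess.h>

static int write_via_temp_page(struct file *file, struct page *page,
                               unsigned offset, unsigned bytes,
                               const char __user *buf)
{
        struct page *tmp;
        char *dst;
        int status;

        tmp = alloc_page(GFP_KERNEL);
        if (!tmp)
                return -ENOMEM;

        /* Faulting is fine here: no page lock is held yet. */
        if (copy_from_user(page_address(tmp) + offset, buf, bytes)) {
                __free_page(tmp);
                return -EFAULT;
        }

        lock_page(page);
        status = page->mapping->a_ops->prepare_write(file, page, offset,
                                                     offset + bytes);
        if (!status) {
                /* This copy is from a lowmem kernel page: it cannot fault. */
                dst = kmap_atomic(page, KM_USER0);
                memcpy(dst + offset, page_address(tmp) + offset, bytes);
                kunmap_atomic(dst, KM_USER0);
                flush_dcache_page(page);
                status = page->mapping->a_ops->commit_write(file, page,
                                                offset, offset + bytes);
        }
        unlock_page(page);
        __free_page(tmp);
        return status;
}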

>>2. In the case of a !uptodate page and a partial-page write, then
>>we can first bring the page uptodate, then continue (goto 3).
>>
>>3. In the case of an uptodate page, we could perform a full-length
>>commit_write so long as we didn't expand i_size further than was
>>copied, and were sure to trim off blocks allocated past that
>>point.
>>
>>This scheme IMO is not as "nice" as the partial commit patches,
>>but in practical terms it may be much more realistic.
>
>
> It seems more realistic in that it makes sure the write is properly setup
> before calling into the file system. What do you think is not as nice about
> it?

The corner cases. However I guess the current scheme was building up
corner cases as well, so you might be right.


2006-12-13 01:57:24

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Trond Myklebust wrote:
> On Wed, 2006-12-13 at 11:53 +1100, Nick Piggin wrote:
>
>
>>Not silly -- I guess that is the main sticking point. Luckily *most*
>>!uptodate pages will be ones that we have newly allocated so will
>>not be in pagecache yet.
>>
>>If it is in pagecache, we could do one of a number of things: either
>>remove it or try to bring it uptodate ourselves. I'm not yet sure if
>>either of these actions will cause other problems, though :P
>>
>>If both of those are really going to cause problems, then we could
>>solve this in a more brute force way (assuming that !uptodate, locked
>>pages, in pagecache at this point are very rare -- AFAIKS these will
>>only be caused by IO errors?). We could allocate another, temporary
>>page and copy the contents into there first, then into the target
>>page after the prepare_write.
>
>
> We are NOT going to mandate read-modify-write behaviour on
> prepare_write()/commit_write(). That would be a completely unnecessary
> slowdown for write-only workloads on NFS.

Note that these pages should be *really* rare. Definitely even for normal
filesystems I think RMW would use too much bandwidth if it were required
for any significant number of writes.

I don't want to mandate anything just yet, so I'm just going through our
options. The first two options (remove, and RMW) are probably trickier
than they need to be, given the 3rd option available (temp buffer). Given
your input, I'm increasingly thinking that the best course of action would
be to fix this with the temp buffer and look at improving that later if it
causes a noticeable slowdown.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2006-12-13 02:18:41

by Trond Myklebust

Subject: Re: Status of buffered write path (deadlock fixes)

On Wed, 2006-12-13 at 11:53 +1100, Nick Piggin wrote:

> Not silly -- I guess that is the main sticking point. Luckily *most*
> !uptodate pages will be ones that we have newly allocated so will
> not be in pagecache yet.
>
> If it is in pagecache, we could do one of a number of things: either
> remove it or try to bring it uptodate ourselves. I'm not yet sure if
> either of these actions will cause other problems, though :P
>
> If both of those are really going to cause problems, then we could
> solve this in a more brute force way (assuming that !uptodate, locked
> pages, in pagecache at this point are very rare -- AFAIKS these will
> only be caused by IO errors?). We could allocate another, temporary
> page and copy the contents into there first, then into the target
> page after the prepare_write.

We are NOT going to mandate read-modify-write behaviour on
prepare_write()/commit_write(). That would be a completely unnecessary
slowdown for write-only workloads on NFS.

Trond

2006-12-13 02:31:23

by Trond Myklebust

Subject: Re: Status of buffered write path (deadlock fixes)

On Wed, 2006-12-13 at 12:56 +1100, Nick Piggin wrote:
> Note that these pages should be *really* rare. Definitely even for normal
> filesystems I think RMW would use too much bandwidth if it were required
> for any significant number of writes.

If file "foo" exists on the server, and contains data, then something
like

fd = open("foo", O_WRONLY);
write(fd, "1", 1);

should never need to trigger a read. That's a fairly common workload
when you think about it (happens all the time in apps that do random
write).

> I don't want to mandate anything just yet, so I'm just going through our
> options. The first two options (remove, and RMW) are probably trickier
> than they need to be, given the 3rd option available (temp buffer). Given
> your input, I'm increasingly thinking that the best course of action would
> be to fix this with the temp buffer and look at improving that later if it
> causes a noticeable slowdown.

What is the generic problem you are trying to resolve? I saw something
fly by about a reader filling the !uptodate page while the writer is
updating it: how is that going to happen if the writer has the page
locked?
AFAIK the only thing that can modify the page if it is locked (aside
from the process that has locked it) is a process that has the page
mmapped(). However mmapped pages are always uptodate, right?

Trond

2006-12-13 04:04:14

by Nick Piggin

Subject: Re: Status of buffered write path (deadlock fixes)

Trond Myklebust wrote:
> On Wed, 2006-12-13 at 12:56 +1100, Nick Piggin wrote:
>
>>Note that these pages should be *really* rare. Definitely even for normal
>>filesystems I think RMW would use too much bandwidth if it were required
>>for any significant number of writes.
>
>
> If file "foo" exists on the server, and contains data, then something
> like
>
> fd = open("foo", O_WRONLY);
> write(fd, "1", 1);
>
> should never need to trigger a read. That's a fairly common workload
> when you think about it (happens all the time in apps that do random
> write).

Right. What I'm currently looking at doing in that case is two copies,
first into a temporary buffer. Unfortunate, but we'll see what the
performance looks like.

>>I don't want to mandate anything just yet, so I'm just going through our
>>options. The first two options (remove, and RMW) are probably trickier
>>than they need to be, given the 3rd option available (temp buffer). Given
>>your input, I'm increasingly thinking that the best course of action would
>>be to fix this with the temp buffer and look at improving that later if it
>>causes a noticeable slowdown.
>
>
> What is the generic problem you are trying to resolve? I saw something
> fly by about a reader filling the !uptodate page while the writer is
> updating it: how is that going to happen if the writer has the page
> locked?

The problem is that you can't take a pagefault while holding the page
lock. You can deadlock against another page, the same page, or the
mmap_sem.

> AFAIK the only thing that can modify the page if it is locked (aside
> from the process that has locked it) is a process that has the page
> mmapped(). However mmapped pages are always uptodate, right?

That's right (modulo the pagefault vs invalidate race bug).

But we need to unlock the destination page in order to be able to take
a pagefault to bring the source user memory uptodate. If the page is
not uptodate, then a read might see uninitialised data.
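
For the record, a contrived userspace example of the pattern at the heart
of this (mainline pre-faults the source before taking the page lock, so
this only deadlocks if the mapped page gets reclaimed again before the
copy; that window is what we're trying to close):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_RDWR | O_CREAT, 0644);
        char *map;

        if (fd < 0 || ftruncate(fd, 4096) < 0)
                return 1;

        /* Map the file but don't touch it, so the first access faults. */
        map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;

        /*
         * write() locks the target pagecache page, then the usercopy
         * faults on 'map', and that fault wants the very same page lock.
         */
        return write(fd, map, 4096) == 4096 ? 0 : 1;
}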

--
SUSE Labs, Novell Inc.

2006-12-13 12:22:11

by Trond Myklebust

Subject: Re: Status of buffered write path (deadlock fixes)

On Wed, 2006-12-13 at 15:03 +1100, Nick Piggin wrote:
> Trond Myklebust wrote:
> > On Wed, 2006-12-13 at 12:56 +1100, Nick Piggin wrote:
> >
> >>Note that these pages should be *really* rare. Definitely even for normal
> >>filesystems I think RMW would use too much bandwidth if it were required
> >>for any significant number of writes.
> >
> >
> > If file "foo" exists on the server, and contains data, then something
> > like
> >
> > fd = open("foo", O_WRONLY);
> > write(fd, "1", 1);
> >
> > should never need to trigger a read. That's a fairly common workload
> > when you think about it (happens all the time in apps that do random
> > write).
>
> Right. What I'm currently looking at doing in that case is two copies,
> first into a temporary buffer. Unfortunate, but we'll see what the
> performance looks like.

Yech.

> >>I don't want to mandate anything just yet, so I'm just going through our
> >>options. The first two options (remove, and RMW) are probably trickier
> >>than they need to be, given the 3rd option available (temp buffer). Given
> >>your input, I'm increasingly thinking that the best course of action would
> >>be to fix this with the temp buffer and look at improving that later if it
> >>causes a noticeable slowdown.
> >
> >
> > What is the generic problem you are trying to resolve? I saw something
> > fly by about a reader filling the !uptodate page while the writer is
> > updating it: how is that going to happen if the writer has the page
> > locked?
>
> The problem is that you can't take a pagefault while holding the page
> lock. You can deadlock against another page, the same page, or the
> mmap_sem.
>
> > AFAIK the only thing that can modify the page if it is locked (aside
> > from the process that has locked it) is a process that has the page
> > mmapped(). However mmapped pages are always uptodate, right?
>
> That's right (modulo the pagefault vs invalidate race bug).
>
> But we need to unlock the destination page in order to be able to take
> a pagefault to bring the source user memory uptodate. If the page is
> not uptodate, then a read might see uninitialised data.

The NFS read code will automatically flush dirty data to disk if it sees
that the page is marked as dirty or partially dirty. We do this without
losing the page lock thanks to the helper function nfs_wb_page().

BTW: We will do the same thing in our mapping->releasepage() in order to
fix the races you can have between invalidate_inode_pages2() and page
dirtying; however, that call to test_and_clear_page_dirty() screws us
over for the case of mmapped files (although we're quite safe w.r.t.
prepare_write()/commit_write()).

Cheers
Trond

2006-12-13 13:56:15

by Trond Myklebust

Subject: Re: Status of buffered write path (deadlock fixes)

On Wed, 2006-12-13 at 08:49 -0500, Peter Staubach wrote:
> Trond Myklebust wrote:
> > On Wed, 2006-12-13 at 12:56 +1100, Nick Piggin wrote:
> >
> >> Note that these pages should be *really* rare. Definitely even for normal
> >> filesystems I think RMW would use too much bandwidth if it were required
> >> for any significant number of writes.
> >>
> >
> > If file "foo" exists on the server, and contains data, then something
> > like
> >
> > fd = open("foo", O_WRONLY);
> > write(fd, "1", 1);
> >
> > should never need to trigger a read. That's a fairly common workload
> > when you think about it (happens all the time in apps that do random
> > write).
>
> I have to admit that I've only been paying attention with one eye, but
> why doesn't this require a read? If "foo" is non-zero in size, then
> how does the client determine how much data in the buffer to write to
> the server?

That is what the 'struct nfs_page' does. Whenever possible (i.e.
whenever the VM uses prepare_write()/commit_write()), we use that to
track the exact area of the page that was dirtied. That means that we
don't need to care what is on the rest of the page, or whether or not
the page was originally uptodate since we will only flush out the area
of the page that contains data.

Isn't RMW required for any i/o which is either not buffer aligned or not
a multiple of the buffer size?

Nope.

Cheers,
Trond

2006-12-13 14:24:59

by Peter Staubach

Subject: Re: Status of buffered write path (deadlock fixes)

Trond Myklebust wrote:
> On Wed, 2006-12-13 at 12:56 +1100, Nick Piggin wrote:
>
>> Note that these pages should be *really* rare. Definitely even for normal
>> filesystems I think RMW would use too much bandwidth if it were required
>> for any significant number of writes.
>>
>
> If file "foo" exists on the server, and contains data, then something
> like
>
> fd = open("foo", O_WRONLY);
> write(fd, "1", 1);
>
> should never need to trigger a read. That's a fairly common workload
> when you think about it (happens all the time in apps that do random
> write).

I have to admit that I've only been paying attention with one eye, but
why doesn't this require a read? If "foo" is non-zero in size, then
how does the client determine how much data in the buffer to write to
the server?

Isn't RMW required for any i/o which is either not buffer aligned or not
a multiple of the buffer size?

Thanx...

ps