Hi Nick,
On Wed, Oct 18, 2006 at 02:50:21PM -0700, [email protected] wrote:
> Some prepare/commit_write implementations have possible pre-existing bugs,
> possible data leaks (by setting uptodate too early) and data corruption (by
> not reading in non-modified parts of a !uptodate page).
>
> Others are (also) broken when commit_write passes in a 0 length commit with
> a !uptodate page (a change caused by buffered write deadlock fix patch).
>
> Fix filesystems as best we can. GFS2, OCFS2, Reiserfs, JFFS are nontrivial
> and are likely broken. All others at least need a glance.
I would have liked a CC on this patch, considering that you might have just
broken ocfs2 :) I wouldn't even have seen the patch had I not been looking
through my mm-commits mailbox for an unrelated patch.
> commit_write: If prepare_write succeeds, new data will be copied
> - into the page and then commit_write will be called. It will
> - typically update the size of the file (if appropriate) and
> - mark the inode as dirty, and do any other related housekeeping
> - operations. It should avoid returning an error if possible -
> - errors should have been handled by prepare_write.
> + into the page and then commit_write will be called. commit_write may
> + be called with a range that is smaller than that passed in to
> + prepare_write, it could even be zero. If the page is not uptodate,
> + the range will *only* be zero or the full length that was passed to
> + prepare_write, if it is zero, the page should not be marked uptodate
> + (success should still be returned, if possible -- the write will be
> + retried).
Doesn't this scheme have the potential to leave dirty data in holes? If a
file system does its allocation for a hole in the middle of a file (so no
i_size update) in ->prepare_write(), but then in ->commit_write() it's not
allowed to write out the page, or add it to a transaction (in the case of
ext3/ocfs2 ordered writes), we might commit the allocation tree changes
without writing data to the actual disk region that the allocation covers. A
subsequent read would then return whatever junk was on disk.
> diff -puN fs/ext3/inode.c~fs-prepare_write-fixes fs/ext3/inode.c
> --- a/fs/ext3/inode.c~fs-prepare_write-fixes
> +++ a/fs/ext3/inode.c
> @@ -1214,21 +1214,24 @@ static int ext3_ordered_commit_write(str
> struct inode *inode = page->mapping->host;
> int ret = 0, ret2;
>
> - ret = walk_page_buffers(handle, page_buffers(page),
> - from, to, NULL, ext3_journal_dirty_data);
> + if (to - from > 0) {
> + ret = walk_page_buffers(handle, page_buffers(page),
> + from, to, NULL, ext3_journal_dirty_data);
I think this perhaps illustrates my worry. If we don't add all the page
buffers to the transaction which cover a hole that was filled in
->prepare_write(), we'll commit at the bottom of ext3_ordered_commit_write()
without having covered the entire range which we allocated for.
What probably needs to happen is one of two things:
a) The file system rolls back the allocation
b) We somehow write zeros to the part of the region skipped before
->commit_write() returns.
Thanks,
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]
On Wed, Oct 18, 2006 at 06:42:09PM -0700, Mark Fasheh wrote:
> Hi Nick,
>
> On Wed, Oct 18, 2006 at 02:50:21PM -0700, [email protected] wrote:
> > Some prepare/commit_write implementations have possible pre-existing bugs,
> > possible data leaks (by setting uptodate too early) and data corruption (by
> > not reading in non-modified parts of a !uptodate page).
> >
> > Others are (also) broken when commit_write passes in a 0 length commit with
> > a !uptodate page (a change caused by buffered write deadlock fix patch).
> >
> > Fix filesystems as best we can. GFS2, OCFS2, Reiserfs, JFFS are nontrivial
> > and are likely broken. All others at least need a glance.
> I would have liked a CC on this patch, considering that you might have just
> broken ocfs2 :) I wouldn't have even seen the patch had I not been looking
> through my mm-commits mailbox for an unrelated patch.
I sent an RFC to linux-fsdevel, did you get that?
I was planning to cc some maintainers, including you, for those
filesystems that are non-trivial. I just hadn't had a chance to
test it properly last night.
> > commit_write: If prepare_write succeeds, new data will be copied
> > - into the page and then commit_write will be called. It will
> > - typically update the size of the file (if appropriate) and
> > - mark the inode as dirty, and do any other related housekeeping
> > - operations. It should avoid returning an error if possible -
> > - errors should have been handled by prepare_write.
> > + into the page and then commit_write will be called. commit_write may
> > + be called with a range that is smaller than that passed in to
> > + prepare_write, it could even be zero. If the page is not uptodate,
> > + the range will *only* be zero or the full length that was passed to
> > + prepare_write, if it is zero, the page should not be marked uptodate
> > + (success should still be returned, if possible -- the write will be
> > + retried).
> Doesn't this scheme have the potential to leave dirty data in holes? If a
> file system does its allocation for a hole in the middle of a file (so no
> i_size update) in ->prepare_write(), but then in ->commit_write() it's not
> allowed to write out the page, or add it to a transaction (in the case of
> ext3/ocfs2 ordered writes), we might commit the allocation tree changes
> without writing data to the actual disk region that the allocation covers. A
> subsequent read would then return whatever junk was on disk.
That might be the case, I suppose.
If you don't want to track this yourself internally, we could think
about adding a new callback (->abort_write) for the non-committed
region, if that would help.
>
>
> > diff -puN fs/ext3/inode.c~fs-prepare_write-fixes fs/ext3/inode.c
> > --- a/fs/ext3/inode.c~fs-prepare_write-fixes
> > +++ a/fs/ext3/inode.c
> > @@ -1214,21 +1214,24 @@ static int ext3_ordered_commit_write(str
> > struct inode *inode = page->mapping->host;
> > int ret = 0, ret2;
> >
> > - ret = walk_page_buffers(handle, page_buffers(page),
> > - from, to, NULL, ext3_journal_dirty_data);
> > + if (to - from > 0) {
> > + ret = walk_page_buffers(handle, page_buffers(page),
> > + from, to, NULL, ext3_journal_dirty_data);
> I think this perhaps illustrates my worry. If we don't add all the page
> buffers to the transaction which cover a hole that was filled in
> ->prepare_write(), we'll commit at the bottom of ext3_ordered_commit_write()
> without having covered the entire range which we allocated for.
>
>
> What probably needs to happen is one of two things:
>
> a) The file system rolls back the allocation
>
> b) We somehow write zeros to the part of the region skipped before
> ->commit_write() returns.
OK thanks for looking at that. If the length of the commit is greater
than 0 (but still short), then the page is uptodate so it should be
fine to commit what we have written, I think?
If the length is zero, then we probably want to roll back entirely.
On Thu, Oct 19, 2006 at 07:25:37AM +0200, Nick Piggin wrote:
> I sent an RFC to linux-fsdevel, did you get that?
Yeah, I just hadn't thought of these concerns at the time.
> I was planning to cc some maintainers, including you, for those
> filesystems that are non-trivial. I just hadn't had a chance to
> test it properly last night.
Cool, I appreciate that.
> OK thanks for looking at that. If the length of the commit is greater
> than 0 (but still short), then the page is uptodate so it should be
> fine to commit what we have written, I think?
That seems to make sense to me.
> If the length is zero, then we probably want to roll back entirely.
The thing is, rollback might be hard (or expensive) for some file systems
with more complicated tree implementations, etc.
Do we have the option in this case of just zeroing the newly allocated
portions and writing them out? Userspace would then just see it like
any other hole.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]
Mark Fasheh wrote:
> On Thu, Oct 19, 2006 at 07:25:37AM +0200, Nick Piggin wrote:
>
>>I sent an RFC to linux-fsdevel, did you get that?
>
> Yeah, I don't think I thought of my concerns at the time.
>
>
>
>>I was planning to cc some maintainers, including you, for those
>>filesystems that are non-trivial. I just hadn't had a chance to
>>test it properly last night.
>
> Cool, I appreciate that.
OK, I will be posting that mail tomorrow or the day after... I'll summarise
the concerns you've posted in this thread too.
>>OK thanks for looking at that. If the length of the commit is greater
>>than 0 (but still short), then the page is uptodate so it should be
>>fine to commit what we have written, I think?
>
> That seems to make sense to me.
>
>
>
>>If the length is zero, then we probably want to roll back entirely.
>
> The thing is, rollback might be hard (or expensive) for some file systems
> with more complicated tree implementations, etc.
>
> Do we have the option in this case of just zeroing the newly allocated
> portions and writing them out? Userspace would then just see it like
> any other hole.
Sure, that's possible. You could even recognise that it is a hole in your
prepare_write, and zero the page and set it uptodate there.
You probably don't actually want to do that, because it means a double
overwrite in the case of a page-sized, aligned write... but you have a
fairly broad scope of what you can do here. You are holding i_mutex and
the page lock, and the rest is up to you.
Zeroing out the hole and marking it uptodate in the case of a 0-length
->commit_write does sound like the right way to go. I probably haven't
handled that correctly if it needs to be done in ext? or generic fs/
routines...
--
SUSE Labs, Novell Inc.
On Fri, Oct 20, 2006 at 09:27:02AM +1000, Nick Piggin wrote:
> >Cool, I appreciate that.
>
> OK, I will be posting that mail tomorrow or the day after... I'll summarise
> the concerns you've posted in this thread too.
Thanks.
> zeroing out the hole and marking it uptodate in case of a 0 length
> ->commit_write does sound like the right way to go. I probably haven't
> handled that correctly if it needs to be done in ext? or generic fs/
> routines...
I *think* we want to handle this in generic_file_buffered_write(). Between
->prepare_write() and ->commit_write(). Here's what I'm thinking:
generic_file_buffered_write() notices that it got a short copy from
copy_from_user_inatomic(). It can then call a helper which walks the buffer
heads attached to the page looking for BH_New regions which haven't been
written to yet. It can then zero those out. We pass the normal from/to
arguments over to ->commit_write() and those callbacks don't have to change
- they just continue as usual. The newly allocated regions get written out,
filling the holes with valid data, and we avoid returning garbage from disk
on subsequent reads.
Ideas? "Holes" in the design? :)
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]