This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
This patch causes any filesystem with an allocation unit larger than the
filesystem blocksize to leak unzeroed data. During a file extend, the
entire allocation unit is zeroed. However, this patch prevents the tail
blocks of the allocation unit from being written back to disk. When the
file is next extended, i_size will now cover these unzeroed blocks,
leaking the old contents of the disk to userspace and creating a corrupt
file.
This affects ocfs2 directly. As Tao Ma mentioned in his reporting
email:
1. all the place we use filemap_fdatawrite in ocfs2 doesn't flush pages
after i_size now.
2. sync, fsync, fdatasync and umount don't flush pages after i_size(they
are called from writeback_single_inode).
3. reflink have a BUG_ON triggered because we have some dirty pages
while during CoW. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1265
Because this patch breaks ocfs2 file extends, we need to request its
reversion.
Reported-by: Tao Ma <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Joel Becker <[email protected]>
---
mm/page-writeback.c | 15 ---------------
1 files changed, 0 insertions(+), 15 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bbd396a..b3dbb80 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -851,22 +851,7 @@ int write_cache_pages(struct address_space *mapping,
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
-
-		/*
-		 * If this is a data integrity sync, cap the writeback to the
-		 * current end of file. Any extension to the file that occurs
-		 * after this is a new write and we don't need to write those
-		 * pages out to fulfil our data integrity requirements. If we
-		 * try to write them out, we can get stuck in this scan until
-		 * the concurrent writer stops adding dirty pages and extending
-		 * EOF.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL &&
-		    wbc->range_end == LLONG_MAX) {
-			end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
-		}
 	}
-
 retry:
 	done_index = index;
 	while (!done && (index <= end)) {
--
1.7.1
--
"Senator let's be sincere,
As much as you can."
Joel Becker
Consulting Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
Hi Joel,
I have no problems with it being reverted - it's really just a
WAR (workaround) for the simplest case of the sync holdoff.
However, I had no idea that any filesystem relied on being able to
write pages beyond EOF, and I'd like to understand the implications
of it on the higher level code and, more importantly, understand how
the writes are getting to disk through multiple layers of
page-beyond-i_size checks in the writeback code....
> This patch causes any filesystem with an allocation unit larger than the
> filesystem blocksize to leak unzeroed data. During a file extend, the
> entire allocation unit is zeroed.
XFS has this same underlying issue - it can have uninitialised,
allocated blocks past EOF that have to be zeroed when extending the
file.
> However, this patch prevents the tail
> blocks of the allocation unit from being written back to disk. When the
> file is next extended, i_size will now cover these unzeroed blocks,
> leaking the old contents of the disk to userspace and creating a corrupt
> file.
XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
between the old EOF and the new EOF on each extending write. Hence
these pages get written because they fall inside the new i_size that
is set during the write. The i_size on disk doesn't get changed
until after the data writes have completed, so even on a crash we
don't expose uninitialised blocks.
> This affects ocfs2 directly. As Tao Ma mentioned in his reporting
> email:
>
> 1. all the place we use filemap_fdatawrite in ocfs2 doesn't flush pages
> after i_size now.
> 2. sync, fsync, fdatasync and umount don't flush pages after i_size(they
> are called from writeback_single_inode).
I'm not sure this was ever supposed to work - my understanding is
that we should never do anything with pages beyond i_size, as a page
being beyond i_size implies we are racing with a truncate and the
page is no longer valid. In that case, we should definitely not
write it back to disk.
Looking at ocfs2_writepage(), it simply calls
block_write_full_page(), which does:
	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_CACHE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		/*
		 * The page may have dirty, unmapped buffers. For example,
		 * they may have been added in ext3_writepage(). Make them
		 * freeable here, so the page does not leak.
		 */
		do_invalidatepage(page, 0);
		unlock_page(page);
		return 0; /* don't care */
	}
i.e. pages beyond EOF get invalidated. If it somehow gets through
that check, __block_write_full_page() will avoid writing dirty
bufferheads beyond EOF because the write is "racing with truncate".
Hence there are multiple layers of protection against writing past
i_size, so I'm wondering how these pages are even getting to disk in
the first place....
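For readers following along, the index test quoted above can be modelled
in plain userspace C. This is only a sketch under stated assumptions:
classify_page(), the enum and the 4 KiB page size are illustrative names
and values, not kernel API, and the first branch (pages wholly inside
i_size) is assumed from context because it is not part of the excerpt.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_CACHE_SHIFT here. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

enum page_fate {
	PAGE_WRITE_FULL,	/* entirely inside i_size: written as-is */
	PAGE_WRITE_PARTIAL,	/* straddles i_size: tail zeroed, then written */
	PAGE_INVALIDATE		/* entirely outside i_size: dropped, never written */
};

/* Hypothetical helper mirroring the checks in the quoted excerpt. */
static enum page_fate classify_page(uint64_t i_size, uint64_t page_index)
{
	uint64_t end_index = i_size >> MODEL_PAGE_SHIFT;
	uint64_t offset = i_size & (MODEL_PAGE_SIZE - 1);

	/* Assumed branch, not shown in the excerpt above. */
	if (page_index < end_index)
		return PAGE_WRITE_FULL;

	/* Same condition as the quoted "fully outside i_size" test. */
	if (page_index >= end_index + 1 || !offset)
		return PAGE_INVALIDATE;

	return PAGE_WRITE_PARTIAL;
}
```

With i_size = 5000, page 0 is written in full, page 1 is written with
its tail zeroed, and page 2 onwards is invalidated, which is the
behaviour that would discard a zeroed cluster tail.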
> 3. reflink have a BUG_ON triggered because we have some dirty pages
> while during CoW. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1265
I'd suggest that the reason you see the BUG_ON() with this patch is
that the pages beyond EOF are not being invalidated because they are
not being passed to ->writepage and hence are remaining dirty in the
cache. IOWs, I suspect that this commit has uncovered a bug in
ocfs2, not that it has caused a regression.
Your thoughts, Joel?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
> On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
>
> Hi Joel,
>
> I have no problems with it being reverted - it's really just a
> WAR (workaround) for the simplest case of the sync holdoff.
I have to insist that we revert it until we find a way to make
ocfs2 work. The rest of the email will discuss the ocfs2 issues
therein.
> > This patch causes any filesystem with an allocation unit larger than the
> > filesystem blocksize to leak unzeroed data. During a file extend, the
> > entire allocation unit is zeroed.
>
> XFS has this same underlying issue - it can have uninitialised,
> allocated blocks past EOF that have to be zeroed when extending the
> file.
Does XFS do this in get_blocks()? We deliberately do no
allocation in get_blocks(), which is where our need for up-front
allocation comes from.
> > However, this patch prevents the tail
> > blocks of the allocation unit from being written back to disk. When the
> > file is next extended, i_size will now cover these unzeroed blocks,
> > leaking the old contents of the disk to userspace and creating a corrupt
> > file.
>
> XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
> between the old EOF and the new EOF on each extending write. Hence
> these pages get written because they fall inside the new i_size that
> is set during the write. The i_size on disk doesn't get changed
> until after the data writes have completed, so even on a crash we
> don't expose uninitialised blocks.
We do the same, but we zero the entire allocation. This works
both when filling holes and when extending, though obviously the
extending is what we're worried about here. We change i_size in
write_end, so our guarantee is the same as yours for the page containing
i_size.
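To make the size of the exposure concrete, here is a small userspace
sketch that counts how many whole blocks of the last allocation unit sit
past the block containing the end of the file. The helper name and its
parameters are hypothetical, and it assumes the cluster size is a
multiple of the block size; it is not taken from the ocfs2 source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of whole blocks that are allocated in the
 * last cluster but lie past the block holding the last byte of the file.
 * Assumes i_size > 0 and cluster_size is a multiple of block_size.
 */
static uint64_t tail_blocks_past_eof(uint64_t i_size,
				     uint32_t block_size,
				     uint32_t cluster_size)
{
	assert(i_size > 0);
	assert(block_size > 0 && cluster_size >= block_size);
	assert(cluster_size % block_size == 0);

	uint64_t blocks_per_cluster = cluster_size / block_size;
	/* Index of the block holding the last byte of the file. */
	uint64_t last_blk = (i_size - 1) / block_size;
	/* Last block of the cluster that contains last_blk. */
	uint64_t cluster_last_blk =
		(last_blk / blocks_per_cluster + 1) * blocks_per_cluster - 1;

	return cluster_last_blk - last_blk;
}
```

For example, a 1-byte file on a volume with 4 KiB blocks and 1 MiB
clusters leaves 255 allocated blocks past the block that holds EOF.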
> Looking at ocfs2_writepage(), it simply calls
> block_write_full_page(), which does:
>
> 	/* Is the page fully outside i_size? (truncate in progress) */
> 	offset = i_size & (PAGE_CACHE_SIZE-1);
> 	if (page->index >= end_index+1 || !offset) {
> 		/*
> 		 * The page may have dirty, unmapped buffers. For example,
> 		 * they may have been added in ext3_writepage(). Make them
> 		 * freeable here, so the page does not leak.
> 		 */
> 		do_invalidatepage(page, 0);
> 		unlock_page(page);
> 		return 0; /* don't care */
> 	}
>
> i.e. pages beyond EOF get invalidated. If it somehow gets through
> that check, __block_write_full_page() will avoid writing dirty
> bufferheads beyond EOF because the write is "racing with truncate".
Your contention is that we've never gotten those tail blocks to
disk. Instead, our code either handles the future extensions of i_size
or we've just gotten lucky with our testing. Our current BUG trigger is
because we have a new check that catches this case. Does that summarize
your position correctly?
I'm not averse to having a zero-only-till-i_size policy, but I
know we've visited this problem before and got bit. I have to go reload
that context.
Regarding XFS, how do you handle catching the tail of an
allocation with an lseek(2)'d write? That is, your current allocation
has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
and write there. The code has to recognize to zero around old_i_size
before moving out to new_i_size, right? I think that's where our old
approaches had problems.
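The userspace-visible contract behind this question can be checked with
a short POSIX program fragment. It is a sketch only: check_gap_is_zero()
and the /tmp path template are assumptions made for illustration, and a
passing result on one filesystem says nothing about stale on-disk tail
blocks elsewhere. It writes a few bytes, seeks past EOF, writes again,
and verifies that the gap reads back as zeros.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 0 if every byte between the old EOF and the start of the
 * seek-extended write reads back as zero, 1 if non-zero data is
 * visible, and -1 on any I/O error. gap is the distance seeked past
 * the old EOF.
 */
static int check_gap_is_zero(off_t gap)
{
	char path[] = "/tmp/eof-gap-XXXXXX";	/* assumed scratch location */
	const char head[] = "head";
	const char tail[] = "tail";
	char buf[4096];
	int result = 0;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the file goes away once fd is closed */

	off_t old_eof = (off_t)strlen(head);
	if (write(fd, head, strlen(head)) != (ssize_t)strlen(head) ||
	    lseek(fd, old_eof + gap, SEEK_SET) < 0 ||
	    write(fd, tail, strlen(tail)) != (ssize_t)strlen(tail) ||
	    lseek(fd, old_eof, SEEK_SET) < 0) {
		close(fd);
		return -1;
	}

	off_t left = gap;
	while (left > 0 && result == 0) {
		size_t want = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
		ssize_t got = read(fd, buf, want);

		if (got <= 0) {
			result = -1;
			break;
		}
		for (ssize_t i = 0; i < got; i++) {
			if (buf[i] != 0) {
				result = 1;
				break;
			}
		}
		left -= got;
	}
	close(fd);
	return result;
}
```

Calling check_gap_is_zero((off_t)1 << 20) exercises a 1 MiB gap; a
return value of 1 would mean old disk contents are visible to userspace.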
Joel
On Mon, Jun 28, 2010 at 5:54 PM, Joel Becker <[email protected]> wrote:
> On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
>>
>> Looking at ocfs2_writepage(), it simply calls
>> block_write_full_page(), which does:
>>
>> 	/* Is the page fully outside i_size? (truncate in progress) */
>> 	offset = i_size & (PAGE_CACHE_SIZE-1);
>> 	if (page->index >= end_index+1 || !offset) {
>> 		/*
>> 		 * The page may have dirty, unmapped buffers. For example,
>> 		 * they may have been added in ext3_writepage(). Make them
>> 		 * freeable here, so the page does not leak.
>> 		 */
>> 		do_invalidatepage(page, 0);
>> 		unlock_page(page);
>> 		return 0; /* don't care */
>> 	}
>>
>> i.e. pages beyond EOF get invalidated. If it somehow gets through
>> that check, __block_write_full_page() will avoid writing dirty
>> bufferheads beyond EOF because the write is "racing with truncate".
>
> Your contention is that we've never gotten those tail blocks to
> disk. Instead, our code either handles the future extensions of i_size
> or we've just gotten lucky with our testing. Our current BUG trigger is
> because we have a new check that catches this case. Does that summarize
> your position correctly?
Maybe Dave has some more exhaustive answer, but his point that
block_write_full_page() already just drops the page does seem to be
very valid. Which makes me suspect that it would be better to remove
the ocfs2 BUG_ON() as a stop-gap measure, rather than reverting the
commit. It seems to be true that the "don't bother flushing past EOF"
commit really just uncovered an older bug.
So maybe ocfs2 should just replace the bug-on with invalidating the
page (perhaps with a WARN_ONCE() to make sure the problem doesn't get
forgotten about?)
Linus
On Mon, Jun 28, 2010 at 05:54:04PM -0700, Joel Becker wrote:
> On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
> > On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> > > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
> >
> > Hi Joel,
> >
> > > I have no problems with it being reverted - it's really just a
> > > WAR (workaround) for the simplest case of the sync holdoff.
>
> I have to insist that we revert it until we find a way to make
> ocfs2 work. The rest of the email will discuss the ocfs2 issues
> therein.
>
> > > This patch causes any filesystem with an allocation unit larger than the
> > > filesystem blocksize to leak unzeroed data. During a file extend, the
> > > entire allocation unit is zeroed.
> >
> > XFS has this same underlying issue - it can have uninitialised,
> > allocated blocks past EOF that have to be zeroed when extending the
> > file.
>
> Does XFS do this in get_blocks()? We deliberately do no
> allocation in get_blocks(), which is where our need for up-front
> allocation comes from.
No, it does it in xfs_file_aio_write() (i.e. ->aio_write()) so it
catches both buffered and direct IO. This can't be done in the
get_blocks() callback because (IMO) there really isn't the context
available to know exactly how we are extending the file in
get_blocks().
> > > However, this patch prevents the tail
> > > blocks of the allocation unit from being written back to disk. When the
> > > file is next extended, i_size will now cover these unzeroed blocks,
> > > leaking the old contents of the disk to userspace and creating a corrupt
> > > file.
> >
> > XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
> > between the old EOF and the new EOF on each extending write. Hence
> > these pages get written because they fall inside the new i_size that
> > is set during the write. The i_size on disk doesn't get changed
> > until after the data writes have completed, so even on a crash we
> > don't expose uninitialised blocks.
>
> We do the same, but we zero the entire allocation. This works
> both when filling holes and when extending, though obviously the
> extending is what we're worried about here. We change i_size in
> write_end, so our guarantee is the same as yours for the page containing
> i_size.
Ok, so you've got cached pages covering the file and the tail of
the last page/block zeroed in memory. I'd guess that ordered mode
journalling then ensures the inode size update doesn't hit the disk
until after the data does, so this is crash-safe. That would explain
(to me, at least) why you are not seeing stale data exposure on
crashes.
> > Looking at ocfs2_writepage(), it simply calls
> > block_write_full_page(), which does:
> >
> > 	/* Is the page fully outside i_size? (truncate in progress) */
> > 	offset = i_size & (PAGE_CACHE_SIZE-1);
> > 	if (page->index >= end_index+1 || !offset) {
> > 		/*
> > 		 * The page may have dirty, unmapped buffers. For example,
> > 		 * they may have been added in ext3_writepage(). Make them
> > 		 * freeable here, so the page does not leak.
> > 		 */
> > 		do_invalidatepage(page, 0);
> > 		unlock_page(page);
> > 		return 0; /* don't care */
> > 	}
> >
> > i.e. pages beyond EOF get invalidated. If it somehow gets through
> > that check, __block_write_full_page() will avoid writing dirty
> > bufferheads beyond EOF because the write is "racing with truncate".
>
> Your contention is that we've never gotten those tail blocks to
> disk. Instead, our code either handles the future extensions of i_size
> or we've just gotten lucky with our testing. Our current BUG trigger is
> because we have a new check that catches this case. Does that summarize
> your position correctly?
Yes, that summarises it pretty well ;)
> I'm not averse to having a zero-only-till-i_size policy, but I
> know we've visited this problem before and got bit. I have to go reload
> that context.
There's no hurry from my perspective - I just prefer to understand
the root cause of a problem before jumping....
> Regarding XFS, how do you handle catching the tail of an
> allocation with an lseek(2)'d write? That is, your current allocation
> has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
> and write there. The code has to recognize to zero around old_i_size
> before moving out to new_i_size, right? I think that's where our old
> approaches had problems.
xfs_file_aio_write() handles both those cases for us via
xfs_zero_eof(). What it does is map the region from the old EOF to
the start of the new write and zeroes any allocated blocks that are
not marked unwritten that lie within the range. It does this via the
internal mapping interface because we hide allocated blocks past EOF
from the page cache and higher layers.
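The selection rule described here (zero only blocks that are allocated
and not marked unwritten, within the range from the old EOF to the start
of the new write) can be sketched as a toy userspace model. The extent
structure and function below are hypothetical and work in whole blocks;
they are not the XFS mapping interface.

```c
#include <stddef.h>
#include <stdint.h>

enum ext_state { EXT_HOLE, EXT_UNWRITTEN, EXT_ALLOCATED };

/* Hypothetical, simplified extent record for illustration only. */
struct model_extent {
	uint64_t start;		/* first block of the extent */
	uint64_t len;		/* length in blocks */
	enum ext_state state;
};

/*
 * Count the blocks in [from_blk, to_blk) that would need explicit
 * zeroing: allocated and not flagged unwritten. Holes and unwritten
 * extents are skipped because they already read back as zeros.
 */
static uint64_t blocks_needing_zero(const struct model_extent *map, size_t n,
				    uint64_t from_blk, uint64_t to_blk)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t lo = map[i].start;
		uint64_t hi = map[i].start + map[i].len;

		if (map[i].state != EXT_ALLOCATED)
			continue;
		/* Clamp the extent to the range being examined. */
		if (lo < from_blk)
			lo = from_blk;
		if (hi > to_blk)
			hi = to_blk;
		if (lo < hi)
			total += hi - lo;
	}
	return total;
}
```

In this model a write that lands a gigabyte past EOF still only pays for
the few allocated blocks just after the old EOF, since everything else
in the range is a hole.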
FWIW, the way XFS does this is safe against crashes because the
inode size does not get updated on disk or in the journal until
after the data has hit the disk. Ordered journalling should also
provide this guarantee, I think.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jun 28, 2010 at 06:12:35PM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2010 at 5:54 PM, Joel Becker <[email protected]> wrote:
> > Your contention is that we've never gotten those tail blocks to
> > disk. Instead, our code either handles the future extensions of i_size
> > or we've just gotten lucky with our testing. Our current BUG trigger is
> > because we have a new check that catches this case. Does that summarize
> > your position correctly?
>
> Maybe Dave has some more exhaustive answer, but his point that
> block_write_full_page() already just drops the page does seem to be
> very valid. Which makes me suspect that it would be better to remove
> the ocfs2 BUG_ON() as a stop-gap measure, rather than reverting the
> commit. It seems to be true that the "don't bother flushing past EOF"
> commit really just uncovered an older bug.
Well, shit. Something has changed in here, or we're really
really (un)lucky. We visited this code a year ago or so when we had
serious zeroing problems, and we tested the hell out of it. Now it is
broken again. And it sure looks like that block_write_full_page() check
has been there since before git.
> So maybe ocfs2 should just replace the bug-on with invalidating the
> page (perhaps with a WARN_ONCE() to make sure the problem doesn't get
> forgotten about?)
Oh, no, that's not it at all. This is a disaster. I can't see
for the life of me why we haven't had 100,000 bug reports. You're going
to have an ocfs2 patch by the end of the week. It will be ugly, I'm
sure of it, but it has to be done. For every extend, we're going to
have to zero and potentially CoW around old_i_size if the old allocation
isn't within the bounds of the current write.
Joel
On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote:
> > Regarding XFS, how do you handle catching the tail of an
> > allocation with an lseek(2)'d write? That is, your current allocation
> > has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
> > and write there. The code has to recognize to zero around old_i_size
> > before moving out to new_i_size, right? I think that's where our old
> > approaches had problems.
>
> xfs_file_aio_write() handles both those cases for us via
> xfs_zero_eof(). What it does is map the region from the old EOF to
> the start of the new write and zeroes any allocated blocks that are
> not marked unwritten that lie within the range. It does this via the
> internal mapping interface because we hide allocated blocks past EOF
> from the page cache and higher layers.
Makes sense as an approach. We deliberately do this through the
page cache to take advantage of its I/O patterns and tie in with JBD2.
Also, we don't feel like maintaining an entire shadow page cache ;-)
Joel
On Mon, Jun 28, 2010 at 6:58 PM, Joel Becker <[email protected]> wrote:
>
> Well, shit. Something has changed in here, or we're really
> really (un)lucky. We visited this code a year ago or so when we had
> serious zeroing problems, and we tested the hell out of it. Now it is
> broken again. And it sure looks like that block_write_full_page() check
> has been there since before git.
Hmm. I'm actually starting to worry that we should do the revert after all.
Why? Locking. That page-writeback.c thing decides to limit the end to
the inode size the same way that block_write_full_page() does, but
block_write_full_page() holds the page lock, while page-writeback.c
does not. Which means that as a race against somebody else doing a
truncate(), the two things really are pretty different.
That said, write_cache_pages() obviously doesn't actually invalidate
the page (the way block_write_full_page() does), so locking matters a
whole lot less for it. If somebody is doing a concurrent truncate or a
concurrent write, then for the data to really show up reliably on disk
there would obviously have to be a separate sync operation involved,
so even with the lack of any locking, it should be safe.
I dunno. Filesystem corruption makes me nervous. So I'm certainly
totally willing to do the revert if that makes ocfs2 work again. Even
if "work again" happens to be partly by mistake, and for some reason
that isn't obvious.
Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer to
have an -rc4 (in a few days - not today) with just the revert, I'm ok
with that. Even if it's only a "at least no worse than 2.6.34"
situation rather than a real fix.
Linus
On Mon, Jun 28, 2010 at 07:04:20PM -0700, Joel Becker wrote:
> On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote:
> > > Regarding XFS, how do you handle catching the tail of an
> > > allocation with an lseek(2)'d write? That is, your current allocation
> > > has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
> > > and write there. The code has to recognize to zero around old_i_size
> > > before moving out to new_i_size, right? I think that's where our old
> > > approaches had problems.
> >
> > xfs_file_aio_write() handles both those cases for us via
> > xfs_zero_eof(). What it does is map the region from the old EOF to
> > the start of the new write and zeroes any allocated blocks that are
> > not marked unwritten that lie within the range. It does this via the
> > internal mapping interface because we hide allocated blocks past EOF
> > from the page cache and higher layers.
>
> Makes sense as an approach. We deliberately do this through the
> page cache to take advantage of its I/O patterns and tie in with JBD2.
> Also, we don't feel like maintaining an entire shadow page cache ;-)
Just to clarify any possible misunderstanding here, xfs_zero_eof()
also does its IO through the page cache for similar reasons. It's
just that the mappings are found via the internal interfaces before
the zeroing is done via the anonymous pagecache_write_begin()/
pagecache_write_end() functions (in xfs_iozero()) rather than using
the generic block functions.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2010 at 6:58 PM, Joel Becker <[email protected]> wrote:
> >
> > Well, shit. Something has changed in here, or we're really
> > really (un)lucky. We visited this code a year ago or so when we had
> > serious zeroing problems, and we tested the hell out of it. Now it is
> > broken again. And it sure looks like that block_write_full_page() check
> > has been there since before git.
>
> Hmm. I'm actually starting to worry that we should do the revert after all.
>
> Why? Locking. That page-writeback.c thing decides to limit the end to
> the inode size the same way that block_write_full_page() does, but
> block_write_full_page() holds the page lock, while page-writeback.c
> does not. Which means that as a race against somebody else doing a
> truncate(), the two things really are pretty different.
>
> That said, write_cache_pages() obviously doesn't actually invalidate
> the page (the way block_write_full_page() does), so locking matters a
> whole lot less for it. If somebody is doing a concurrent truncate or a
> concurrent write, then for the data to really show up reliably on disk
> there would obviously have to be a separate sync operation involved,
> so even with the lack of any locking, it should be safe.
Yes, that is the premise on which the "stop @ EOF" code in
write_cache_pages() is based. It's simply a snapshot of the EOF when
the data integrity sync starts and as such any subsequent extensions
to it that happen after the sync started are not something we have
to worry about for this sync operation.
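The snapshot behaviour can be modelled in userspace as follows. This is
a sketch, not the kernel function: writeback_end_index(), the enum and
the 4 KiB page shift are illustrative assumptions, and range_end is
assumed to be non-negative.

```c
#include <limits.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_CACHE_SHIFT here. */
#define MODEL_PAGE_SHIFT 12

enum model_sync_mode { MODEL_SYNC_NONE, MODEL_SYNC_ALL };

/*
 * Hypothetical helper: last page index that a writeback pass will scan.
 * For a whole-file data integrity sync the scan is capped at the page
 * containing the i_size sampled when the sync started.
 */
static uint64_t writeback_end_index(enum model_sync_mode mode,
				    long long range_end,
				    uint64_t i_size_at_start)
{
	uint64_t end = (uint64_t)range_end >> MODEL_PAGE_SHIFT;

	if (mode == MODEL_SYNC_ALL && range_end == LLONG_MAX)
		end = i_size_at_start >> MODEL_PAGE_SHIFT;

	return end;
}
```

With i_size = 10000 at the start of the sync, the scan stops at page
index 2, so a dirty page at index 5 (for example, part of a zeroed
allocation tail) is never handed to ->writepage by this pass.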
OTOH, if there is a concurrent truncation while the loop is
operating, then the existing checks for truncation after locking the
page _must_ be sufficient to avoid writeback of such truncated pages,
otherwise truncate would already be broken.
> I dunno. Filesystem corruption makes me nervous.
You're not alone in that feeling. :/
FWIW, it's taken us quite a long while (years) to iron out all of
the known sync+crash bugs in XFS, and as a result of the process we
have a fair number of regression tests that tell us quickly when
sync has been broken. These tests haven't indicated any problems
with XFS, so I'm fairly confident the change is safe.
That said, ....
> So I'm certainly
> totally willing to do the revert if that makes ocfs2 work again. Even
> if "work again" happens to be partly by mistake, and for some reason
> that isn't obvious.
>
> Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer to
> have an -rc4 (in a few days - not today) with just the revert, I'm ok
> with that. Even if it's only a "at least no worse than 2.6.34"
> situation rather than a real fix.
... I agree with this.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jun 29, 2010 at 12:27:57PM +1000, Dave Chinner wrote:
> Just to clarify any possible misunderstanding here, xfs_zero_eof()
> also does its IO through the page cache for similar reasons. It's
> just the mappings are found via the internal interfaces before the
> zeroing is done via the anonymous pagecache_write_begin()/
> pagecache_write_end() functions (in xfs_iozero()) rather than using
> the generic block functions.
Mark and I discussed this some earlier this evening. I think we
might be able to get away cheaper than XFS does. In
ocfs2_write_begin_nolock(), we call ocfs2_expand_nonsparse_inode() in
the case of older filesystems that don't allow sparse files. That's
where we handle the zeroing from old_i_size to pos for those files.
In the exact same place, we could probably just detect that we're
about to cover the unzeroed allocation and zero it there. This needs
some more code eval until we're sure, and then the serious testing
happens.
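The boundary arithmetic such a check needs can be sketched in userspace.
It follows the tail_blkno/pos_blkno calculation in the
ocfs2_zero_tail_prepare() hunk posted later in this thread; the helper
name and signature here are hypothetical. Only blocks in this range that
actually have an allocation behind them would need zeroing.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of whole blocks strictly between the
 * block containing i_size and the block containing pos. The block
 * holding i_size is treated as already zeroed, and the block holding
 * pos is handled by the write itself.
 */
static uint64_t blocks_between_eof_and_pos(uint64_t i_size, uint64_t pos,
					   unsigned int blocksize_bits)
{
	/* First block after the one containing i_size. */
	uint64_t tail_blkno = (i_size >> blocksize_bits) + 1;
	/* Block in which the new write (or truncate point) starts. */
	uint64_t pos_blkno = pos >> blocksize_bits;

	if (pos_blkno <= tail_blkno)
		return 0;
	return pos_blkno - tail_blkno;
}
```

With 4 KiB blocks, i_size = 5000 and a write at offset 16384, the
candidate range is blocks 2 and 3; a write at offset 9000 lands in
block 2 and leaves nothing in between.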
Joel
On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
> I dunno. Filesystem corruption makes me nervous. So I'm certainly
> totally willing to do the revert if that makes ocfs2 work again. Even
> if "work again" happens to be partly by mistake, and for some reason
> that isn't obvious.
Filesystem corruption makes me more than nervous. I'm quite
devastated by this.
> Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer to
> have an -rc4 (in a few days - not today) with just the revert, I'm ok
> with that. Even if it's only a "at least no worse than 2.6.34"
> situation rather than a real fix.
I've checked both before this patch and with the patch reverted.
We corrupt in both cases. The problem is our assumption about zeroing
past i_size. The revert will fix our BUG_ON, but not the corruption.
Mark and I have ideas on how to fix the actual bug, but they
will take some time and especially testing. We also have some
shorter-term ideas on how to paper over the issue. We have to have
this fixed by .35.
If -rc4 isn't coming for a couple of days, can we hold off on
the decision until we get a chance to think about a paper-over solution
for it? Then we can avoid the revert.
Joel
On Tue, Jun 29, 2010 at 01:16:11AM -0700, Joel Becker wrote:
> On Mon, Jun 28, 2010 at 07:20:33PM -0700, Linus Torvalds wrote:
> > Your call, I guess. If any ocfs2 fix looks scary, and you'd prefer to
> > have an -rc4 (in a few days - not today) with just the revert, I'm ok
> > with that. Even if it's only a "at least no worse than 2.6.34"
> > situation rather than a real fix.
>
> If -rc4 isn't coming for a couple of days, can we hold off on
> the decision until we get a chance to think about a paper-over solution
> for it? Then we can avoid the revert.
Linus,
I'm going to withdraw the revert request for now. Our proposed
paper-over solution is too big, and we're just going to focus on the
actual fix. This will be for .35. Yes, .35-rc will have a BUG_ON with
refcount trees until we get the fix in, but I'd rather avoid the churn
when the final .35 should have Dave's patch and our fix.
Joel
Linus et al,
Here's the first patch for the problem. This is the corruption
fix. It changes ocfs2 to expect that blocks past i_size will not be
zeroed; ocfs2 now zeros them when i_size expands to encompass them.
This has been tested with various ocfs2 configurations. My test script
was sent as a separate email to ocfs2-devel.
There is still one more patch to come. ocfs2 still tries to
zero entire clusters as it allocates them. Any extra pages past i_size
remain dirty but untouched by writeback. When combined with Dave's
patch, this will still trigger the BUG_ON() in CoW. My next job is to
stop zeroing the pages past i_size when we allocate clusters.
The combination of both patches is the complete fix. Linus, I
intend to send it through the fixes branch of ocfs2.git when I'm done.
I want to get some of our generic test workloads going once the second
patch is written. It will definitely be 2.6.35-rc; I don't want 2.6.35
going out with this problem.
Joel
-----------------------------
ocfs2's allocation unit is the cluster. This can be larger than a block
or even a memory page. This means that a file may have many blocks in
its last extent that are beyond the block containing i_size.
When ocfs2 grows a file, it zeros the entire cluster in order to ensure
future i_size growth will see cleared blocks. Unfortunately,
block_write_full_page() drops the pages past i_size. This means that
ocfs2 is actually leaking garbage data into the tail end of that last
cluster.
We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
when a write or truncate is past i_size. If there is any existing
allocation between the block containing the current i_size and the
location of the write or truncate, zeros will be written to that
allocation.
This is only for sparse filesystems. Non-sparse filesystems already get
this via ocfs2_extend_no_holes().
Signed-off-by: Joel Becker <[email protected]>
---
fs/ocfs2/aops.c | 22 ++++----
fs/ocfs2/file.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++-------
fs/ocfs2/file.h | 2 +
3 files changed, 141 insertions(+), 28 deletions(-)
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 3623ca2..96e6aeb 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -196,15 +196,14 @@ int ocfs2_get_block(struct inode *inode, sector_t iblock,
dump_stack();
goto bail;
}
-
- past_eof = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
- mlog(0, "Inode %lu, past_eof = %llu\n", inode->i_ino,
- (unsigned long long)past_eof);
-
- if (create && (iblock >= past_eof))
- set_buffer_new(bh_result);
}
+ past_eof = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
+ mlog(0, "Inode %lu, past_eof = %llu\n", inode->i_ino,
+ (unsigned long long)past_eof);
+ if (create && (iblock >= past_eof))
+ set_buffer_new(bh_result);
+
bail:
if (err < 0)
err = -EIO;
@@ -1625,11 +1624,9 @@ static int ocfs2_expand_nonsparse_inode(struct inode *inode, loff_t pos,
struct ocfs2_write_ctxt *wc)
{
int ret;
- struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
loff_t newsize = pos + len;
- if (ocfs2_sparse_alloc(osb))
- return 0;
+ BUG_ON(ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
if (newsize <= i_size_read(inode))
return 0;
@@ -1679,7 +1676,10 @@ int ocfs2_write_begin_nolock(struct address_space *mapping,
}
}
- ret = ocfs2_expand_nonsparse_inode(inode, pos, len, wc);
+ if (ocfs2_sparse_alloc(osb))
+ ret = ocfs2_zero_tail(inode, di_bh, pos);
+ else
+ ret = ocfs2_expand_nonsparse_inode(inode, pos, len, wc);
if (ret) {
mlog_errno(ret);
goto out;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6a13ea6..a64ec02 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -848,6 +848,128 @@ out:
return ret;
}
+/*
+ * This function is a helper for ocfs2_zero_tail(). It calculates
+ * what blocks need zeroing and does any CoW necessary.
+ */
+static int ocfs2_zero_tail_prepare(struct inode *inode,
+ struct buffer_head *di_bh,
+ loff_t pos, u64 *start_blkno,
+ u64 *blocks)
+{
+ int rc = 0;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ u32 tail_cpos, pos_cpos, p_cpos;
+ u64 tail_blkno, pos_blkno, blocks_to_zero;
+ unsigned int num_clusters = 0;
+ unsigned int ext_flags = 0;
+
+ /*
+ * The block containing i_size has already been zeroed, so our tail
+ * block is the first block after i_size. The block containing
+ * pos will be zeroed. So we only need to do anything if
+ * tail_blkno is before pos_blkno.
+ */
+ tail_blkno = (i_size_read(inode) >> inode->i_sb->s_blocksize_bits) + 1;
+ pos_blkno = pos >> inode->i_sb->s_blocksize_bits;
+ mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
+ (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
+ if (pos_blkno <= tail_blkno)
+ goto out;
+ blocks_to_zero = pos_blkno - tail_blkno;
+
+ /*
+ * If tail_blkno is in the cluster past i_size, we don't need
+ * to touch the cluster containing i_size at all.
+ */
+ tail_cpos = i_size_read(inode) >> osb->s_clustersize_bits;
+ if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno) > tail_cpos)
+ tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
+ tail_blkno);
+
+ rc = ocfs2_get_clusters(inode, tail_cpos, &p_cpos, &num_clusters,
+ &ext_flags);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+ /* Are we off the end of the allocation? */
+ if (!p_cpos) {
+ BUG_ON(tail_cpos <=
+ (i_size_read(inode) >> osb->s_clustersize_bits));
+ goto out;
+ }
+
+ pos_cpos = pos >> osb->s_clustersize_bits;
+ if ((tail_cpos + num_clusters) >= pos_cpos) {
+ num_clusters = pos_cpos - tail_cpos;
+ if (pos_blkno >
+ ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
+ num_clusters += 1;
+ } else {
+ blocks_to_zero =
+ ocfs2_clusters_to_blocks(inode->i_sb,
+ tail_cpos + num_clusters);
+ blocks_to_zero -= tail_blkno;
+ }
+
+ /* Now CoW the clusters we're about to zero */
+ if (ext_flags & OCFS2_EXT_REFCOUNTED) {
+ rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
+ num_clusters, UINT_MAX);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+ }
+
+ *start_blkno = tail_blkno;
+ *blocks = blocks_to_zero;
+ mlog(0, "start_blkno = %llu, blocks = %llu\n",
+ (unsigned long long)(*start_blkno),
+ (unsigned long long)(*blocks));
+
+out:
+ return rc;
+}
+
+/*
+ * This function only does work for sparse filesystems.
+ * ocfs2_extend_no_holes() will do the same work for non-sparse files.
+ *
+ * If the last extent of the file has blocks beyond i_size, we must zero
+ * them before we can grow i_size to cover them. Specifically, any
+ * allocation between the block containing the current i_size and the block
+ * containing pos must be zeroed.
+ */
+int ocfs2_zero_tail(struct inode *inode, struct buffer_head *di_bh,
+ loff_t pos)
+{
+ int rc = 0;
+ u64 tail_blkno = 0, blocks_to_zero = 0;
+
+ BUG_ON(!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
+
+ rc = ocfs2_zero_tail_prepare(inode, di_bh, pos, &tail_blkno,
+ &blocks_to_zero);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+
+ if (!blocks_to_zero)
+ goto out;
+
+ rc = ocfs2_zero_extend(inode,
+ (tail_blkno + blocks_to_zero) <<
+ inode->i_sb->s_blocksize_bits);
+ if (rc)
+ mlog_errno(rc);
+
+out:
+ return rc;
+}
+
static int ocfs2_extend_file(struct inode *inode,
struct buffer_head *di_bh,
u64 new_i_size)
@@ -862,27 +984,15 @@ static int ocfs2_extend_file(struct inode *inode,
goto out;
if (i_size_read(inode) == new_i_size)
- goto out;
+ goto out;
BUG_ON(new_i_size < i_size_read(inode));
/*
- * Fall through for converting inline data, even if the fs
- * supports sparse files.
- *
- * The check for inline data here is legal - nobody can add
- * the feature since we have i_mutex. We must check it again
- * after acquiring ip_alloc_sem though, as paths like mmap
- * might have raced us to converting the inode to extents.
- */
- if (!(oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
- && ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
- goto out_update_size;
-
- /*
* The alloc sem blocks people in read/write from reading our
* allocation until we're done changing it. We depend on
* i_mutex to block other extend/truncate calls while we're
- * here.
+ * here. We even have to hold it for sparse files because there
+ * might be some tail zeroing.
*/
down_write(&oi->ip_alloc_sem);
@@ -899,13 +1009,14 @@ static int ocfs2_extend_file(struct inode *inode,
ret = ocfs2_convert_inline_data_to_extents(inode, di_bh);
if (ret) {
up_write(&oi->ip_alloc_sem);
-
mlog_errno(ret);
goto out;
}
}
- if (!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
+ if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
+ ret = ocfs2_zero_tail(inode, di_bh, new_i_size);
+ else
ret = ocfs2_extend_no_holes(inode, new_i_size, new_i_size);
up_write(&oi->ip_alloc_sem);
diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
index d66cf4f..7493d97 100644
--- a/fs/ocfs2/file.h
+++ b/fs/ocfs2/file.h
@@ -56,6 +56,8 @@ int ocfs2_simple_size_update(struct inode *inode,
u64 new_i_size);
int ocfs2_extend_no_holes(struct inode *inode, u64 new_i_size,
u64 zero_to);
+int ocfs2_zero_tail(struct inode *inode, struct buffer_head *di_bh,
+ loff_t pos);
int ocfs2_setattr(struct dentry *dentry, struct iattr *attr);
int ocfs2_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
--
1.7.1
--
"Time is an illusion, lunchtime doubly so."
-Douglas Adams
Joel Becker
Consulting Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
Here's the second patch, the one that keeps us from zeroing
pages past i_size. This should keep ocfs2 and Dave's writeback patch
happy.
Joel
-------------------------------------------------------
When ocfs2 fills a hole, it does so by allocating clusters. When a
cluster is larger than the write, ocfs2 must zero the portions of the
cluster outside of the write. If the clustersize is smaller than a
pagecache page, this is handled by the normal pagecache mechanisms, but
when the clustersize is larger than a page, ocfs2's write code will zero
the pages adjacent to the write. This makes sure the entire cluster is
zeroed correctly.
Currently ocfs2 behaves exactly the same when writing past i_size.
However, this means ocfs2 is writing zeroed pages for portions of a new
cluster that are beyond i_size. The page writeback code isn't expecting
this. It treats all pages past the one containing i_size as left behind
due to a previous truncate operation.
Thankfully, ocfs2 calculates the number of pages it will be working on
up front. The rest of the write code merely honors the original
calculation. We can simply trim the number of pages to only cover the
actual file data.
Signed-off-by: Joel Becker <[email protected]>
---
fs/ocfs2/aops.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 96e6aeb..e90ad74 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1130,11 +1130,12 @@ out:
*/
static int ocfs2_grab_pages_for_write(struct address_space *mapping,
struct ocfs2_write_ctxt *wc,
- u32 cpos, loff_t user_pos, int new,
+ u32 cpos, loff_t user_pos,
+ unsigned user_len, int new,
struct page *mmap_page)
{
int ret = 0, i;
- unsigned long start, target_index, index;
+ unsigned long start, target_index, end_index, index;
struct inode *inode = mapping->host;
target_index = user_pos >> PAGE_CACHE_SHIFT;
@@ -1142,11 +1143,17 @@ static int ocfs2_grab_pages_for_write(struct address_space *mapping,
/*
* Figure out how many pages we'll be manipulating here. For
* non allocating write, we just change the one
- * page. Otherwise, we'll need a whole clusters worth.
+ * page. Otherwise, we'll need a whole clusters worth. If we're
+ * writing past i_size, we only need enough pages to cover the
+ * last page of the write.
*/
if (new) {
wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
+ /* This is the index *past* the write */
+ end_index = ((user_pos + user_len) >> PAGE_CACHE_SHIFT) + 1;
+ if ((start + wc->w_num_pages) > end_index)
+ wc->w_num_pages = end_index - start;
} else {
wc->w_num_pages = 1;
start = target_index;
@@ -1789,7 +1796,7 @@ int ocfs2_write_begin_nolock(struct address_space *mapping,
* that we can zero and flush if we error after adding the
* extent.
*/
- ret = ocfs2_grab_pages_for_write(mapping, wc, wc->w_cpos, pos,
+ ret = ocfs2_grab_pages_for_write(mapping, wc, wc->w_cpos, pos, len,
cluster_of_pages, mmap_page);
if (ret) {
mlog_errno(ret);
--
1.7.1
Here's the second version of my corruption fix. It fixes two
bugs:
1) i_size can obviously be at a place that is a hole, so don't BUG on
that.
2) Fix an off-by-one when checking whether the write position is within
the tail allocation.
This version passes my tail corruption test as well as the kernel
compile that exposed the two bugs above.
Joel
---------------------------------------------------------------
ocfs2's allocation unit is the cluster. This can be larger than a block
or even a memory page. This means that a file may have many blocks in
its last extent that are beyond the block containing i_size.
When ocfs2 grows a file, it zeros the entire cluster in order to ensure
future i_size growth will see cleared blocks. Unfortunately,
block_write_full_page() drops the pages past i_size. This means that
ocfs2 is actually leaking garbage data into the tail end of that last
cluster.
We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
when a write or truncate is past i_size. If there is any existing
allocation between the block containing the current i_size and the
location of the write or truncate, zeros will be written to that
allocation.
This is only for sparse filesystems. Non-sparse filesystems already get
this via ocfs2_extend_no_holes().
Signed-off-by: Joel Becker <[email protected]>
---
fs/ocfs2/aops.c | 22 ++++----
fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------
fs/ocfs2/file.h | 2 +
3 files changed, 150 insertions(+), 28 deletions(-)
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 3623ca2..96e6aeb 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -196,15 +196,14 @@ int ocfs2_get_block(struct inode *inode, sector_t iblock,
dump_stack();
goto bail;
}
-
- past_eof = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
- mlog(0, "Inode %lu, past_eof = %llu\n", inode->i_ino,
- (unsigned long long)past_eof);
-
- if (create && (iblock >= past_eof))
- set_buffer_new(bh_result);
}
+ past_eof = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
+ mlog(0, "Inode %lu, past_eof = %llu\n", inode->i_ino,
+ (unsigned long long)past_eof);
+ if (create && (iblock >= past_eof))
+ set_buffer_new(bh_result);
+
bail:
if (err < 0)
err = -EIO;
@@ -1625,11 +1624,9 @@ static int ocfs2_expand_nonsparse_inode(struct inode *inode, loff_t pos,
struct ocfs2_write_ctxt *wc)
{
int ret;
- struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
loff_t newsize = pos + len;
- if (ocfs2_sparse_alloc(osb))
- return 0;
+ BUG_ON(ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
if (newsize <= i_size_read(inode))
return 0;
@@ -1679,7 +1676,10 @@ int ocfs2_write_begin_nolock(struct address_space *mapping,
}
}
- ret = ocfs2_expand_nonsparse_inode(inode, pos, len, wc);
+ if (ocfs2_sparse_alloc(osb))
+ ret = ocfs2_zero_tail(inode, di_bh, pos);
+ else
+ ret = ocfs2_expand_nonsparse_inode(inode, pos, len, wc);
if (ret) {
mlog_errno(ret);
goto out;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6a13ea6..7fca78d 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -848,6 +848,137 @@ out:
return ret;
}
+/*
+ * This function is a helper for ocfs2_zero_tail(). It calculates
+ * what blocks need zeroing and does any CoW necessary.
+ */
+static int ocfs2_zero_tail_prepare(struct inode *inode,
+ struct buffer_head *di_bh,
+ loff_t pos, u64 *start_blkno,
+ u64 *blocks)
+{
+ int rc = 0;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ u32 tail_cpos, pos_cpos, p_cpos;
+ u64 tail_blkno, pos_blkno, blocks_to_zero;
+ unsigned int num_clusters = 0;
+ unsigned int ext_flags = 0;
+
+ /*
+ * The block containing i_size has already been zeroed, so our tail
+ * block is the first block after i_size. The block containing
+ * pos will be zeroed. So we only need to do anything if
+ * tail_blkno is before pos_blkno.
+ */
+ tail_blkno = (i_size_read(inode) >> inode->i_sb->s_blocksize_bits) + 1;
+ pos_blkno = pos >> inode->i_sb->s_blocksize_bits;
+ mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
+ (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
+ if (pos_blkno <= tail_blkno)
+ goto out;
+ blocks_to_zero = pos_blkno - tail_blkno;
+
+ /*
+ * If tail_blkno is in the cluster past i_size, we don't need
+ * to touch the cluster containing i_size at all.
+ */
+ tail_cpos = i_size_read(inode) >> osb->s_clustersize_bits;
+ if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno) > tail_cpos)
+ tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
+ tail_blkno);
+
+ rc = ocfs2_get_clusters(inode, tail_cpos, &p_cpos, &num_clusters,
+ &ext_flags);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+
+ /* Is there a cluster to zero? */
+ if (!p_cpos)
+ goto out;
+
+ pos_cpos = pos >> osb->s_clustersize_bits;
+ mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
+ (unsigned int)tail_cpos, (unsigned int)num_clusters,
+ (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
+ (unsigned long long)pos_blkno);
+ if ((tail_cpos + num_clusters) > pos_cpos) {
+ num_clusters = pos_cpos - tail_cpos;
+ if (pos_blkno >
+ ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
+ num_clusters += 1;
+ } else {
+ blocks_to_zero =
+ ocfs2_clusters_to_blocks(inode->i_sb,
+ tail_cpos + num_clusters);
+ blocks_to_zero -= tail_blkno;
+ }
+
+ /* Now CoW the clusters we're about to zero */
+ if (ext_flags & OCFS2_EXT_REFCOUNTED) {
+ rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
+ num_clusters, UINT_MAX);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+ }
+
+ *start_blkno = tail_blkno;
+ *blocks = blocks_to_zero;
+ mlog(0, "start_blkno = %llu, blocks = %llu\n",
+ (unsigned long long)(*start_blkno),
+ (unsigned long long)(*blocks));
+
+out:
+ return rc;
+}
+
+/*
+ * This function only does work for sparse filesystems.
+ * ocfs2_extend_no_holes() will do the same work for non-sparse files.
+ *
+ * If the last extent of the file has blocks beyond i_size, we must zero
+ * them before we can grow i_size to cover them. Specifically, any
+ * allocation between the block containing the current i_size and the block
+ * containing pos must be zeroed.
+ */
+int ocfs2_zero_tail(struct inode *inode, struct buffer_head *di_bh,
+ loff_t pos)
+{
+ int rc = 0;
+ u64 tail_blkno = 0, blocks_to_zero = 0;
+
+ BUG_ON(!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
+
+ rc = ocfs2_zero_tail_prepare(inode, di_bh, pos, &tail_blkno,
+ &blocks_to_zero);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+
+ if (!blocks_to_zero)
+ goto out;
+
+ mlog(0, "i_size = %llu, tail_blkno = %llu, blocks_to_zero = %llu, pos = %llu, zero_to = %llu\n",
+ (unsigned long long)i_size_read(inode),
+ (unsigned long long)tail_blkno,
+ (unsigned long long)blocks_to_zero,
+ (unsigned long long)pos,
+ (unsigned long long)((tail_blkno + blocks_to_zero) <<
+ inode->i_sb->s_blocksize_bits));
+ rc = ocfs2_zero_extend(inode,
+ (tail_blkno + blocks_to_zero) <<
+ inode->i_sb->s_blocksize_bits);
+ if (rc)
+ mlog_errno(rc);
+
+out:
+ return rc;
+}
+
static int ocfs2_extend_file(struct inode *inode,
struct buffer_head *di_bh,
u64 new_i_size)
@@ -862,27 +993,15 @@ static int ocfs2_extend_file(struct inode *inode,
goto out;
if (i_size_read(inode) == new_i_size)
- goto out;
+ goto out;
BUG_ON(new_i_size < i_size_read(inode));
/*
- * Fall through for converting inline data, even if the fs
- * supports sparse files.
- *
- * The check for inline data here is legal - nobody can add
- * the feature since we have i_mutex. We must check it again
- * after acquiring ip_alloc_sem though, as paths like mmap
- * might have raced us to converting the inode to extents.
- */
- if (!(oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
- && ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
- goto out_update_size;
-
- /*
* The alloc sem blocks people in read/write from reading our
* allocation until we're done changing it. We depend on
* i_mutex to block other extend/truncate calls while we're
- * here.
+ * here. We even have to hold it for sparse files because there
+ * might be some tail zeroing.
*/
down_write(&oi->ip_alloc_sem);
@@ -899,13 +1018,14 @@ static int ocfs2_extend_file(struct inode *inode,
ret = ocfs2_convert_inline_data_to_extents(inode, di_bh);
if (ret) {
up_write(&oi->ip_alloc_sem);
-
mlog_errno(ret);
goto out;
}
}
- if (!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
+ if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
+ ret = ocfs2_zero_tail(inode, di_bh, new_i_size);
+ else
ret = ocfs2_extend_no_holes(inode, new_i_size, new_i_size);
up_write(&oi->ip_alloc_sem);
diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
index d66cf4f..7493d97 100644
--- a/fs/ocfs2/file.h
+++ b/fs/ocfs2/file.h
@@ -56,6 +56,8 @@ int ocfs2_simple_size_update(struct inode *inode,
u64 new_i_size);
int ocfs2_extend_no_holes(struct inode *inode, u64 new_i_size,
u64 zero_to);
+int ocfs2_zero_tail(struct inode *inode, struct buffer_head *di_bh,
+ loff_t pos);
int ocfs2_setattr(struct dentry *dentry, struct iattr *attr);
int ocfs2_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
--
1.7.1
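For readers following the arithmetic in ocfs2_zero_tail_prepare() above, here is a minimal user-space sketch of the block range it computes and of the half-open extent check that changed between the two versions of the patch (">=" in the first version, ">" in the second). The names tail_range, tail_range_for() and pos_in_extent() are hypothetical and exist only for illustration; nothing here is kernel code, and the sketch ignores holes, CoW and the extent lookup entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in that mirrors only the arithmetic in the patch. */
struct tail_range {
	uint64_t tail_blkno;     /* first block after the one holding i_size */
	uint64_t pos_blkno;      /* block holding the write/truncate position */
	uint64_t blocks_to_zero; /* 0 when nothing lies between the two */
};

static struct tail_range tail_range_for(uint64_t i_size, uint64_t pos,
					unsigned int blocksize_bits)
{
	struct tail_range r;

	assert(blocksize_bits < 64);

	/* The block containing i_size is treated as already zeroed. */
	r.tail_blkno = (i_size >> blocksize_bits) + 1;
	/* The block containing pos is handled by the write itself. */
	r.pos_blkno = pos >> blocksize_bits;
	/* Only blocks strictly between the two need attention. */
	r.blocks_to_zero = r.pos_blkno > r.tail_blkno ?
			   r.pos_blkno - r.tail_blkno : 0;
	return r;
}

/*
 * An extent of num_clusters starting at tail_cpos covers the half-open
 * range [tail_cpos, tail_cpos + num_clusters). Cluster pos_cpos lies
 * before the end of that range only when the comparison is strict.
 */
static int pos_in_extent(uint32_t tail_cpos, uint32_t num_clusters,
			 uint32_t pos_cpos)
{
	return (tail_cpos + num_clusters) > pos_cpos;
}
```

With 4 KiB blocks (blocksize_bits = 12), an i_size of 5000 and a write at offset 40960 give tail_blkno = 2, pos_blkno = 10, and 8 candidate blocks to zero. For a one-cluster extent at cluster 0, pos_in_extent() accepts cluster 0 and rejects cluster 1, whereas a ">=" comparison would have accepted cluster 1 as well.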
Hi Joel,
On 07/04/2010 05:33 AM, Joel Becker wrote:
> Here's the second patch, the one that keeps us from zeroing
> pages past i_size. This should keep ocfs2 and Dave's writeback patch
> happy.
>
> Joel
>
> -------------------------------------------------------
>
> When ocfs2 fills a hole, it does so by allocating clusters. When a
> cluster is larger than the write, ocfs2 must zero the portions of the
> cluster outside of the write. If the clustersize is smaller than a
> pagecache page, this is handled by the normal pagecache mechanisms, but
> when the clustersize is larger than a page, ocfs2's write code will zero
> the pages adjacent to the write. This makes sure the entire cluster is
> zeroed correctly.
>
> Currently ocfs2 behaves exactly the same when writing past i_size.
> However, this means ocfs2 is writing zeroed pages for portions of a new
> cluster that are beyond i_size. The page writeback code isn't expecting
> this. It treats all pages past the one containing i_size as left behind
> due to a previous truncate operation.
>
> Thankfully, ocfs2 calculates the number of pages it will be working on
> up front. The rest of the write code merely honors the original
> calculation. We can simply trim the number of pages to only cover the
> actual file data.
>
> Signed-off-by: Joel Becker <[email protected]>
> ---
> fs/ocfs2/aops.c | 15 +++++++++++----
> 1 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 96e6aeb..e90ad74 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
<snip>
> @@ -1142,11 +1143,17 @@ static int ocfs2_grab_pages_for_write(struct address_space *mapping,
> /*
> * Figure out how many pages we'll be manipulating here. For
> * non allocating write, we just change the one
> - * page. Otherwise, we'll need a whole clusters worth.
> + * page. Otherwise, we'll need a whole clusters worth. If we're
> + * writing past i_size, we only need enough pages to cover the
> + * last page of the write.
Does the comment above the function also need to be updated
accordingly?
> */
> if (new) {
> wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
> start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
> + /* This is the index *past* the write */
> + end_index = ((user_pos + user_len) >> PAGE_CACHE_SHIFT) + 1;
Should it be
end_index = ((user_pos + user_len - 1) >> PAGE_CACHE_SHIFT) + 1?
> + if ((start + wc->w_num_pages) > end_index)
> + wc->w_num_pages = end_index - start;
I just noticed that the below loop in ocfs2_grab_pages_for_write is
for (i = 0; i < wc->w_num_pages; i++)
I guess w_num_pages should be set to end_index -
start_page_of_the_cluster so that we can make sure we grab all the pages
in this cluster until i_size?
Regards,
Tao
Hi Joel,
On 07/04/2010 11:13 PM, Tao Ma wrote:
> Hi Joel,
>
> On 07/04/2010 05:33 AM, Joel Becker wrote:
>> Here's the second patch, the one that keeps us from zeroing
>> pages past i_size. This should keep ocfs2 and Dave's writeback patch
>> happy.
>>
>> Joel
>>
>> -------------------------------------------------------
>>
>> When ocfs2 fills a hole, it does so by allocating clusters. When a
>> cluster is larger than the write, ocfs2 must zero the portions of the
>> cluster outside of the write. If the clustersize is smaller than a
>> pagecache page, this is handled by the normal pagecache mechanisms, but
>> when the clustersize is larger than a page, ocfs2's write code will zero
>> the pages adjacent to the write. This makes sure the entire cluster is
>> zeroed correctly.
>>
>> Currently ocfs2 behaves exactly the same when writing past i_size.
>> However, this means ocfs2 is writing zeroed pages for portions of a new
>> cluster that are beyond i_size. The page writeback code isn't expecting
>> this. It treats all pages past the one containing i_size as left behind
>> due to a previous truncate operation.
>>
>> Thankfully, ocfs2 calculates the number of pages it will be working on
>> up front. The rest of the write code merely honors the original
>> calculation. We can simply trim the number of pages to only cover the
>> actual file data.
>>
>> Signed-off-by: Joel Becker <[email protected]>
>> ---
>> fs/ocfs2/aops.c | 15 +++++++++++----
>> 1 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index 96e6aeb..e90ad74 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
> <snip>
>> @@ -1142,11 +1143,17 @@ static int ocfs2_grab_pages_for_write(struct address_space *mapping,
>> /*
>> * Figure out how many pages we'll be manipulating here. For
>> * non allocating write, we just change the one
>> - * page. Otherwise, we'll need a whole clusters worth.
>> + * page. Otherwise, we'll need a whole clusters worth. If we're
>> + * writing past i_size, we only need enough pages to cover the
>> + * last page of the write.
> Does the comment above the function also need to be updated
> accordingly?
>> */
>> if (new) {
>> wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
>> start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
>> + /* This is the index *past* the write */
>> + end_index = ((user_pos + user_len) >> PAGE_CACHE_SHIFT) + 1;
> Should it be
> end_index = ((user_pos + user_len - 1) >> PAGE_CACHE_SHIFT) + 1?
>
>
>> + if ((start + wc->w_num_pages) > end_index)
>> + wc->w_num_pages = end_index - start;
> I just noticed that the below loop in ocfs2_grab_pages_for_write is
> for (i = 0; i < wc->w_num_pages; i++)
>
> I guess w_num_pages should be set to end_index -
> start_page_of_the_cluster so that we can make sure we grab all the pages
> in this cluster until i_size?
Oh, start is already set to that value; sorry for the noise.
By the way, can start + wc->w_num_pages ever be greater than end_index?
I can't find such a case.
Regards,
Tao
Hi Joel,
On 07/04/2010 05:32 AM, Joel Becker wrote:
> Here's the second version of my corruption fix. It fixes two
> bugs:
>
> 1) i_size can obviously be at a place that is a hole, so don't BUG on
> that.
> 2) Fix an off-by-one when checking whether the write position is within
> the tail allocation.
>
> This version passes my tail corruption test as well as the kernel
> compile that exposed the two bugs above.
>
> Joel
>
> ---------------------------------------------------------------
>
> ocfs2's allocation unit is the cluster. This can be larger than a block
> or even a memory page. This means that a file may have many blocks in
> its last extent that are beyond the block containing i_size.
>
> When ocfs2 grows a file, it zeros the entire cluster in order to ensure
> future i_size growth will see cleared blocks. Unfortunately,
> block_write_full_page() drops the pages past i_size. This means that
> ocfs2 is actually leaking garbage data into the tail end of that last
> cluster.
>
> We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
> when a write or truncate is past i_size. If there is any existing
> allocation between the block containing the current i_size and the
> location of the write or truncate, zeros will be written to that
> allocation.
>
> This is only for sparse filesystems. Non-sparse filesystems already get
> this via ocfs2_extend_no_holes().
>
> Signed-off-by: Joel Becker <[email protected]>
> ---
> fs/ocfs2/aops.c | 22 ++++----
> fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------
> fs/ocfs2/file.h | 2 +
> 3 files changed, 150 insertions(+), 28 deletions(-)
>
<snip>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6a13ea6..7fca78d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -848,6 +848,137 @@ out:
> return ret;
> }
>
> +/*
> + * This function is a helper for ocfs2_zero_tail(). It calculates
> + * what blocks need zeroing and does any CoW necessary.
> + */
> +static int ocfs2_zero_tail_prepare(struct inode *inode,
> + struct buffer_head *di_bh,
> + loff_t pos, u64 *start_blkno,
> + u64 *blocks)
> +{
> + int rc = 0;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + u32 tail_cpos, pos_cpos, p_cpos;
> + u64 tail_blkno, pos_blkno, blocks_to_zero;
> + unsigned int num_clusters = 0;
> + unsigned int ext_flags = 0;
> +
> + /*
> + * The block containing i_size has already been zeroed, so our tail
> + * block is the first block after i_size. The block containing
> + * pos will be zeroed. So we only need to do anything if
> + * tail_blkno is before pos_blkno.
> + */
> + tail_blkno = (i_size_read(inode) >> inode->i_sb->s_blocksize_bits) + 1;
> + pos_blkno = pos >> inode->i_sb->s_blocksize_bits;
> + mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
> + if (pos_blkno <= tail_blkno)
> + goto out;
> + blocks_to_zero = pos_blkno - tail_blkno;
> +
> + /*
> + * If tail_blkno is in the cluster past i_size, we don't need
> + * to touch the cluster containing i_size at all.
> + */
> + tail_cpos = i_size_read(inode) >> osb->s_clustersize_bits;
> + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno) > tail_cpos)
> + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> + tail_blkno);
Can we always set tail_cpos in one line?
tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
tail_cpos is either the cluster containing i_size or the next one, and
I guess both cases work for tail_blkno?
> +
> + rc = ocfs2_get_clusters(inode, tail_cpos, &p_cpos, &num_clusters,
> + &ext_flags);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> +
> + /* Is there a cluster to zero? */
> + if (!p_cpos)
> + goto out;
For an unwritten extent, do we also need to clear the pages? If so, the
solution is incomplete when there are two unwritten extents, one
containing i_size and one past i_size. Here we only clear the pages for
the first unwritten extent and leave the second one untouched.
> +
> + pos_cpos = pos >> osb->s_clustersize_bits;
> + mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned int)tail_cpos, (unsigned int)num_clusters,
> + (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
> + (unsigned long long)pos_blkno);
The code from here to the CoW call is a bit hard to follow. The 'if'
branch sets num_clusters for CoW while the 'else' branch sets
blocks_to_zero, so it isn't easy for the reader to tell why the two
branches set different values. How about my code below? I think it is
more straightforward.
> + if ((tail_cpos + num_clusters) > pos_cpos) {
> + num_clusters = pos_cpos - tail_cpos;
> + if (pos_blkno >
> + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> + num_clusters += 1;
> + } else {
> + blocks_to_zero =
> + ocfs2_clusters_to_blocks(inode->i_sb,
> + tail_cpos + num_clusters);
> + blocks_to_zero -= tail_blkno;
> + }
> +
> + /* Now CoW the clusters we're about to zero */
> + if (ext_flags & OCFS2_EXT_REFCOUNTED) {
> + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> + num_clusters, UINT_MAX);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> + }
/* Decrease blocks_to_zero if there is some hole after extent */
if (tail_cpos + num_clusters <= pos_cpos) {
blocks_to_zero =
ocfs2_clusters_to_blocks(inode->i_sb,
tail_cpos + num_clusters);
blocks_to_zero -= tail_blkno;
}
/* Now CoW if we have some refcounted clusters. */
if (ext_flags & OCFS2_EXT_REFCOUNTED) {
/*
* We add one more cluster here since it will be
* written shortly and if the pos_blkno isn't aligned
* to the cluster size, we have to zero the blocks
* before it.
*/
if (tail_cpos + num_clusters > pos_cpos)
num_clusters = pos_cpos - tail_cpos + 1;
rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
num_clusters, UINT_MAX);
if (rc) {
mlog_errno(rc);
goto out;
}
}
> +
> + *start_blkno = tail_blkno;
> + *blocks = blocks_to_zero;
> + mlog(0, "start_blkno = %llu, blocks = %llu\n",
> + (unsigned long long)(*start_blkno),
> + (unsigned long long)(*blocks));
> +
> +out:
> + return rc;
> +}
Regards,
Tao
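The question of whether tail_cpos can be set in one line can be checked mechanically in user space. The sketch below compares the two-step form from the patch with the one-line form suggested in the review. It rests on one labeled assumption: that ocfs2_blocks_to_clusters() rounds down, i.e. behaves like a plain right shift by (clustersize_bits - blocksize_bits). The helper blocks_to_clusters() here is a hypothetical stand-in written under that assumption, not the kernel function, and the other names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in helper; ASSUMES the real one rounds down like a shift. */
static uint32_t blocks_to_clusters(uint64_t blocks, unsigned int bbits,
				   unsigned int cbits)
{
	assert(cbits >= bbits);
	return (uint32_t)(blocks >> (cbits - bbits));
}

/* Two-step form, as written in the patch. */
static uint32_t tail_cpos_two_step(uint64_t i_size, unsigned int bbits,
				   unsigned int cbits)
{
	uint64_t tail_blkno = (i_size >> bbits) + 1;
	uint32_t tail_cpos = (uint32_t)(i_size >> cbits);

	if (blocks_to_clusters(tail_blkno, bbits, cbits) > tail_cpos)
		tail_cpos = blocks_to_clusters(tail_blkno, bbits, cbits);
	return tail_cpos;
}

/* One-line form, as suggested in the review. */
static uint32_t tail_cpos_one_line(uint64_t i_size, unsigned int bbits,
				   unsigned int cbits)
{
	return blocks_to_clusters((i_size >> bbits) + 1, bbits, cbits);
}

/* Returns 1 if both forms agree for every i_size in [0, limit). */
static int forms_agree(uint64_t limit, unsigned int bbits,
		       unsigned int cbits)
{
	uint64_t i_size;

	for (i_size = 0; i_size < limit; i_size++)
		if (tail_cpos_two_step(i_size, bbits, cbits) !=
		    tail_cpos_one_line(i_size, bbits, cbits))
			return 0;
	return 1;
}
```

Under the rounding-down assumption the two forms agree, because the cluster of block (i_size >> bbits) + 1 can never be lower than the cluster of block i_size >> bbits. If the real helper rounds differently, the check would need to be redone against it.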
On Sun, Jul 04, 2010 at 11:13:01PM +0800, Tao Ma wrote:
> On 07/04/2010 05:33 AM, Joel Becker wrote:
> >@@ -1142,11 +1143,17 @@ static int ocfs2_grab_pages_for_write(struct address_space *mapping,
> > /*
> > * Figure out how many pages we'll be manipulating here. For
> > * non allocating write, we just change the one
> >- * page. Otherwise, we'll need a whole clusters worth.
> >+ * page. Otherwise, we'll need a whole clusters worth. If we're
> >+ * writing past i_size, we only need enough pages to cover the
> >+ * last page of the write.
> Does the comment above the function also need to be updated
> accordingly?
Not really. That comment sets a limit; this comment is more
detailed.
> > if (new) {
> > wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
> > start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
> >+ /* This is the index *past* the write */
> >+ end_index = ((user_pos + user_len) >> PAGE_CACHE_SHIFT) + 1;
> Should it be
> end_index = ((user_pos + user_len - 1) >> PAGE_CACHE_SHIFT) + 1?
Maybe. Gotta think about it and test.
Joel
On Mon, Jul 05, 2010 at 09:38:42AM +0800, Tao Ma wrote:
> btw, do we ever have a chance that start + wc->w_num_pages >
> end_index? I can't find it.
Of course. If you have a 1MB clustersize, w_num_pages will be
256. But if you are only writing the first page of the cluster,
end_index is only 1.
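To put numbers on that, here is a minimal sketch of the trimming, assuming 4KB pages. The helper name sketch_num_pages() is hypothetical and is not the ocfs2 function; end_index is taken to be the page index just past the write.

```c
#include <assert.h>

/*
 * Hypothetical helper: start with a whole cluster's worth of pages and
 * trim it so we never grab pages past the end of the write.
 */
static unsigned long sketch_num_pages(unsigned long start,
				      unsigned long pages_per_cluster,
				      unsigned long end_index)
{
	unsigned long num = pages_per_cluster;

	/* end_index is one past the last page written, so it exceeds start */
	assert(end_index > start);

	if (start + num > end_index)
		num = end_index - start;
	return num;
}
```

With a 1MB cluster and 4KB pages, pages_per_cluster is 256; a write that touches only the first page has end_index 1, so the count is trimmed from 256 to 1.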
Joel
On Mon, Jul 05, 2010 at 11:51:44AM +0800, Tao Ma wrote:
> >+ /*
> >+ * If tail_blkno is in the cluster past i_size, we don't need
> >+ * to touch the cluster containing i_size at all.
> >+ */
> >+ tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits;
> >+ if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos)
> >+ tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> >+ tail_blkno);
> Can we always set tail_cpos in one line?
> tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
> tail_cpos is either the same cluster as i_size or the next cluster
> and both work for tail_blkno, I guess?
I had the same thought on Friday, but the current version passes
testing and I was wary of changing that.
> >+ /* Is there a cluster to zero? */
> >+ if (!p_cpos)
> >+ goto out;
> For unwritten extents, do we also need to clear the pages? If yes, the
> solution isn't complete if we have 2 unwritten extents, one
> containing i_size and one past i_size. Here we only clear the
> pages for the 1st unwritten extent and leave the 2nd one untouched.
We probably don't need to zero unwritten extents. We cannot
have an extent past i_size, can we?
> From here to the call of CoW is a bit hard to understand. In 'if',
> num_clusters is set for CoW and in 'else', blocks_to_zero is set. So
> it isn't easy for the reader to tell why these 2 clauses are setting
> different values. So how about my code below? It looks more
> straightforward I think.
> >+ if ((tail_cpos + num_clusters)> pos_cpos) {
> >+ num_clusters = pos_cpos - tail_cpos;
> >+ if (pos_blkno>
> >+ ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> >+ num_clusters += 1;
> >+ } else {
> >+ blocks_to_zero =
> >+ ocfs2_clusters_to_blocks(inode->i_sb,
> >+ tail_cpos + num_clusters);
> >+ blocks_to_zero -= tail_blkno;
> >+ }
> >+
> >+ /* Now CoW the clusters we're about to zero */
> >+ if (ext_flags& OCFS2_EXT_REFCOUNTED) {
> >+ rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> >+ num_clusters, UINT_MAX);
> >+ if (rc) {
> >+ mlog_errno(rc);
> >+ goto out;
> >+ }
> >+ }
> /* Decrease blocks_to_zero if there is some hole after extent */
> if (tail_cpos + num_clusters <= pos_cpos) {
> blocks_to_zero =
> ocfs2_clusters_to_blocks(inode->i_sb,
> tail_cpos + num_clusters);
> blocks_to_zero -= tail_blkno;
> }
Not a bad split-out here.
> /* Now CoW if we have some refcounted clusters. */
> if (ext_flags & OCFS2_EXT_REFCOUNTED) {
> /*
> * We add one more cluster here since it will be
> * written shortly and if the pos_blkno isn't aligned
> * to the cluster size, we have to zero the blocks
> * before it.
> */
> if (tail_cpos + num_clusters > pos_cpos)
> num_clusters = pos_cpos - tail_cpos + 1;
But you dropped the check for pos_blkno alignment.
Unconditionally adding the +1 doesn't seem like a good idea.
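To make the difference concrete, a minimal sketch of the alignment-checked count, under the assumption that clusters convert to blocks by a plain shift. The function name and the blocks_per_cluster_bits parameter are hypothetical stand-ins for the ocfs2 helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many clusters to CoW, starting at tail_cpos.
 * The cluster holding pos_blkno is included only when pos_blkno is not
 * cluster aligned, i.e. when there are blocks before it to zero.
 */
static uint32_t sketch_cow_clusters(uint32_t tail_cpos, uint32_t num_clusters,
				    uint32_t pos_cpos, uint64_t pos_blkno,
				    unsigned blocks_per_cluster_bits)
{
	uint64_t pos_cluster_start;

	assert(pos_cpos >= tail_cpos);

	if (tail_cpos + num_clusters > pos_cpos) {
		num_clusters = pos_cpos - tail_cpos;
		pos_cluster_start = (uint64_t)pos_cpos << blocks_per_cluster_bits;
		if (pos_blkno > pos_cluster_start)
			num_clusters += 1;
	}
	return num_clusters;
}
```

With 8 blocks per cluster, tail_cpos 2 and pos_cpos 5, an aligned pos_blkno of 40 gives 3 clusters, while pos_blkno 43 gives 4. The unconditional +1 would give 4 in both cases.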
Joel
Hi Joel,
On 07/06/2010 03:17 PM, Joel Becker wrote:
> On Mon, Jul 05, 2010 at 11:51:44AM +0800, Tao Ma wrote:
>>> + /*
>>> + * If tail_blkno is in the cluster past i_size, we don't need
>>> + * to touch the cluster containing i_size at all.
>>> + */
>>> + tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits;
>>> + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos)
>>> + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
>>> + tail_blkno);
>> Can we always set tail_cpos in one line?
>> tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
>> tail_cpos is either the same cluster as i_size or the next cluster
>> and both work for tail_blkno, I guess?
>
> I had the same thought on Friday, but the current version passes
> testing and I was wary of changing that.
ok, so as you wish.
>
>>> + /* Is there a cluster to zero? */
>>> + if (!p_cpos)
>>> + goto out;
>> For unwritten extents, do we also need to clear the pages? If yes, the
>> solution isn't complete if we have 2 unwritten extents, one
>> containing i_size and one past i_size. Here we only clear the
>> pages for the 1st unwritten extent and leave the 2nd one untouched.
>
> We probably don't need to zero unwritten extents. We cannot
> have an extent past i_size, can we?
We can. AFAICS, ocfs2_change_file_space will allocate unwritten extents
and doesn't change i_size.
>
>> From here to the call of CoW is a bit hard to understand. In 'if',
>> num_clusters is set for CoW and in 'else', blocks_to_zero is set. So
>> it isn't easy for the reader to tell why these 2 clauses are setting
>> different values. So how about my code below? It looks more
>> straightforward I think.
>>> + if ((tail_cpos + num_clusters)> pos_cpos) {
>>> + num_clusters = pos_cpos - tail_cpos;
>>> + if (pos_blkno>
>>> + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
>>> + num_clusters += 1;
>>> + } else {
>>> + blocks_to_zero =
>>> + ocfs2_clusters_to_blocks(inode->i_sb,
>>> + tail_cpos + num_clusters);
>>> + blocks_to_zero -= tail_blkno;
>>> + }
>>> +
>>> + /* Now CoW the clusters we're about to zero */
>>> + if (ext_flags& OCFS2_EXT_REFCOUNTED) {
>>> + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
>>> + num_clusters, UINT_MAX);
>>> + if (rc) {
>>> + mlog_errno(rc);
>>> + goto out;
>>> + }
>>> + }
>> /* Decrease blocks_to_zero if there is some hole after extent */
>> if (tail_cpos + num_clusters<= pos_cpos) {
>> blocks_to_zero =
>> ocfs2_clusters_to_blocks(inode->i_sb,
>> tail_cpos + num_clusters);
>> blocks_to_zero -= tail_blkno;
>> }
>
> Not a bad split-out here.
>
>> /* Now CoW if we have some refcounted clusters. */
>> if (ext_flags& OCFS2_EXT_REFCOUNTED) {
>> /*
>> * We add one more cluster here since it will be
>> * written shortly and if the pos_blkno isn't aligned
>> * to the cluster size, we have to zero the blocks
>> * before it.
>> */
>> if (tail_cpos + num_clusters> pos_cpos)
>> num_clusters = pos_cpos - tail_cpos + 1;
>
> But you dropped the check for pos_blkno alignment.
> Unconditionally adding the +1 doesn't seem like a good idea.
You can add it as you wish.
I just thought that you add one extra cluster if pos_blkno isn't
aligned so as to zero the blocks in [pos_cpos_start_block, pos_blkno).
But as I said in the comments, you will soon write pos_blkno (it also
needs to be CoWed since it is within this refcounted extent), so if we can
CoW it out now, maybe we have a chance to not call ocfs2_refcount_cow later.
Regards,
Tao
On Tue, Jul 06, 2010 at 03:54:58PM +0800, Tao Ma wrote:
> On 07/06/2010 03:17 PM, Joel Becker wrote:
> >>>+ /* Is there a cluster to zero? */
> >>>+ if (!p_cpos)
> >>>+ goto out;
> >>For unwritten extents, do we also need to clear the pages? If yes, the
> >>solution isn't complete if we have 2 unwritten extents, one
> >>containing i_size and one past i_size. Here we only clear the
> >>pages for the 1st unwritten extent and leave the 2nd one untouched.
> >
> > We probably don't need to zero unwritten extents. We cannot
> >have an extent past i_size, can we?
> We can. AFAICS, ocfs2_change_file_space will allocate unwritten
> extents and doesn't change i_size.
Oh, you're right. We need to walk the entire extent range
between i_size and pos and figure out what needs CoW. This needs to
happen no matter what.
> > But you dropped the check for pos_blkno alignment.
> >Unconditionally adding the +1 doesn't seem like a good idea.
> You can add it as you wish.
> I just thought that you add one more extra cluster if pos_blkno
> isn't aligned so as to zero blocks in [pos_cpos_start_block,
> pos_blkno).
> But as I said in the comments, you will soon write pos_blkno (it also
> needs to be CoW since it is within this refcounted extent), so if we
> can CoW it out now, maybe we have a chance to not call
> ocfs2_refcount_cow later.
I'd much rather let the write handle its own contiguousness. If
we get lucky, that CoW melds with our CoW. If we don't get lucky, isn't
it better to have the newly changed area be fully contiguous rather than
have the first extent of it not be and then the remaining extents be?
Joel
On Tue, Jul 06, 2010 at 12:09:19AM -0700, Joel Becker wrote:
> On Sun, Jul 04, 2010 at 11:13:01PM +0800, Tao Ma wrote:
> > > if (new) {
> > > wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
> > > start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
> > >+ /* This is the index *past* the write */
> > >+ end_index = ((user_pos + user_len)>> PAGE_CACHE_SHIFT) + 1;
> > should it be
> > end_index = ((user_pos + user_len - 1) >> PAGE_CACHE_SHIFT) + 1?
>
> Maybe. Gotta think about it and test.
I think you're right. Since there are other changes too, I'm
going to add this in and test it.
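For reference, a minimal sketch comparing the two formulas, assuming 4KB pages and a non-empty write. The function names are hypothetical; only the arithmetic comes from the patch and the review comment.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KB pages */

/* Original patch: shifts the byte just past the write. */
static unsigned long sketch_end_index_v1(uint64_t user_pos, unsigned user_len)
{
	return (unsigned long)((user_pos + user_len) >> SKETCH_PAGE_SHIFT) + 1;
}

/* Suggested fix: shifts the last byte actually written. */
static unsigned long sketch_end_index_v2(uint64_t user_pos, unsigned user_len)
{
	assert(user_len > 0);
	return (unsigned long)((user_pos + user_len - 1) >> SKETCH_PAGE_SHIFT) + 1;
}
```

The two only differ when the write ends exactly on a page boundary: a write of bytes [0, 4096) touches page 0 alone, yet the first form returns 2 where the second returns 1.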
Joel
On Mon, Jul 05, 2010 at 11:51:44AM +0800, Tao Ma wrote:
> On 07/04/2010 05:32 AM, Joel Becker wrote:
> >+ /*
> >+ * If tail_blkno is in the cluster past i_size, we don't need
> >+ * to touch the cluster containing i_size at all.
> >+ */
> >+ tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits;
> >+ if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos)
> >+ tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> >+ tail_blkno);
> Can we always set tail_cpos in one line?
> tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
> tail_cpos is either the same cluster as i_size or the next cluster
> and both work for tail_blkno, I guess?
I'm taking this as well.
Joel
On Mon, Jul 05, 2010 at 11:51:44AM +0800, Tao Ma wrote:
> On 07/04/2010 05:32 AM, Joel Becker wrote:
> From here to the call of CoW is a bit hard to understand. In 'if',
> num_clusters is set for CoW and in 'else', blocks_to_zero is set. So
> it isn't easy for the reader to tell why these 2 clauses are setting
> different values. So how about my code below? It looks more
> straightforward I think.
I took your cleanup mostly.
Joel
On Mon, Jun 28, 2010 at 06:58:23PM -0700, Joel Becker wrote:
> Oh, no, that's not it at all. This is a disaster. I can't see
> for the life of me why we haven't had 100,000 bug reports. You're going
Btw, we've figured out why we don't have 100,000 bug reports.
Most normal usage will never encounter this. But it is still a serious
problem, and the fix is just as high a priority.
Joel
P.S.: Thanks, LWN, for making me look good on the QotW ;-)
Joel Becker wrote:
> On Tue, Jul 06, 2010 at 03:54:58PM +0800, Tao Ma wrote:
>
>> On 07/06/2010 03:17 PM, Joel Becker wrote:
>>
>>>>> + /* Is there a cluster to zero? */
>>>>> + if (!p_cpos)
>>>>> + goto out;
>>>>>
>>>> For unwritten extents, do we also need to clear the pages? If yes, the
>>>> solution isn't complete if we have 2 unwritten extents, one
>>>> containing i_size and one past i_size. Here we only clear the
>>>> pages for the 1st unwritten extent and leave the 2nd one untouched.
>>>>
>>> We probably don't need to zero unwritten extents. We cannot
>>> have an extent past i_size, can we?
>>>
>> We can. AFAICS, ocfs2_change_file_space will allocate unwritten
>> extents and doesn't change i_size.
>>
>
> Oh, you're right. We need to walk the entire extent range
> between i_size and pos and figure out what needs CoW. This needs to
> happen no matter what.
>
Actually we can only have unwritten extents after i_size and it
shouldn't hurt you in this case.
So do we really need to CoW all the unwritten extents?
All I want to say is that since they are unwritten, they should also
read as 'zero' from user space.
So can we just skip clearing pages if i_size is in an unwritten
extent?
>
>>> But you dropped the check for pos_blkno alignment.
>>> Unconditionally adding the +1 doesn't seem like a good idea.
>>>
>> You can add it as you wish.
>> I just thought that you add one more extra cluster if pos_blkno
>> isn't aligned so as to zero blocks in [pos_cpos_start_block,
>> pos_blkno).
>> But as I said in the comments, you will soon write pos_blkno (it also
>> needs to be CoW since it is within this refcounted extent), so if we
>> can CoW it out now, maybe we have a chance to not call
>> ocfs2_refcount_cow later.
>>
>
> I'd much rather let the write handle its own contiguousness. If
> we get lucky, that CoW melds with our CoW. If we don't get lucky, isn't
> it better to have the newly changed area be fully contiguous rather than
> have the first extent of it not be and then the remaining extents be?
>
fair enough.
Regards,
Tao
On Wed, Jul 07, 2010 at 08:42:53AM +0800, Tao Ma wrote:
> > Oh, you're right. We need to walk the entire extent range
> >between i_size and pos and figure out what needs CoW. This needs to
> >happen no matter what.
> Actually we can only have unwritten extents after i_size and it
> shouldn't hurt you in this case.
> So do we really need to CoW all the unwritten extents?
> All I want to say is that since they are unwritten, they should also
> mean 'zero' for the user space.
> So can we just skip clearing pages if i_size is in an
> unwritten extent?
We can certainly have unwritten extents in the middle too ;-)
I've just reworked the entire ocfs2_zero_extend() logic to skip
unwritten extents and CoW refcounted ones. We have to CoW for nonsparse
anyway, so we needed this logic. We do need to walk the entire range,
just in case there are extents anywhere between i_size and pos.
Patches coming as soon as it stops breaking.
Joel
When ocfs2 fills a hole, it does so by allocating clusters. When a
cluster is larger than the write, ocfs2 must zero the portions of the
cluster outside of the write. If the clustersize is smaller than a
pagecache page, this is handled by the normal pagecache mechanisms, but
when the clustersize is larger than a page, ocfs2's write code will zero
the pages adjacent to the write. This makes sure the entire cluster is
zeroed correctly.
Currently ocfs2 behaves exactly the same when writing past i_size.
However, this means ocfs2 is writing zeroed pages for portions of a new
cluster that are beyond i_size. The page writeback code isn't expecting
this. It treats all pages past the one containing i_size as left behind
due to a previous truncate operation.
Thankfully, ocfs2 calculates the number of pages it will be working on
up front. The rest of the write code merely honors the original
calculation. We can simply trim the number of pages to only cover the
actual file data.
Signed-off-by: Joel Becker <[email protected]>
---
fs/ocfs2/aops.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 8d6dc3f..9b3381a 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1100,11 +1100,12 @@ out:
*/
static int ocfs2_grab_pages_for_write(struct address_space *mapping,
struct ocfs2_write_ctxt *wc,
- u32 cpos, loff_t user_pos, int new,
+ u32 cpos, loff_t user_pos,
+ unsigned user_len, int new,
struct page *mmap_page)
{
int ret = 0, i;
- unsigned long start, target_index, index;
+ unsigned long start, target_index, end_index, index;
struct inode *inode = mapping->host;
target_index = user_pos >> PAGE_CACHE_SHIFT;
@@ -1112,11 +1113,17 @@ static int ocfs2_grab_pages_for_write(struct address_space *mapping,
/*
* Figure out how many pages we'll be manipulating here. For
* non allocating write, we just change the one
- * page. Otherwise, we'll need a whole clusters worth.
+ * page. Otherwise, we'll need a whole clusters worth. If we're
+ * writing past i_size, we only need enough pages to cover the
+ * last page of the write.
*/
if (new) {
wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
+ /* This is the index *past* the write */
+ end_index = ((user_pos + user_len - 1) >> PAGE_CACHE_SHIFT) + 1;
+ if ((start + wc->w_num_pages) > end_index)
+ wc->w_num_pages = end_index - start;
} else {
wc->w_num_pages = 1;
start = target_index;
@@ -1761,7 +1768,7 @@ int ocfs2_write_begin_nolock(struct address_space *mapping,
* that we can zero and flush if we error after adding the
* extent.
*/
- ret = ocfs2_grab_pages_for_write(mapping, wc, wc->w_cpos, pos,
+ ret = ocfs2_grab_pages_for_write(mapping, wc, wc->w_cpos, pos, len,
cluster_of_pages, mmap_page);
if (ret) {
mlog_errno(ret);
--
1.7.1
ocfs2_zero_extend() does its zeroing block by block, but it calls a
function named ocfs2_write_zero_page(). Let's have
ocfs2_write_zero_page() handle the page level. From
ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
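The page-at-a-time walk can be sketched as below. This is only an illustration of the offset arithmetic, assuming 4KB pages; the sk_ names are hypothetical and no pagecache work is done.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096ULL			/* assumption: 4KB pages */
#define SK_PAGE_MASK (~(SK_PAGE_SIZE - 1))

/* End of the chunk starting at start_off: the next page boundary or zero_to. */
static uint64_t sk_next_off(uint64_t start_off, uint64_t zero_to)
{
	uint64_t next_off;

	assert(start_off < zero_to);

	next_off = (start_off & SK_PAGE_MASK) + SK_PAGE_SIZE;
	if (next_off > zero_to)
		next_off = zero_to;
	return next_off;
}

/* Number of per-page calls needed to cover [start_off, zero_to). */
static unsigned sk_count_chunks(uint64_t start_off, uint64_t zero_to)
{
	unsigned n = 0;

	while (start_off < zero_to) {
		start_off = sk_next_off(start_off, zero_to);
		n++;
	}
	return n;
}
```

Zeroing from byte 5000 to byte 20000 then takes four calls, covering [5000, 8192), [8192, 12288), [12288, 16384) and [16384, 20000), rather than one call per block.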
Signed-off-by: Joel Becker <[email protected]>
---
fs/ocfs2/aops.c | 30 --------------
fs/ocfs2/file.c | 119 +++++++++++++++++++++++++++++++++++++++----------------
2 files changed, 85 insertions(+), 64 deletions(-)
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 3623ca2..9a5c931 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -459,36 +459,6 @@ int walk_page_buffers( handle_t *handle,
return ret;
}
-handle_t *ocfs2_start_walk_page_trans(struct inode *inode,
- struct page *page,
- unsigned from,
- unsigned to)
-{
- struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
- handle_t *handle;
- int ret = 0;
-
- handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
- if (IS_ERR(handle)) {
- ret = -ENOMEM;
- mlog_errno(ret);
- goto out;
- }
-
- if (ocfs2_should_order_data(inode)) {
- ret = ocfs2_jbd2_file_inode(handle, inode);
- if (ret < 0)
- mlog_errno(ret);
- }
-out:
- if (ret) {
- if (!IS_ERR(handle))
- ocfs2_commit_trans(osb, handle);
- handle = ERR_PTR(ret);
- }
- return handle;
-}
-
static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block)
{
sector_t status;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6a13ea6..a6e0eb6 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -724,28 +724,55 @@ leave:
return status;
}
+/*
+ * While a write will already be ordering the data, a truncate will not.
+ * Thus, we need to explicitly order the zeroed pages.
+ */
+static handle_t *ocfs2_zero_start_ordered_transaction(struct inode *inode)
+{
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ handle_t *handle = NULL;
+ int ret = 0;
+
+ if (!ocfs2_should_order_data(inode))
+ goto out;
+
+ handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
+ if (IS_ERR(handle)) {
+ ret = -ENOMEM;
+ mlog_errno(ret);
+ goto out;
+ }
+
+ ret = ocfs2_jbd2_file_inode(handle, inode);
+ if (ret < 0)
+ mlog_errno(ret);
+
+out:
+ if (ret) {
+ if (!IS_ERR(handle))
+ ocfs2_commit_trans(osb, handle);
+ handle = ERR_PTR(ret);
+ }
+ return handle;
+}
+
/* Some parts of this taken from generic_cont_expand, which turned out
* to be too fragile to do exactly what we need without us having to
* worry about recursive locking in ->write_begin() and ->write_end(). */
-static int ocfs2_write_zero_page(struct inode *inode,
- u64 size)
+static int ocfs2_write_zero_page(struct inode *inode, u64 abs_from,
+ u64 abs_to)
{
struct address_space *mapping = inode->i_mapping;
struct page *page;
- unsigned long index;
- unsigned int offset;
+ unsigned long index = abs_from >> PAGE_CACHE_SHIFT;
handle_t *handle = NULL;
int ret;
+ unsigned zero_from, zero_to, block_start, block_end;
- offset = (size & (PAGE_CACHE_SIZE-1)); /* Within page */
- /* ugh. in prepare/commit_write, if from==to==start of block, we
- ** skip the prepare. make sure we never send an offset for the start
- ** of a block
- */
- if ((offset & (inode->i_sb->s_blocksize - 1)) == 0) {
- offset++;
- }
- index = size >> PAGE_CACHE_SHIFT;
+ BUG_ON(abs_from >= abs_to);
+ BUG_ON(abs_to > ((index + 1) << PAGE_CACHE_SHIFT));
+ BUG_ON(abs_from & (inode->i_blkbits - 1));
page = grab_cache_page(mapping, index);
if (!page) {
@@ -754,31 +781,52 @@ static int ocfs2_write_zero_page(struct inode *inode,
goto out;
}
- ret = ocfs2_prepare_write_nolock(inode, page, offset, offset);
- if (ret < 0) {
- mlog_errno(ret);
- goto out_unlock;
- }
+ /* Get the offsets within the page that we want to zero */
+ zero_from = abs_from & (PAGE_CACHE_SIZE - 1);
+ zero_to = abs_to & (PAGE_CACHE_SIZE - 1);
+ if (!zero_to)
+ zero_to = PAGE_CACHE_SIZE;
- if (ocfs2_should_order_data(inode)) {
- handle = ocfs2_start_walk_page_trans(inode, page, offset,
- offset);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- handle = NULL;
+ /* We know that zero_from is block aligned */
+ for (block_start = zero_from;
+ (block_start < PAGE_CACHE_SIZE) && (block_start < zero_to);
+ block_start = block_end) {
+ block_end = block_start + (1 << inode->i_blkbits);
+
+ /*
+ * block_start is block-aligned. Bump it by one to
+ * force ocfs2_{prepare,commit}_write() to zero the
+ * whole block.
+ */
+ ret = ocfs2_prepare_write_nolock(inode, page,
+ block_start + 1,
+ block_start + 1);
+ if (ret < 0) {
+ mlog_errno(ret);
goto out_unlock;
}
- }
- /* must not update i_size! */
- ret = block_commit_write(page, offset, offset);
- if (ret < 0)
- mlog_errno(ret);
- else
- ret = 0;
+ if (!handle) {
+ handle = ocfs2_zero_start_ordered_transaction(inode);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ handle = NULL;
+ break;
+ }
+ }
+
+ /* must not update i_size! */
+ ret = block_commit_write(page, block_start + 1,
+ block_start + 1);
+ if (ret < 0)
+ mlog_errno(ret);
+ else
+ ret = 0;
+ }
if (handle)
ocfs2_commit_trans(OCFS2_SB(inode->i_sb), handle);
+
out_unlock:
unlock_page(page);
page_cache_release(page);
@@ -790,18 +838,21 @@ static int ocfs2_zero_extend(struct inode *inode,
u64 zero_to_size)
{
int ret = 0;
- u64 start_off;
+ u64 start_off, next_off;
struct super_block *sb = inode->i_sb;
start_off = ocfs2_align_bytes_to_blocks(sb, i_size_read(inode));
while (start_off < zero_to_size) {
- ret = ocfs2_write_zero_page(inode, start_off);
+ next_off = (start_off & PAGE_CACHE_MASK) + PAGE_CACHE_SIZE;
+ if (next_off > zero_to_size)
+ next_off = zero_to_size;
+ ret = ocfs2_write_zero_page(inode, start_off, next_off);
if (ret < 0) {
mlog_errno(ret);
goto out;
}
- start_off += sb->s_blocksize;
+ start_off = next_off;
/*
* Very large extends have the potential to lock up
--
1.7.1
This is version 3 of the ocfs2 tail zeroing fixes. This version
has some major changes. Tao correctly pointed out that we can have
multiple extents past i_size due to unwritten extents. I've reworked
the zeroing code to walk them all. Since I had to do that, and I had to
handle refcounted extents, I ended up fixing a refcount bug with
non-sparse extents.
There are now three patches. The first changes our zeroing
code to go page-by-page at the high level. The second actually changes
the zeroing code. The final patch, limiting zeroing to the end of a
write, is unchanged from v2.
Joel
ocfs2's allocation unit is the cluster. This can be larger than a block
or even a memory page. This means that a file may have many blocks in
its last extent that are beyond the block containing i_size. There also
may be more unwritten extents after that.
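As a rough illustration of the sizes involved, the sketch below counts the blocks of the last cluster that lie beyond the block at byte offset i_size. The function name is hypothetical, and the sizes are passed as bit shifts purely for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: blocks in the cluster holding byte offset i_size
 * that come after the block holding that offset.
 */
static uint32_t sk_tail_blocks(uint64_t i_size, unsigned blkbits,
			       unsigned clusterbits)
{
	uint64_t blocks_per_cluster, blk, last_blk;

	assert(clusterbits >= blkbits);

	blocks_per_cluster = 1ULL << (clusterbits - blkbits);
	blk = i_size >> blkbits;
	/* last block of the cluster that contains blk */
	last_blk = blk | (blocks_per_cluster - 1);
	return (uint32_t)(last_blk - blk);
}
```

With 4KB blocks and a 1MB cluster, an i_size of 5000 leaves 254 such blocks; when the cluster is the same size as a block there are none.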
When ocfs2 grows a file, it zeros the entire cluster in order to ensure
future i_size growth will see cleared blocks. Unfortunately,
block_write_full_page() drops the pages past i_size. This means that
ocfs2 is actually leaking garbage data into the tail end of that last
cluster. This is a bug.
We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
when a write or truncate is past i_size. They will use
ocfs2_zero_extend() to ensure the data is properly zeroed.
Older versions of ocfs2_zero_extend() simply zeroed every block between
i_size and the zeroing position. This presumes three things:
1) There is allocation for all of these blocks.
2) The extents are not unwritten.
3) The extents are not refcounted.
(1) and (2) hold true for non-sparse filesystems, which used to be the
only users of ocfs2_zero_extend(). (3) is another bug.
Since we're now using ocfs2_zero_extend() for sparse filesystems as
well, we teach ocfs2_zero_extend() to check every extent between
i_size and the zeroing position. If the extent is unwritten, it is
ignored. If it is refcounted, it is CoWed. Then it is zeroed.
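The walk can be sketched over a mock extent list as follows. This is an illustration only, not the code in this patch: the sk_ types and flags are hypothetical, and the list is assumed to be sorted and to cover the range with holes listed explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_UNWRITTEN  0x1	/* hypothetical flag values */
#define SK_REFCOUNTED 0x2

struct sk_extent {
	uint32_t cpos;		/* first logical cluster */
	uint32_t clusters;	/* length in clusters */
	int allocated;		/* 0 for a hole */
	unsigned flags;
};

struct sk_range {
	uint32_t start, end;	/* clusters, end exclusive */
	int needs_cow;
};

/*
 * Find the next run of written extents in [from, last). Holes and
 * unwritten extents are skipped; adjacent written extents are merged.
 * Returns 1 and fills *out if a run was found, 0 otherwise.
 */
static int sk_next_range(const struct sk_extent *ext, size_t n,
			 uint32_t from, uint32_t last, struct sk_range *out)
{
	size_t i;
	int found = 0;

	assert(from < last);
	out->needs_cow = 0;

	for (i = 0; i < n; i++) {
		uint32_t e_start = ext[i].cpos;
		uint32_t e_end = e_start + ext[i].clusters;
		int zeroable = ext[i].allocated &&
			       !(ext[i].flags & SK_UNWRITTEN);

		if (e_end <= from)
			continue;	/* entirely before the range */
		if (e_start >= last)
			break;		/* entirely after the range */
		if (!zeroable) {
			if (found)
				break;	/* the current run ends here */
			continue;	/* still looking for a run */
		}
		if (!found) {
			out->start = e_start > from ? e_start : from;
			found = 1;
		}
		out->end = e_end < last ? e_end : last;
		if (ext[i].flags & SK_REFCOUNTED)
			out->needs_cow = 1;
	}
	return found;
}
```

A caller would loop, passing the previous end as the next from, CoW the run when needs_cow is set, and then zero it.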
Signed-off-by: Joel Becker <[email protected]>
---
fs/ocfs2/aops.c | 30 ++++----
fs/ocfs2/file.c | 198 ++++++++++++++++++++++++++++++++++++++--------
fs/ocfs2/file.h | 6 +-
fs/ocfs2/quota_global.c | 2 +-
fs/ocfs2/quota_local.c | 4 +-
fs/ocfs2/refcounttree.c | 6 ++
6 files changed, 192 insertions(+), 54 deletions(-)
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 9a5c931..8d6dc3f 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -196,15 +196,14 @@ int ocfs2_get_block(struct inode *inode, sector_t iblock,
dump_stack();
goto bail;
}
-
- past_eof = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
- mlog(0, "Inode %lu, past_eof = %llu\n", inode->i_ino,
- (unsigned long long)past_eof);
-
- if (create && (iblock >= past_eof))
- set_buffer_new(bh_result);
}
+ past_eof = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
+ mlog(0, "Inode %lu, past_eof = %llu\n", inode->i_ino,
+ (unsigned long long)past_eof);
+ if (create && (iblock >= past_eof))
+ set_buffer_new(bh_result);
+
bail:
if (err < 0)
err = -EIO;
@@ -1590,21 +1589,20 @@ out:
* write path can treat it as an non-allocating write, which has no
* special case code for sparse/nonsparse files.
*/
-static int ocfs2_expand_nonsparse_inode(struct inode *inode, loff_t pos,
- unsigned len,
+static int ocfs2_expand_nonsparse_inode(struct inode *inode,
+ struct buffer_head *di_bh,
+ loff_t pos, unsigned len,
struct ocfs2_write_ctxt *wc)
{
int ret;
- struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
loff_t newsize = pos + len;
- if (ocfs2_sparse_alloc(osb))
- return 0;
+ BUG_ON(ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
if (newsize <= i_size_read(inode))
return 0;
- ret = ocfs2_extend_no_holes(inode, newsize, pos);
+ ret = ocfs2_extend_no_holes(inode, di_bh, newsize, pos);
if (ret)
mlog_errno(ret);
@@ -1649,7 +1647,11 @@ int ocfs2_write_begin_nolock(struct address_space *mapping,
}
}
- ret = ocfs2_expand_nonsparse_inode(inode, pos, len, wc);
+ if (ocfs2_sparse_alloc(osb))
+ ret = ocfs2_zero_extend(inode, di_bh, pos);
+ else
+ ret = ocfs2_expand_nonsparse_inode(inode, di_bh, pos, len,
+ wc);
if (ret) {
mlog_errno(ret);
goto out;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index a6e0eb6..1fdc45a 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -787,6 +787,11 @@ static int ocfs2_write_zero_page(struct inode *inode, u64 abs_from,
if (!zero_to)
zero_to = PAGE_CACHE_SIZE;
+ mlog(0,
+ "abs_from = %llu, abs_to = %llu, index = %lu, zero_from = %u, zero_to = %u\n",
+ (unsigned long long)abs_from, (unsigned long long)abs_to,
+ index, zero_from, zero_to);
+
/* We know that zero_from is block aligned */
for (block_start = zero_from;
(block_start < PAGE_CACHE_SIZE) && (block_start < zero_to);
@@ -834,25 +839,114 @@ out:
return ret;
}
-static int ocfs2_zero_extend(struct inode *inode,
- u64 zero_to_size)
+/*
+ * Find the next range to zero. We do this in terms of bytes because
+ * that's what ocfs2_zero_extend() wants, and it is dealing with the
+ * pagecache. We may return multiple extents.
+ *
+ * zero_start and zero_end are ocfs2_zero_extend()s current idea of what
+ * needs to be zeroed. range_start and range_end return the next zeroing
+ * range. A subsequent call should pass the previous range_end as its
+ * zero_start. If range_end is 0, there's nothing to do.
+ *
+ * Unwritten extents are skipped over. Refcounted extents are CoWd.
+ */
+static int ocfs2_zero_extend_get_range(struct inode *inode,
+ struct buffer_head *di_bh,
+ u64 zero_start, u64 zero_end,
+ u64 *range_start, u64 *range_end)
{
- int ret = 0;
- u64 start_off, next_off;
- struct super_block *sb = inode->i_sb;
+ int rc = 0, needs_cow = 0;
+ u32 p_cpos, zero_clusters = 0;
+ u32 zero_cpos =
+ zero_start >> OCFS2_SB(inode->i_sb)->s_clustersize_bits;
+ u32 last_cpos = ocfs2_clusters_for_bytes(inode->i_sb, zero_end);
+ unsigned int num_clusters = 0;
+ unsigned int ext_flags = 0;
- start_off = ocfs2_align_bytes_to_blocks(sb, i_size_read(inode));
- while (start_off < zero_to_size) {
- next_off = (start_off & PAGE_CACHE_MASK) + PAGE_CACHE_SIZE;
- if (next_off > zero_to_size)
- next_off = zero_to_size;
- ret = ocfs2_write_zero_page(inode, start_off, next_off);
- if (ret < 0) {
- mlog_errno(ret);
+ while (zero_cpos < last_cpos) {
+ rc = ocfs2_get_clusters(inode, zero_cpos, &p_cpos,
+ &num_clusters, &ext_flags);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+
+ if (p_cpos && !(ext_flags & OCFS2_EXT_UNWRITTEN)) {
+ zero_clusters = num_clusters;
+ if (ext_flags & OCFS2_EXT_REFCOUNTED)
+ needs_cow = 1;
+ break;
+ }
+
+ zero_cpos += num_clusters;
+ }
+ if (!zero_clusters) {
+ *range_end = 0;
+ goto out;
+ }
+
+ while ((zero_cpos + zero_clusters) < last_cpos) {
+ rc = ocfs2_get_clusters(inode, zero_cpos + zero_clusters,
+ &p_cpos, &num_clusters,
+ &ext_flags);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+
+ if (!p_cpos || (ext_flags & OCFS2_EXT_UNWRITTEN))
+ break;
+ if (ext_flags & OCFS2_EXT_REFCOUNTED)
+ needs_cow = 1;
+ zero_clusters += num_clusters;
+ }
+ if ((zero_cpos + zero_clusters) > last_cpos)
+ zero_clusters = last_cpos - zero_cpos;
+
+ if (needs_cow) {
+ rc = ocfs2_refcount_cow(inode, di_bh, zero_cpos, zero_clusters,
+ UINT_MAX);
+ if (rc) {
+ mlog_errno(rc);
goto out;
}
+ }
- start_off = next_off;
+ *range_start = ocfs2_clusters_to_bytes(inode->i_sb, zero_cpos);
+ *range_end = ocfs2_clusters_to_bytes(inode->i_sb,
+ zero_cpos + zero_clusters);
+
+out:
+ return rc;
+}
+
+/*
+ * Zero one range returned from ocfs2_zero_extend_get_range(). The caller
+ * has made sure that the entire range needs zeroing.
+ */
+static int ocfs2_zero_extend_range(struct inode *inode, u64 range_start,
+ u64 range_end)
+{
+ int rc = 0;
+ u64 next_pos;
+ u64 zero_pos = range_start;
+
+ mlog(0, "range_start = %llu, range_end = %llu\n",
+ (unsigned long long)range_start,
+ (unsigned long long)range_end);
+ BUG_ON(range_start >= range_end);
+
+ while (zero_pos < range_end) {
+ next_pos = (zero_pos & PAGE_CACHE_MASK) + PAGE_CACHE_SIZE;
+ if (next_pos > range_end)
+ next_pos = range_end;
+ rc = ocfs2_write_zero_page(inode, zero_pos, next_pos);
+ if (rc < 0) {
+ mlog_errno(rc);
+ break;
+ }
+ zero_pos = next_pos;
/*
* Very large extends have the potential to lock up
@@ -861,16 +955,60 @@ static int ocfs2_zero_extend(struct inode *inode,
cond_resched();
}
-out:
+ return rc;
+}
+
+int ocfs2_zero_extend(struct inode *inode, struct buffer_head *di_bh,
+ loff_t zero_to_size)
+{
+ int ret = 0;
+ u64 zero_start, range_start = 0, range_end = 0;
+ struct super_block *sb = inode->i_sb;
+
+ zero_start = ocfs2_align_bytes_to_blocks(sb, i_size_read(inode));
+ while (zero_start < zero_to_size) {
+ ret = ocfs2_zero_extend_get_range(inode, di_bh, zero_start,
+ zero_to_size,
+ &range_start,
+ &range_end);
+ if (ret) {
+ mlog_errno(ret);
+ break;
+ }
+ if (!range_end)
+ break;
+ /* Trim the ends */
+ if (range_start < zero_start)
+ range_start = zero_start;
+ if (range_end > zero_to_size)
+ range_end = zero_to_size;
+
+ ret = ocfs2_zero_extend_range(inode, range_start,
+ range_end);
+ if (ret) {
+ mlog_errno(ret);
+ break;
+ }
+ zero_start = range_end;
+ }
+
return ret;
}
-int ocfs2_extend_no_holes(struct inode *inode, u64 new_i_size, u64 zero_to)
+int ocfs2_extend_no_holes(struct inode *inode, struct buffer_head *di_bh,
+ u64 new_i_size, u64 zero_to)
{
int ret;
u32 clusters_to_add;
struct ocfs2_inode_info *oi = OCFS2_I(inode);
+ /*
+ * Only quota files call this without a bh, and they can't be
+ * refcounted.
+ */
+ BUG_ON(!di_bh && (oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+ BUG_ON(!di_bh && !(oi->ip_flags & OCFS2_INODE_SYSTEM_FILE));
+
clusters_to_add = ocfs2_clusters_for_bytes(inode->i_sb, new_i_size);
if (clusters_to_add < oi->ip_clusters)
clusters_to_add = 0;
@@ -891,7 +1029,7 @@ int ocfs2_extend_no_holes(struct inode *inode, u64 new_i_size, u64 zero_to)
* still need to zero the area between the old i_size and the
* new i_size.
*/
- ret = ocfs2_zero_extend(inode, zero_to);
+ ret = ocfs2_zero_extend(inode, di_bh, zero_to);
if (ret < 0)
mlog_errno(ret);
@@ -913,27 +1051,15 @@ static int ocfs2_extend_file(struct inode *inode,
goto out;
if (i_size_read(inode) == new_i_size)
- goto out;
+ goto out;
BUG_ON(new_i_size < i_size_read(inode));
/*
- * Fall through for converting inline data, even if the fs
- * supports sparse files.
- *
- * The check for inline data here is legal - nobody can add
- * the feature since we have i_mutex. We must check it again
- * after acquiring ip_alloc_sem though, as paths like mmap
- * might have raced us to converting the inode to extents.
- */
- if (!(oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
- && ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
- goto out_update_size;
-
- /*
* The alloc sem blocks people in read/write from reading our
* allocation until we're done changing it. We depend on
* i_mutex to block other extend/truncate calls while we're
- * here.
+ * here. We even have to hold it for sparse files because there
+ * might be some tail zeroing.
*/
down_write(&oi->ip_alloc_sem);
@@ -950,14 +1076,16 @@ static int ocfs2_extend_file(struct inode *inode,
ret = ocfs2_convert_inline_data_to_extents(inode, di_bh);
if (ret) {
up_write(&oi->ip_alloc_sem);
-
mlog_errno(ret);
goto out;
}
}
- if (!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
- ret = ocfs2_extend_no_holes(inode, new_i_size, new_i_size);
+ if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
+ ret = ocfs2_zero_extend(inode, di_bh, new_i_size);
+ else
+ ret = ocfs2_extend_no_holes(inode, di_bh, new_i_size,
+ new_i_size);
up_write(&oi->ip_alloc_sem);
diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
index d66cf4f..97bf761 100644
--- a/fs/ocfs2/file.h
+++ b/fs/ocfs2/file.h
@@ -54,8 +54,10 @@ int ocfs2_add_inode_data(struct ocfs2_super *osb,
int ocfs2_simple_size_update(struct inode *inode,
struct buffer_head *di_bh,
u64 new_i_size);
-int ocfs2_extend_no_holes(struct inode *inode, u64 new_i_size,
- u64 zero_to);
+int ocfs2_extend_no_holes(struct inode *inode, struct buffer_head *di_bh,
+ u64 new_i_size, u64 zero_to);
+int ocfs2_zero_extend(struct inode *inode, struct buffer_head *di_bh,
+ loff_t zero_to);
int ocfs2_setattr(struct dentry *dentry, struct iattr *attr);
int ocfs2_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index 2bb35fe..4607923 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -775,7 +775,7 @@ static int ocfs2_acquire_dquot(struct dquot *dquot)
* locking allocators ranks above a transaction start
*/
WARN_ON(journal_current_handle());
- status = ocfs2_extend_no_holes(gqinode,
+ status = ocfs2_extend_no_holes(gqinode, NULL,
gqinode->i_size + (need_alloc << sb->s_blocksize_bits),
gqinode->i_size);
if (status < 0)
diff --git a/fs/ocfs2/quota_local.c b/fs/ocfs2/quota_local.c
index 8bd70d4..dc78764 100644
--- a/fs/ocfs2/quota_local.c
+++ b/fs/ocfs2/quota_local.c
@@ -971,7 +971,7 @@ static struct ocfs2_quota_chunk *ocfs2_local_quota_add_chunk(
u64 p_blkno;
/* We are protected by dqio_sem so no locking needed */
- status = ocfs2_extend_no_holes(lqinode,
+ status = ocfs2_extend_no_holes(lqinode, NULL,
lqinode->i_size + 2 * sb->s_blocksize,
lqinode->i_size);
if (status < 0) {
@@ -1114,7 +1114,7 @@ static struct ocfs2_quota_chunk *ocfs2_extend_local_quota_file(
return ocfs2_local_quota_add_chunk(sb, type, offset);
/* We are protected by dqio_sem so no locking needed */
- status = ocfs2_extend_no_holes(lqinode,
+ status = ocfs2_extend_no_holes(lqinode, NULL,
lqinode->i_size + sb->s_blocksize,
lqinode->i_size);
if (status < 0) {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 4793f36..32949df 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4166,6 +4166,12 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
struct inode *inode = old_dentry->d_inode;
struct buffer_head *new_bh = NULL;
+ if (OCFS2_I(inode)->ip_flags & OCFS2_INODE_SYSTEM_FILE) {
+ ret = -EINVAL;
+ mlog_errno(ret);
+ goto out;
+ }
+
ret = filemap_fdatawrite(inode->i_mapping);
if (ret) {
mlog_errno(ret);
--
1.7.1
Hi Joel,
Joel Becker wrote:
> ocfs2_zero_extend() does its zeroing block by block, but it calls a
> function named ocfs2_write_zero_page(). Let's have
> ocfs2_write_zero_page() handle the page level. From
> ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
>
> Signed-off-by: Joel Becker <[email protected]>
> ---
> fs/ocfs2/aops.c | 30 --------------
> fs/ocfs2/file.c | 119 +++++++++++++++++++++++++++++++++++++++----------------
> 2 files changed, 85 insertions(+), 64 deletions(-)
>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 3623ca2..9a5c931 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -459,36 +459,6 @@ int walk_page_buffers( handle_t *handle,
> return ret;
> }
>
> -handle_t *ocfs2_start_walk_page_trans(struct inode *inode,
> - struct page *page,
> - unsigned from,
> - unsigned to)
> -{
> - struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> - handle_t *handle;
> - int ret = 0;
> -
> - handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
> - if (IS_ERR(handle)) {
> - ret = -ENOMEM;
> - mlog_errno(ret);
> - goto out;
> - }
> -
> - if (ocfs2_should_order_data(inode)) {
> - ret = ocfs2_jbd2_file_inode(handle, inode);
> - if (ret < 0)
> - mlog_errno(ret);
> - }
> -out:
> - if (ret) {
> - if (!IS_ERR(handle))
> - ocfs2_commit_trans(osb, handle);
> - handle = ERR_PTR(ret);
> - }
> - return handle;
> -}
> -
> static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block)
> {
> sector_t status;
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6a13ea6..a6e0eb6 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -724,28 +724,55 @@ leave:
> return status;
> }
>
> +/*
> + * While a write will already be ordering the data, a truncate will not.
> + * Thus, we need to explicitly order the zeroed pages.
> + */
> +static handle_t *ocfs2_zero_start_ordered_transaction(struct inode *inode)
> +{
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + handle_t *handle = NULL;
> + int ret = 0;
> +
> + if (ocfs2_should_order_data(inode))
>
This should be if (!ocfs2_should_order_data(inode)) I guess? ;)
> + goto out;
> +
> + handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
> + if (IS_ERR(handle)) {
> + ret = -ENOMEM;
> + mlog_errno(ret);
> + goto out;
> + }
> +
>
Regards,
Tao
On Wed, Jul 07, 2010 at 11:19:27PM +0800, Tao Ma wrote:
> >+static handle_t *ocfs2_zero_start_ordered_transaction(struct inode *inode)
> >+{
> >+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> >+ handle_t *handle = NULL;
> >+ int ret = 0;
> >+
> >+ if (ocfs2_should_order_data(inode))
> This should be if (!ocfs2_should_order_data(inode)) I guess? ;)
Of course it should. Fixed ;-)
Joel
--
"Too much walking shoes worn thin.
Too much trippin' and my soul's worn thin.
Time to catch a ride it leaves today
Her name is what it means.
Too much walking shoes worn thin."
Joel Becker
Consulting Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
Hi Joel,
On 07/07/2010 07:16 PM, Joel Becker wrote:
> ocfs2_zero_extend() does its zeroing block by block, but it calls a
> function named ocfs2_write_zero_page(). Let's have
> ocfs2_write_zero_page() handle the page level. From
> ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
>
> Signed-off-by: Joel Becker <[email protected]>
> ---
> fs/ocfs2/aops.c | 30 --------------
> fs/ocfs2/file.c | 119 +++++++++++++++++++++++++++++++++++++++----------------
> 2 files changed, 85 insertions(+), 64 deletions(-)
>
<snip>
> -static int ocfs2_write_zero_page(struct inode *inode,
> - u64 size)
> +static int ocfs2_write_zero_page(struct inode *inode, u64 abs_from,
> + u64 abs_to)
> {
> struct address_space *mapping = inode->i_mapping;
> struct page *page;
> - unsigned long index;
> - unsigned int offset;
> + unsigned long index = abs_from >> PAGE_CACHE_SHIFT;
> handle_t *handle = NULL;
> int ret;
> + unsigned zero_from, zero_to, block_start, block_end;
>
> - offset = (size & (PAGE_CACHE_SIZE-1)); /* Within page */
> - /* ugh. in prepare/commit_write, if from==to==start of block, we
> - ** skip the prepare. make sure we never send an offset for the start
> - ** of a block
> - */
> - if ((offset & (inode->i_sb->s_blocksize - 1)) == 0) {
> - offset++;
> - }
> - index = size >> PAGE_CACHE_SHIFT;
> + BUG_ON(abs_from >= abs_to);
> + BUG_ON(abs_to > ((index + 1) << PAGE_CACHE_SHIFT));
Sorry for not noticing this last night. This can't work: index is an
unsigned long, so on 32-bit the shift will overflow and trigger the
BUG_ON. I hit a similar bug in the reflink test. See commit d622b89.
> + BUG_ON(abs_from & (inode->i_blkbits - 1));
>
> page = grab_cache_page(mapping, index);
> if (!page) {
> @@ -754,31 +781,52 @@ static int ocfs2_write_zero_page(struct inode *inode,
> goto out;
> }
>
> - ret = ocfs2_prepare_write_nolock(inode, page, offset, offset);
> - if (ret < 0) {
> - mlog_errno(ret);
> - goto out_unlock;
> - }
> + /* Get the offsets within the page that we want to zero */
> + zero_from = abs_from & (PAGE_CACHE_SIZE - 1);
> + zero_to = abs_to & (PAGE_CACHE_SIZE - 1);
> + if (!zero_to)
> + zero_to = PAGE_CACHE_SIZE;
>
> - if (ocfs2_should_order_data(inode)) {
> - handle = ocfs2_start_walk_page_trans(inode, page, offset,
> - offset);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - handle = NULL;
> + /* We know that zero_from is block aligned */
> + for (block_start = zero_from;
> + (block_start < PAGE_CACHE_SIZE) && (block_start < zero_to);
> + block_start = block_end) {
Do we really need to check block_start < PAGE_CACHE_SIZE? I think
checking block_start < zero_to is enough, since you have already limited
zero_to to PAGE_CACHE_SIZE. What's more, it looks more natural (see
below), doesn't it?
for (block_start = zero_from; block_start < zero_to; block_start =
block_end) {
Regards,
Tao
On Thu, Jul 08, 2010 at 11:44:59AM +0800, Tao Ma wrote:
> On 07/07/2010 07:16 PM, Joel Becker wrote:
> >+ BUG_ON(abs_to > ((index + 1) << PAGE_CACHE_SHIFT));
> Sorry for not noticing this last night. This can't work: index is an
> unsigned long, so on 32-bit the shift will overflow and trigger the
> BUG_ON. I hit a similar bug in the reflink test. See commit d622b89.
Good catch. It's obvious, now that you mention it.
> >+ /* We know that zero_from is block aligned */
> >+ for (block_start = zero_from;
> >+ (block_start < PAGE_CACHE_SIZE) && (block_start < zero_to);
> >+ block_start = block_end) {
> Do we really need to check block_start < PAGE_CACHE_SIZE? I think
> checking block_start < zero_to is enough, since you have already
> limited zero_to to PAGE_CACHE_SIZE. What's more, it looks more
> natural (see below), doesn't it?
>
> for (block_start = zero_from; block_start < zero_to; block_start =
> block_end) {
Yup. The code looked different halfway through, so I didn't
realize I was checking the same thing twice.
Joel
--
"Depend on the rabbit's foot if you will, but remember, it didn't
help the rabbit."
- R. E. Shay
Joel Becker
Consulting Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Wed, Jul 07, 2010 at 04:16:04AM -0700, Joel Becker wrote:
> This is version 3 of the ocfs2 tail zeroing fixes. This version
> has some major changes. Tao correctly pointed out that we can have
> multiple extents past i_size due to unwritten extents. I've reworked
> the zeroing code to walk them all. Since I had to do that, and I had to
> handle refcounted extents, I ended up fixing a refcount bug with
> non-sparse extents.
> There are now three patches. The first changes our zeroing
> code to go page-by-page at the high level. The second actually changes
> the zeroing code. The final patch, limiting zeroing to the end of a
> write, is unchanged from v2.
Version 4 of these fixes is now in the fixes and linux-next
branches of ocfs2.git. I'm just appending the diff to this file rather
than resending all the patches. They've passed the first round of heavy
testing from our testers, and we're going to keep pounding on them as we
get towards 2.6.35.
Linus, I'm going to let all of the ocfs2 fixes stew in
linux-next for a few days before I send the pull request. Figure it by
the end of the week.
Joel
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 9b3381a..356e976 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1107,6 +1107,7 @@ static int ocfs2_grab_pages_for_write(struct address_space *mapping,
int ret = 0, i;
unsigned long start, target_index, end_index, index;
struct inode *inode = mapping->host;
+ loff_t last_byte;
target_index = user_pos >> PAGE_CACHE_SHIFT;
@@ -1120,8 +1121,14 @@ static int ocfs2_grab_pages_for_write(struct address_space *mapping,
if (new) {
wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
- /* This is the index *past* the write */
- end_index = ((user_pos + user_len - 1) >> PAGE_CACHE_SHIFT) + 1;
+ /*
+ * We need the index *past* the last page we could possibly
+ * touch. This is the page past the end of the write or
+ * i_size, whichever is greater.
+ */
+ last_byte = max(user_pos + user_len, i_size_read(inode));
+ BUG_ON(last_byte < 1);
+ end_index = ((last_byte - 1) >> PAGE_CACHE_SHIFT) + 1;
if ((start + wc->w_num_pages) > end_index)
wc->w_num_pages = end_index - start;
} else {
@@ -1619,6 +1626,18 @@ static int ocfs2_expand_nonsparse_inode(struct inode *inode,
return ret;
}
+static int ocfs2_zero_tail(struct inode *inode, struct buffer_head *di_bh,
+ loff_t pos)
+{
+ int ret = 0;
+
+ BUG_ON(!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)));
+ if (pos > i_size_read(inode))
+ ret = ocfs2_zero_extend(inode, di_bh, pos);
+
+ return ret;
+}
+
int ocfs2_write_begin_nolock(struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
@@ -1655,7 +1674,7 @@ int ocfs2_write_begin_nolock(struct address_space *mapping,
}
if (ocfs2_sparse_alloc(osb))
- ret = ocfs2_zero_extend(inode, di_bh, pos);
+ ret = ocfs2_zero_tail(inode, di_bh, pos);
else
ret = ocfs2_expand_nonsparse_inode(inode, di_bh, pos, len,
wc);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 1fdc45a..ac15911 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -734,7 +734,7 @@ static handle_t *ocfs2_zero_start_ordered_transaction(struct inode *inode)
handle_t *handle = NULL;
int ret = 0;
- if (ocfs2_should_order_data(inode))
+ if (!ocfs2_should_order_data(inode))
goto out;
handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
@@ -771,7 +771,7 @@ static int ocfs2_write_zero_page(struct inode *inode, u64 abs_from,
unsigned zero_from, zero_to, block_start, block_end;
BUG_ON(abs_from >= abs_to);
- BUG_ON(abs_to > ((index + 1) << PAGE_CACHE_SHIFT));
+ BUG_ON(abs_to > (((u64)index + 1) << PAGE_CACHE_SHIFT));
BUG_ON(abs_from & (inode->i_blkbits - 1));
page = grab_cache_page(mapping, index);
@@ -793,8 +793,7 @@ static int ocfs2_write_zero_page(struct inode *inode, u64 abs_from,
index, zero_from, zero_to);
/* We know that zero_from is block aligned */
- for (block_start = zero_from;
- (block_start < PAGE_CACHE_SIZE) && (block_start < zero_to);
+ for (block_start = zero_from; block_start < zero_to;
block_start = block_end) {
block_end = block_start + (1 << inode->i_blkbits);
@@ -966,6 +965,9 @@ int ocfs2_zero_extend(struct inode *inode, struct buffer_head *di_bh,
struct super_block *sb = inode->i_sb;
zero_start = ocfs2_align_bytes_to_blocks(sb, i_size_read(inode));
+ mlog(0, "zero_start %llu for i_size %llu\n",
+ (unsigned long long)zero_start,
+ (unsigned long long)i_size_read(inode));
while (zero_start < zero_to_size) {
ret = ocfs2_zero_extend_get_range(inode, di_bh, zero_start,
zero_to_size,
--
"The only way to get rid of a temptation is to yield to it."
- Oscar Wilde
Joel Becker
Consulting Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127