2008-07-18 12:11:38

by Aneesh Kumar K.V

[permalink] [raw]
Subject: e2fsprogs and blocks outside i_size

Hi Ted,

With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc
space and if we hit ENOSPC when trying to insert the extent,
we actually zero out the extent. That means we can have blocks
outside i_size for an inode. I guess e2fsck currently doesn't
handle this. Or should we fix kernel to update i_size to the
right value if we do a zero out of the extent ?

With fallocate if the prealloc area is small we also aggressively zeroout.
This was needed so that a random write pattern on falloc area doesn't
results in too many extents. That also can result in the above
error on fsck.

-aneesh


2008-07-18 12:37:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: e2fsprogs and blocks outside i_size

On Fri, Jul 18, 2008 at 05:41:30PM +0530, Aneesh Kumar K.V wrote:
> Hi Ted,
>
> With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc
> space and if we hit ENOSPC when trying to insert the extent,
> we actually zero out the extent. That means we can have blocks
> outside i_size for an inode. I guess e2fsck currently doesn't
> handle this. Or should we fix kernel to update i_size to the
> right value if we do a zero out of the extent ?
>
> With fallocate if the prealloc area is small we also aggressively zeroout.
> This was needed so that a random write pattern on falloc area doesn't
> results in too many extents. That also can result in the above
> error on fsck.

It would seem to me that e2fsck should be fixed to not complain about
blocks outside of i_size, *if* the blocks in question are marked as
being unitialized. It also seems to me that updating i_size when the
extent is zero'ed out is also not the right thing to do. Some
applications depend on i_size.

In the case where you hit ENOSPC when you need to grow the tree to
insert an extent, that's a tough one. One approach would be to simply
return an error in that case, although that's unfortunate since in
addition to people using fallocate() to try to get better block
layout, some people want to preallocate blocks to guarantee that the
write will not fail! For people whose reason for using fallocate()
are in the second camp, we can satisfy their need simply by stealing a
block from the end of the preallocate segment and use it to grow the
extent tree. I'm inclined to think the second solution might be the
better one, but I'm sure that will be controversial.

For your second case, where we aggressively zero out blocks, one of
the reasons why we have to do that is because the kernel isn't
coalescing extents aggressively. My inclination here is to *not*
aggressively zero out blocks outside outside of i_size, and to split
the extent in that case --- and then to make sure our extent
management code is better about coalescing extents, so that when we
convert an extent from INIT to UNINIT, we also check the previous and
next extent, and if we can combine current extent with an adjacent
extent, we should do so.

I suppose the other hack we could do is have e2fsck check the blocks
that are outside of i_size, and if they are all zero and extents are
involved, that it's a case of pre-allocated blocks that needed to be
zero'ed for some reason, as opposed to a corrupted i_size. That seems
to be a really gross hack, though.

- Ted

2008-07-21 05:08:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: e2fsprogs and blocks outside i_size

On Jul 18, 2008 08:37 -0400, Theodore Ts'o wrote:
> On Fri, Jul 18, 2008 at 05:41:30PM +0530, Aneesh Kumar K.V wrote:
> > With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc
> > space and if we hit ENOSPC when trying to insert the extent,
> > we actually zero out the extent. That means we can have blocks
> > outside i_size for an inode.

To clarify, doesn't FALLOC_FL_KEEP_SIZE put the extent beyond i_size,
regardless of whether the ENOSPC problem is hit?

> > I guess e2fsck currently doesn't handle this. Or should we fix kernel
> > to update i_size to the right value if we do a zero out of the extent ?
> >
> > With fallocate if the prealloc area is small we also aggressively zeroout.
> > This was needed so that a random write pattern on falloc area doesn't
> > results in too many extents. That also can result in the above
> > error on fsck.
>
> It would seem to me that e2fsck should be fixed to not complain about
> blocks outside of i_size, *if* the blocks in question are marked as
> being unitialized.

Yes, I think that is the right approach.

> It also seems to me that updating i_size when the
> extent is zero'ed out is also not the right thing to do. Some
> applications depend on i_size.

Yes, this was my thought exactly. Just because the fallocated extent
is zeroed out doesn't mean the file size can suddenly change. This
would appear as file corruption or "data loss" to many applications.

> In the case where you hit ENOSPC when you need to grow the tree to
> insert an extent, that's a tough one. One approach would be to simply

I think you misunderstand Aneesh's comments - the ext4 fallocate code
already handles all of these cases. If ENOSPC is hit during the
extent split the error recovery will zero out the whole uninitialized
extent. Slow, but effective.

> For your second case, where we aggressively zero out blocks, one of
> the reasons why we have to do that is because the kernel isn't
> coalescing extents aggressively. My inclination here is to *not*
> aggressively zero out blocks outside outside of i_size, and to split
> the extent in that case

That is only partly true. If an application is writing every other
block in a 64MB fallocated extent we don't want to create 64MB/4kB
separate extents. There is no guarantee that the application will
fill in the rest of the unwritten extents and allow them to merge.

Also, there is likely a net performance improvement by writing every
block (half with data, the other half with zeroes) compared to doing
write + seek over the entire file.

> I suppose the other hack we could do is have e2fsck check the blocks
> that are outside of i_size, and if they are all zero and extents are
> involved, that it's a case of pre-allocated blocks that needed to be
> zero'ed for some reason, as opposed to a corrupted i_size. That seems
> to be a really gross hack, though.

Yuck, with the added problem that there is no guarantee that these
data blocks ARE all zero.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-07-21 05:59:26

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: e2fsprogs and blocks outside i_size

On Sun, Jul 20, 2008 at 11:08:25PM -0600, Andreas Dilger wrote:
> On Jul 18, 2008 08:37 -0400, Theodore Ts'o wrote:
> > On Fri, Jul 18, 2008 at 05:41:30PM +0530, Aneesh Kumar K.V wrote:
> > > With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc
> > > space and if we hit ENOSPC when trying to insert the extent,
> > > we actually zero out the extent. That means we can have blocks
> > > outside i_size for an inode.
>
> To clarify, doesn't FALLOC_FL_KEEP_SIZE put the extent beyond i_size,
> regardless of whether the ENOSPC problem is hit?

But the extent in that case would be marked as uninit using the extent
len. So e2fsck can check for that.


>
> > > I guess e2fsck currently doesn't handle this. Or should we fix kernel
> > > to update i_size to the right value if we do a zero out of the extent ?
> > >
> > > With fallocate if the prealloc area is small we also aggressively zeroout.
> > > This was needed so that a random write pattern on falloc area doesn't
> > > results in too many extents. That also can result in the above
> > > error on fsck.
> >
> > It would seem to me that e2fsck should be fixed to not complain about
> > blocks outside of i_size, *if* the blocks in question are marked as
> > being unitialized.
>
> Yes, I think that is the right approach.



That is fine for extents marked uninit. But when we zero out we zero out
the full extent. So that means a write of few bytes can result in blocks
being zeroed out outside i_size. My question was how e2fsck can handle
this. Because the extent will no more be marked as uninit and there
would be blocks outside i_size all carrying zero.


>
> > I suppose the other hack we could do is have e2fsck check the blocks
> > that are outside of i_size, and if they are all zero and extents are
> > involved, that it's a case of pre-allocated blocks that needed to be
> > zero'ed for some reason, as opposed to a corrupted i_size. That seems
> > to be a really gross hack, though.
>
> Yuck, with the added problem that there is no guarantee that these
> data blocks ARE all zero.
>

-aneesh

2008-07-21 12:34:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: e2fsprogs and blocks outside i_size

On Mon, Jul 21, 2008 at 11:29:18AM +0530, Aneesh Kumar K.V wrote:
>
> That is fine for extents marked uninit. But when we zero out we zero out
> the full extent. So that means a write of few bytes can result in blocks
> being zeroed out outside i_size. My question was how e2fsck can handle
> this. Because the extent will no more be marked as uninit and there
> would be blocks outside i_size all carrying zero.
>

Wel, as I said originally, we have four choices, only two of which are
tenable:

1) Don't change i_size and leave e2fsck confused about whether i_size
is confused or not; the next time e2fsck runs it can either fix it and
change i_size, confusing applications that depend on i_size, or not
fix it and in the case of a corrupted i_size, leave valid data
inaccessible or do the hack to which Andreas reacted, "Yuck", and
which Annesh quoted and I assume agree. (i.e., checking the data
blocks to see if they are non-zero, and electing to to risk confusing
the application in the case where they are non-zero). This is the
current case.

2) Change i_size and always confuse applications that depend on i_size
carrying some semantic meaning.

3) Don't aggressively zero-out (as it presents us with these two
untenable options) and try to explit the extent instead. If the block
application fails, return ENOSPC.

4) #3, except if the block allocation fails, try to steal a block that
had been previously preallocated for some other logical block in that
inode.


Are there any other choices? I think #3 and #4 are the only options
and #3 is certainly the simplest to implement, but it could lead to
confusing results since the filesystem would be returning ENOSPC even
though 'df' reports that space is available --- and some applications
preallocate in order to guarantee no write failures.

#4 is more complex, but it means that file might be more fragmented at
the end, which would be bad for applications that depend on
fallocate() to provide a more contiugous file. (Although fallocate
never guaranteed perfect layout, just that it might provide a better
one.) It also means that at the end, a file write might end up
failing anyway, since we ended up stealing a block that was meant for
use as a data bock.

The one other thing I would note is that at least for non-root users,
the reserved blocks will help save us most of the time, except for
when users explicitly set the reserved blocks down to zero. But maybe
this is one place where we just document that reserved blocks serve
yet another purpose, which is to get us out of this mess, and that
applications which depend on preallocated writes not failing need to
either (a) not use insane write patterns, or (b) not run as root and
to save a modest number of reserved blocks for this situation (or
otherwise leave some "slack space" in the filesystem.)

- Ted


2008-07-21 23:32:42

by Andreas Dilger

[permalink] [raw]
Subject: Re: e2fsprogs and blocks outside i_size

On Jul 21, 2008 08:34 -0400, Theodore Ts'o wrote:
> Wel, as I said originally, we have four choices, only two of which are
> tenable:
>
> 1) Don't change i_size and leave e2fsck confused about whether i_size
> is confused or not; the next time e2fsck runs it can either fix it and
> change i_size, confusing applications that depend on i_size, or not
> fix it and in the case of a corrupted i_size, leave valid data
> inaccessible or do the hack to which Andreas reacted, "Yuck", and
> which Annesh quoted and I assume agree. (i.e., checking the data
> blocks to see if they are non-zero, and electing to to risk confusing
> the application in the case where they are non-zero). This is the
> current case.
>
> 2) Change i_size and always confuse applications that depend on i_size
> carrying some semantic meaning.
>
> 3) Don't aggressively zero-out (as it presents us with these two
> untenable options) and try to explit the extent instead. If the block
> application fails, return ENOSPC.
>
> 4) #3, except if the block allocation fails, try to steal a block that
> had been previously preallocated for some other logical block in that
> inode.

5) Add a flag to the inode which means "blocks beyond i_size" if fallocate()
is called with "KEEP_SIZE" and allocation is actually beyond i_size
and not just filling a hole) so that e2fsck won't "fix" the size,
but allows the extent to be uninitialized. The flag is cleared
(by kernel and/or e2fsck) if the size is extended to the last block.

To avoid consuming our precious inode flags, we might consider to re-use
the EXT3_DIRSYNC_FL or EXT3_TOPDIR_FL for this purpose, since the are
definitely only having meaning for directories. I guess the question
is whether we would need this for directories, but I don't think so as
we could always just add empty directory blocks (at the expense of
having to scan them later).

> The one other thing I would note is that at least for non-root users,
> the reserved blocks will help save us most of the time, except for
> when users explicitly set the reserved blocks down to zero.

Would the index block be allocated from the reserved space tough?
This is also a good idea, but I'm not sure if that is what happens.
I guess the "allocate index block" code path needs to check for
"(uid == s_reserved_uid || is_metadata)"?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.