LinuxLists.cc - Re: [PATCH 4/5] ext4: fallocate support in ext4

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On May 03, 2007 21:31 -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[email protected]> wrote:
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
>
> This description is rather thin. What is the filesystem's actual behaviour
> here? If the file is using extents then the implementation will do
> <something>. If the file is using bitmaps then we will do <something else>.
>
> But what? Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > +{
> > + handle_t *handle;
> > + ext4_fsblk_t block, max_blocks;
> > + int ret, ret2, nblocks = 0, retries = 0;
> > + struct buffer_head map_bh;
> > + unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > + /* Currently supporting (pre)allocate mode _only_ */
> > + if (mode != FA_ALLOCATE)
> > + return -EOPNOTSUPP;
> > +
> > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > + return -ENOTTY;
>
> So we don't implement fallocate on bitmap-based files! Well that's huge
> news. The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.
>
> Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

> > + block = offset >> blkbits;
> > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > + - block;
> > + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > + mutex_unlock(&EXT4_I(inode)->truncate_mutex);
>
> Now I'm mystified. Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks. That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.

> > + ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > + BUG_ON(!ret);
>
> BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-07 20:58:50

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger <[email protected]> wrote:

> > > + block = offset >> blkbits;
> > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > > + - block;
> > > + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> >
> > Now I'm mystified. Given that we're allocating an arbitrary amount of disk
> > space, and that this disk space will require an arbitrary amount of
> > metadata, how can we work out how much journal space we'll be needing
> > without at least looking at `len'?
>
> Good question.
>
> The uninitialized extent can cover up to 128MB with a single entry.
> If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> function returns the maximum number of extents needed to insert a leaf,
> including splitting all of the index blocks. That would allow up to 43GB
> (340 extents/block * 128MB) to be preallocated, but it still needs to take
> the size of the preallocation into account (adding 3 blocks per 43GB - a
> leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space. If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons. If you used fallocate to reserve 1GB of disk
and fallocate() "succeeded" and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop:

while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}

Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space? We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.

Does the proposed implementation handle quotas correctly, btw? Has that
been tested?

Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only. I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.

2007-05-07 22:21:13

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On May 07, 2007 13:58 -0700, Andrew Morton wrote:
> Final point: it's fairly disappointing that the present implementation is
> ext4-only, and extent-only. I do think we should be aiming at an ext4
> bitmap-based implementation and an ext3 implementation.

Actually, this is a non-issue. The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing. For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error (e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-07 22:39:45

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger <[email protected]> wrote:

> On May 07, 2007 13:58 -0700, Andrew Morton wrote:
> > Final point: it's fairly disappointing that the present implementation is
> > ext4-only, and extent-only. I do think we should be aiming at an ext4
> > bitmap-based implementation and an ext3 implementation.
>
> Actually, this is a non-issue. The reason that it is handled for extent-only
> is that this is the only way to allocate space in the filesystem without
> doing the explicit zeroing. For other filesystems (including ext3 and
> ext4 with block-mapped files) the filesystem should return an error (e.g.
> -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV. The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem. Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.

2007-05-07 23:02:43

by Jeff Garzik

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

Andreas Dilger wrote:
> On May 07, 2007 13:58 -0700, Andrew Morton wrote:
>> Final point: it's fairly disappointing that the present implementation is
>> ext4-only, and extent-only. I do think we should be aiming at an ext4
>> bitmap-based implementation and an ext3 implementation.
>
> Actually, this is a non-issue. The reason that it is handled for extent-only
> is that this is the only way to allocate space in the filesystem without
> doing the explicit zeroing. For other filesystems (including ext3 and

Precisely /how/ do you avoid the zeroing issue, for extents?

If I posix_fallocate() 20GB on ext4, it damn well better be zeroed,
otherwise the implementation is broken.

Jeff

2007-05-07 23:15:00

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > Actually, this is a non-issue. The reason that it is handled for extent-only
> > is that this is the only way to allocate space in the filesystem without
> > doing the explicit zeroing. For other filesystems (including ext3 and
> > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
>
> It can be a bit suboptimal from the layout POV. The reservations code will
> largely save us here, but kernel support might make it a bit better.

Actually, the reservations code won't matter, since glibc will fall
back to its current behavior, which is it will do the preallocation by
explicitly writing zeros to the file. This wlil result in the same
layout as if we had done the persistent preallocation, but of course
it will mean the posix_fallocate() could potentially take a long time
if you're a PVR and you're reserving a gig or two for a two hour movie
at high quality. That seems suboptimal, granted, and ideally the
application should be warned about this before it calls
posix_fallocate(). On the other hand, it's what happens today, all
the time, so applications won't be too badly surprised.

If we think applications programmers badly need to know in advance if
posix_fallocate() will be fast or slow, probably the right thing is to
define a new fpathconf() configuration option so they can query to see
whether a particular file will support a fast posix_fallocate(). I'm
not 100% convinced such complexity is really needed, but I'm willing
to be convinced.... what do folks think?

- Ted

2007-05-07 23:32:01

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, 7 May 2007 19:14:42 -0400
Theodore Tso <[email protected]> wrote:

> On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > Actually, this is a non-issue. The reason that it is handled for extent-only
> > > is that this is the only way to allocate space in the filesystem without
> > > doing the explicit zeroing. For other filesystems (including ext3 and
> > > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
> >
> > It can be a bit suboptimal from the layout POV. The reservations code will
> > largely save us here, but kernel support might make it a bit better.
>
> Actually, the reservations code won't matter, since glibc will fall
> back to its current behavior, which is it will do the preallocation by
> explicitly writing zeros to the file.

No! Reservations code is *critical* here. Without reservations, we get
disastrously-bad layout if two processes were running a large fallocate()
at the same time. (This is an SMP-only problem, btw: on UP the timeslice
lengths save us).

My point is that even though reservations save us, we could do even-better
in-kernel.

But then, a smart application would bypass the glibc() fallocate()
implementation and would tune the reservation window size and would use
direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).

> This wlil result in the same
> layout as if we had done the persistent preallocation, but of course
> it will mean the posix_fallocate() could potentially take a long time
> if you're a PVR and you're reserving a gig or two for a two hour movie
> at high quality. That seems suboptimal, granted, and ideally the
> application should be warned about this before it calls
> posix_fallocate(). On the other hand, it's what happens today, all
> the time, so applications won't be too badly surprised.

A PVR implementor would take all this over and would do it themselves, for
sure.

> If we think applications programmers badly need to know in advance if
> posix_fallocate() will be fast or slow, probably the right thing is to
> define a new fpathconf() configuration option so they can query to see
> whether a particular file will support a fast posix_fallocate(). I'm
> not 100% convinced such complexity is really needed, but I'm willing
> to be convinced.... what do folks think?
>

An application could do sys_fallocate(one-byte) to work out whether it's
supported in-kernel, I guess.

2007-05-07 23:36:44

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >On May 07, 2007 13:58 -0700, Andrew Morton wrote:
> >>Final point: it's fairly disappointing that the present implementation is
> >>ext4-only, and extent-only. I do think we should be aiming at an ext4
> >>bitmap-based implementation and an ext3 implementation.
> >
> >Actually, this is a non-issue. The reason that it is handled for
> >extent-only
> >is that this is the only way to allocate space in the filesystem without
> >doing the explicit zeroing. For other filesystems (including ext3 and
>
> Precisely /how/ do you avoid the zeroing issue, for extents?
>
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed,
> otherwise the implementation is broken.

There is a bit in the extent structure which indicates that the extent
has not been initialized. When reading from a block where the extent
is marked as unitialized, ext4 returns zero's, to avoid returning the
uninitalized contents of the disk, which might contain someone else's
love letters, p0rn, or other information which we shouldn't leak out.
When writing to an extent which is uninitalized, we may potentially
have to split the extent into three extents in the worst case.

My understanding is that XFS uses a similar implementation; it's a
pretty obvious and standard way to implement allocated-but-not-initialized
extents.

We thought about supporting persistent preallocation for inodes using
indirect blocks, but it would require stealing a bit from each entry
in the indirect block, reducing the maximum size of the filesystem by
two (i.e., 2**31 blocks). It was decided it wasn't worth the
complexity, given the tradeoffs.

- Ted

2007-05-08 00:00:32

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 05:37:54 -0600
> Andreas Dilger <[email protected]> wrote:
>
> > > > + block = offset >> blkbits;
> > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > > > + - block;
> > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > >
> > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk
> > > space, and that this disk space will require an arbitrary amount of
> > > metadata, how can we work out how much journal space we'll be needing
> > > without at least looking at `len'?
> >
> > Good question.
> >
> > The uninitialized extent can cover up to 128MB with a single entry.
> > If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> > function returns the maximum number of extents needed to insert a leaf,
> > including splitting all of the index blocks. That would allow up to 43GB
> > (340 extents/block * 128MB) to be preallocated, but it still needs to take
> > the size of the preallocation into account (adding 3 blocks per 43GB - a
> > leaf block, a bitmap block and a group descriptor).
>
> I think the use of ext4_journal_extend() (as Amit has proposed) will help
> here, but it is not sufficient.
>
> Because under some circumstances, a journal_extend() failure could mean
> that we fail to allocate all the required disk space. If it is infrequent
> enough, that is acceptable when the caller is using fallocate() for
> performance reasons.
>
> But it is very much not acceptable if the caller is using fallocate() for
> space-reservation reasons. If you used fallocate to reserve 1GB of disk
> and fallocate() "succeeded" and you later get ENOSPC then you'd have a
> right to get a bit upset.
>
> So I think the ext3/4 fallocate() implementation will need to be
> implemented as a loop:
>
> while (len) {
> journal_start();
> len -= do_fallocate(len, ...);
> journal_stop();
> }
>
>

I agree. There is already a loop in Amit's current's patch to call
ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to
ask for in each journal_start()?
> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{
....

> + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> + mutex_unlock(&EXT4_I(inode)->truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases
we may end up allocating several chunk of blocks(extents) for this
single preallocation request when fs is fragmented (or part of
preallocation request is already fulfilled)

I think we should move this calculation inside the loop as well,and we
really do not need to grab the lock to calculate the credit if the @path
is always NULL, all the function does is mathmatics.

I can't think of any good way to estimate the total credits needed for
this whole preallocation request. Looked at ext4_get_block(), which is
used for DIO code to deal with large amount of block allocation. The
credit reservation is quite weak there too. The DIO_CREDIT is only
(EXT4_RESERVE_TRANS_BLOCKS + 32)

> + handle=ext4_journal_start(inode, credits +
> + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +retry:
> + ret = 0;
> + while (ret >= 0 && ret < max_blocks) {
> + block = block + ret;
> + max_blocks = max_blocks - ret;
> + ret = ext4_ext_get_blocks(handle, inode, block,
> + max_blocks, &map_bh,
> + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> + BUG_ON(!ret);
> + if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> + && ((block + ret) > (i_size_read(inode) << blkbits)))
> + nblocks = nblocks + ret;
> + }
> +
> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + goto retry;
> +
> Now the interesting question is: what do we do if we get halfway through
> this loop and then run out of space? We could leave the disk all filled up
> and then return failure to the caller, but that's pretty poor behaviour,
> IMO.
>
The current code handles earlier ENOSPC by three times retries. After
that if we still run out of space, then it's propably right to notify
the caller there isn't much space left.

We could extend the block reservation window size before the while loop
so we could get a lower chance to get more fragmented.

>
> Does the proposed implementation handle quotas correctly, btw? Has that
> been tested?
>
I think so. The ext4_ext_get_blocks() will end up calling
ext4_new_blocks() to do the real block allocation, quota is being
handled there, therefor is tested already.

Mingming

2007-05-08 00:16:21

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao <[email protected]> wrote:

> > + while (ret >= 0 && ret < max_blocks) {
> > + block = block + ret;
> > + max_blocks = max_blocks - ret;
> > + ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > + BUG_ON(!ret);
> > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > + && ((block + ret) > (i_size_read(inode) << blkbits)))
> > + nblocks = nblocks + ret;
> > + }
> > +
> > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > + goto retry;
> > +
> > Now the interesting question is: what do we do if we get halfway through
> > this loop and then run out of space? We could leave the disk all filled up
> > and then return failure to the caller, but that's pretty poor behaviour,
> > IMO.
> >
> The current code handles earlier ENOSPC by three times retries. After
> that if we still run out of space, then it's propably right to notify
> the caller there isn't much space left.
>
> We could extend the block reservation window size before the while loop
> so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again. The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.

2007-05-08 00:30:43

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 19:14:42 -0400
> Theodore Tso <[email protected]> wrote:
>
> > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > > Actually, this is a non-issue. The reason that it is handled for extent-only
> > > > is that this is the only way to allocate space in the filesystem without
> > > > doing the explicit zeroing. For other filesystems (including ext3 and
> > > > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
> > >
> > > It can be a bit suboptimal from the layout POV. The reservations code will
> > > largely save us here, but kernel support might make it a bit better.
> >
> > Actually, the reservations code won't matter, since glibc will fall
> > back to its current behavior, which is it will do the preallocation by
> > explicitly writing zeros to the file.
>
> No! Reservations code is *critical* here. Without reservations, we get
> disastrously-bad layout if two processes were running a large fallocate()
> at the same time. (This is an SMP-only problem, btw: on UP the timeslice
> lengths save us).
>
> My point is that even though reservations save us, we could do even-better
> in-kernel.
>

In this case, since the number of blocks to preallocate (eg. N=10GB) is
clear, we could improve the current reservation code, to allow callers
explicitly ask for a new window that have the minimum N free blocks for
the blocks-to-preallocated(rather than just have at least 1 free
blocks).

Before the ext4_fallocate() is called, the right reservation window size
is set with the flag to indicating "please spend time if needed to find
a window covers at least N free blocks".

So for ex4 block mapped files, later when glibc is doing allocation and
zeroing, the ext4 block-mapped allocator will knows to reserve the right
amount of free blocks before allocating and zeroing 10GB space.

I am not sure whether this worth the effort though.

> But then, a smart application would bypass the glibc() fallocate()
> implementation and would tune the reservation window size and would use
> direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
>
> > This wlil result in the same
> > layout as if we had done the persistent preallocation, but of course
> > it will mean the posix_fallocate() could potentially take a long time
> > if you're a PVR and you're reserving a gig or two for a two hour movie
> > at high quality. That seems suboptimal, granted, and ideally the
> > application should be warned about this before it calls
> > posix_fallocate(). On the other hand, it's what happens today, all
> > the time, so applications won't be too badly surprised.
>
> A PVR implementor would take all this over and would do it themselves, for
> sure.
>
> > If we think applications programmers badly need to know in advance if
> > posix_fallocate() will be fast or slow, probably the right thing is to
> > define a new fpathconf() configuration option so they can query to see
> > whether a particular file will support a fast posix_fallocate(). I'm
> > not 100% convinced such complexity is really needed, but I'm willing
> > to be convinced.... what do folks think?
> >
>
> An application could do sys_fallocate(one-byte) to work out whether it's
> supported in-kernel, I guess.
>

2007-05-08 00:41:56

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote:
> On Mon, 07 May 2007 17:00:24 -0700
> Mingming Cao <[email protected]> wrote:
>
> > > + while (ret >= 0 && ret < max_blocks) {
> > > + block = block + ret;
> > > + max_blocks = max_blocks - ret;
> > > + ret = ext4_ext_get_blocks(handle, inode, block,
> > > + max_blocks, &map_bh,
> > > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > > + BUG_ON(!ret);
> > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > > + && ((block + ret) > (i_size_read(inode) << blkbits)))
> > > + nblocks = nblocks + ret;
> > > + }
> > > +
> > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > > + goto retry;
> > > +
> > > Now the interesting question is: what do we do if we get halfway through
> > > this loop and then run out of space? We could leave the disk all filled up
> > > and then return failure to the caller, but that's pretty poor behaviour,
> > > IMO.
> > >
> > The current code handles earlier ENOSPC by three times retries. After
> > that if we still run out of space, then it's propably right to notify
> > the caller there isn't much space left.
> >
> > We could extend the block reservation window size before the while loop
> > so we could get a lower chance to get more fragmented.
>
> yes, but my point is that the proposed behaviour is really quite bad.
>
I agree your point, that's why I mention it only helped the
fragmentation issue but not the ENOSPC case.

> We will attempt to allocate the disk space and then we will return failure,
> having consumed all the disk space and having partially and uselessly
> populated an unknown amount of the file.
>

Not totally useless I think. If only half of the space is preallocated
because run out of space, the application can decide whether it's good
enough to start to use this preallocated space or wait for the fs to
have more free space.

> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.
>
> So we'd need to either keep track of which blocks were newly-allocated and
> then free them all again on the error path (doesn't work right across
> commit+crash+recovery) or we could later use the space-reservation scheme which
> delayed allocation will need to introduce.
>
> Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before, whether it's okay to do
allow partial preallocation. At that moment the feedback is it's no much
different than the current zero-out-preallocation behavior: people might
preallocating half-way then later deal with ENOSPC.

We could check the total number of fs free blocks account before
preallocation happens, if there isn't enough space left, there is no
need to bother preallocating.

If there is enough free space, we could make a reservation window that
have at least N free blocks and mark it not stealable by other files. So
later we will not run into the ENOSPC error.

The fs free blocks account is just a estimate though.

Mingming

2007-05-08 01:07:42

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On May 07, 2007 19:02 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >Actually, this is a non-issue. The reason that it is handled for
> >extent-only is that this is the only way to allocate space in the
> >filesystem without doing the explicit zeroing.
>
> Precisely /how/ do you avoid the zeroing issue, for extents?
>
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed,
> otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not. Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-08 01:26:09

by Jeff Garzik

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

Andreas Dilger wrote:
> My comment was just that the extent doesn't have to be explicitly zero
> filled on the disk, by virtue of the fact that the uninitialized flag
> will cause reads to return zero.

Agreed, thanks for the clarification.

Jeff

2007-05-08 01:43:57

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> We could check the total number of fs free blocks account before
> preallocation happens, if there isn't enough space left, there is no
> need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will
prevent the obvious error case where someone tries to preallocate 10GB
when there is only 2GB left. But it won't help if there are multiple
processes trying to allocate blocks the same time. On the other hand,
that case is probably relatively rare, and in that case, the
filesystem was probably going to be left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.

Actually, the kernel could do it, in that could simply release all
unitialized extents back to the system. The problem is distinguishing
between the unitialized extents that had just been newly added, versus
the ones that had there from before. (On the other hand, if the
filesystem was completely full, releasing unitialized blocks wouldn't
be the worse thing in the world to do, although releasing previously
fallocated blocks probably does violate the princple of least
surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> If there is enough free space, we could make a reservation window that
> have at least N free blocks and mark it not stealable by other files. So
> later we will not run into the ENOSPC error.

Could you really use a single reservation window? When the filesystem
is almost full, the free extents are likely going to be scattered all
over the disk. The general principle of grabbing all of the extents
and keeping them in an in-memory data structure, and only adding them
to the extent tree would work, though; I'm just not sure we could do
it using the existing reservation window code, since it only supports
a single reservation window per file, yes?

- Ted

2007-05-08 10:52:47

by Amit K. Arora

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
> On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > > > +{
> > > > + handle_t *handle;
> > > > + ext4_fsblk_t block, max_blocks;
> > > > + int ret, ret2, nblocks = 0, retries = 0;
> > > > + struct buffer_head map_bh;
> > > > + unsigned int credits, blkbits = inode->i_blkbits;
> > > > +
> > > > + /* Currently supporting (pre)allocate mode _only_ */
> > > > + if (mode != FA_ALLOCATE)
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > > > + return -ENOTTY;
> > >
> > > So we don't implement fallocate on bitmap-based files! Well that's huge
> > > news. The changelog would be an appropriate place to communicate this,
> > > along with reasons why, or a description of the plan to fix it.
> >
> > Ok. Will add this in the function description as well.
> >
> > > Also, posix says nothing about fallocate() returning ENOTTY.
> >
> > Right. I don't seem to find any suitable error from posix description.
> > Can you please suggest an error code which might make more sense here ?
> > Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> > non-extent files.
>
> Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
> here, and fall back to the current library code to do preallocation?
> This way, the caller of fallocate() will never see this return code, so
> it won't violate posix.

You are right.

But, we still need to "standardize" (and limit) the error codes
which we should return from kernel when we want to fall back on the
library implementation. The posix_fallocate() library function will have
to look for a set of errors from fallocate() system call, upon receiving
which it will do preallocation from user level; or else, it will return
success/error-code returned by the system call to the user.

I think we can make it fall back to library implementation of fallocate,
whenever posix_fallocate() receives any of the following errors from
fallocate() system call:

1. ENOSYS
2. EOPNOTSUPP
3. ENOTTY (?)

Now the question is - should we limit the set of errors for this purpose
to just 1 & 2 above ? In that case I will need to change the error being
returned here to -EOPNOTSUPP (from current -ENOTTY).

--
Regards,
Amit Arora

2007-05-08 14:48:01

by Dave Kleikamp

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote:
> On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
> > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:

> > > > So we don't implement fallocate on bitmap-based files! Well that's huge
> > > > news. The changelog would be an appropriate place to communicate this,
> > > > along with reasons why, or a description of the plan to fix it.
> > >
> > > Ok. Will add this in the function description as well.
> > >
> > > > Also, posix says nothing about fallocate() returning ENOTTY.
> > >
> > > Right. I don't seem to find any suitable error from posix description.
> > > Can you please suggest an error code which might make more sense here ?
> > > Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> > > non-extent files.
> >
> > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
> > here, and fall back to the current library code to do preallocation?
> > This way, the caller of fallocate() will never see this return code, so
> > it won't violate posix.
>
> You are right.
>
> But, we still need to "standardize" (and limit) the error codes
> which we should return from kernel when we want to fall back on the
> library implementation. The posix_fallocate() library function will have
> to look for a set of errors from fallocate() system call, upon receiving
> which it will do preallocation from user level; or else, it will return
> success/error-code returned by the system call to the user.
>
> I think we can make it fall back to library implementation of fallocate,
> whenever posix_fallocate() receives any of the following errors from
> fallocate() system call:
>
> 1. ENOSYS
> 2. EOPNOTSUPP
> 3. ENOTTY (?)
>
> Now the question is - should we limit the set of errors for this purpose
> to just 1 & 2 above ? In that case I will need to change the error being
> returned here to -EOPNOTSUPP (from current -ENOTTY).

If you want my opinion, -EOPNOTSUPP is better than -ENOTTY.

Shaggy
--
David Kleikamp
IBM Linux Technology Center

2007-05-08 16:53:08

[permalink] [raw]

Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4

On May 07, 2007 21:43 -0400, Theodore Tso wrote:
> On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> > Userspace could presumably repair the mess in most situations by truncating
> > the file back again. The kernel cannot do that because there might be live
> > data in amongst there.
>
> Actually, the kernel could do it, in that could simply release all
> unitialized extents back to the system. The problem is distinguishing
> between the unitialized extents that had just been newly added, versus
> the ones that had there from before. (On the other hand, if the
> filesystem was completely full, releasing unitialized blocks wouldn't
> be the worse thing in the world to do, although releasing previously
> fallocated blocks probably does violate the princple of least
> surprise, even if it's what the user would have wanted.)

I tend to agree with this. Having fallocate() fill up the filesystem
is exactly what the caller asked. Doing a write() hit ENOSPC doesn't
trucate off the whole write either, nor does "dd" delete the whole file
when the filesystem is full.

Even checking the statfs() space before doing the fallocate() may be
counter intuitive, since it will return ENOSPC but the filesystem will
not actually be full. Some applications (e.g. database) may WANT to
fill the filesystem and then get the actual file size back to avoid
trusting statfs() because of metadata overhead (e.g. indirect blocks).

One of the design goals for sys_fallocate() was to allow FA_DELALLOC
to deallocate unwritten extents in a safe manner.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-08 17:46:15