2004-04-22 00:08:48

by Badari Pulavarty

Subject: ext3 reservation question.

Hi Andrew,

I was just wondering, what would make sense..

Let's say I have a "goal" for allocation, but the goal is not inside my
reservation window. Is it worth *trying* to satisfy the goal by throwing
out our window? Or should we ignore the goal and allocate from the
current reservation window?

Also, how does ext3 determine the goal?

I am worried about the case where multiple threads are writing to
different parts of the same file, with each thread thrashing the
reservation window (since each one has its own goal).

BTW, the current reservation code honors the "goal": it throws out our
window and tries to get a new one to satisfy the goal.

Thanks,
Badari


2004-04-22 01:09:56

by Andrew Morton

Subject: Re: ext3 reservation question.

Badari Pulavarty wrote:
>
> Hi Andrew,
>
> I was just wondering, what would make sense..
>
> Let's say I have a "goal" for allocation, but the goal is not inside my
> reservation window. Is it worth *trying* to satisfy the goal by throwing
> out our window? Or should we ignore the goal and allocate from the
> current reservation window?

That's a hard question.

Yes, it's worth throwing away the reservation and adopting a new one if
doing that is cheap.

> Also, how does ext3 determine the goal?

Via various heuristics. i_next_alloc_goal is "the physical block where I
want to allocate, as long as that allocation corresponds to the logical
block in i_next_alloc_block". These are reasonably documented at their
definition site.
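
A minimal userspace sketch of the heuristic described above, for illustration only. The two field names come from the message; the struct, the helper functions, the `fallback` parameter and the "advance by one" update rule are assumptions, not the actual ext3 code.

```c
/* Hypothetical model: only the two field names come from the thread. */
struct alloc_hint {
    unsigned long i_next_alloc_block; /* logical block the hint applies to */
    unsigned long i_next_alloc_goal;  /* physical block we would like for it */
};

/*
 * Return the physical goal for logical block `lblock`. The hint is only
 * trusted when it corresponds to exactly this logical block; otherwise
 * the caller-supplied `fallback` (some other heuristic) is used.
 */
static unsigned long pick_goal(const struct alloc_hint *h,
                               unsigned long lblock,
                               unsigned long fallback)
{
    if (h->i_next_alloc_goal != 0 && lblock == h->i_next_alloc_block)
        return h->i_next_alloc_goal;
    return fallback;
}

/* Assumed update rule: after allocating, expect a sequential write next. */
static void advance_hint(struct alloc_hint *h,
                         unsigned long lblock, unsigned long pblock)
{
    h->i_next_alloc_block = lblock + 1;
    h->i_next_alloc_goal = pblock + 1;
}
```

With two threads writing at different offsets of the same file, each write invalidates the other's hint, which is the thrashing Badari describes.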

> I am worried about the case where multiple threads are writing to
> different parts of the same file, with each thread thrashing the
> reservation window (since each one has its own goal).

Sure. The reservations should be per-fd, not per-inode. We've always had
that problem.

Making them per-fd is a little tricky, because there might not be any fd's
associated with the inode - this is the
"writepage-over-a-hole-after-the-file-was-closed" problem.

> BTW, the current reservation code honors the "goal": it throws out our
> window and tries to get a new one to satisfy the goal.

Right. That's a problem, because obtaining the new window is now
computationally expensive, and the situation which you describe will
certainly occur.

We need to solve this.

I suggest you move all the reservation info into file->private_data and
swap that into the inode in ext3_prepare_write(). We'd have to do
something along those lines because get_block() isn't passed the file*.
i_sem is held, so there's no competition for the inode. Make sure that
inode->reservation_info is set to NULL again before returning from
ext3_prepare_write().


This would require that all allocation-related fields be moved from the
inode into struct reserve_window, as we earlier discussed.

I think this is worth doing. Free the struct reserve_window at
file->private_data in ext3_release_file(). Allocate and initialise it
lazily, in ext3_prepare_write().
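
The swap described above might look roughly like the following userspace model. Everything here is a sketch under assumptions: the `*_model` types and functions are hypothetical stand-ins for the kernel's `struct file`, `struct inode`, ext3_prepare_write(), get_block() and ext3_release_file(), and locking (i_sem) is assumed to be held by the caller.

```c
#include <stdlib.h>

struct reserve_window {
    unsigned long start, end;   /* reserved physical range (unused here) */
    unsigned long allocations;  /* allocations served from this window */
};

struct inode_model { struct reserve_window *reservation_info; };
struct file_model  { struct inode_model *inode; void *private_data; };

/*
 * Stand-in for get_block(): it is only handed the inode, never the file.
 * Returns 0 if a per-file window was used, 1 if it had to fall back
 * because no window was swapped in (the pageout-over-a-hole case).
 */
static int get_block_model(struct inode_model *inode)
{
    struct reserve_window *rw = inode->reservation_info;

    if (rw == NULL)
        return 1;
    rw->allocations++;
    return 0;
}

/* Stand-in for ext3_prepare_write(); the caller is assumed to hold i_sem. */
static int prepare_write_model(struct file_model *f)
{
    struct reserve_window *rw = f->private_data;
    int ret;

    if (rw == NULL) {                    /* allocate lazily on first write */
        rw = calloc(1, sizeof(*rw));
        if (rw == NULL)
            return -1;
        f->private_data = rw;
    }
    f->inode->reservation_info = rw;     /* swap this file's window in */
    ret = get_block_model(f->inode);
    f->inode->reservation_info = NULL;   /* and back out before returning */
    return ret;
}

/* Stand-in for ext3_release_file(): free the per-file window. */
static void release_file_model(struct file_model *f)
{
    free(f->private_data);
    f->private_data = NULL;
}
```

Two open files on the same inode end up with two independent windows, so neither disturbs the other's reservation.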

We'll need to be able to cope with a NULL inode->reservation_info in
writepage and get_block() for the pageout-over-a-hole problem, but if the
allocation patterns are crappy there it really doesn't matter. Your choice
here is to either allocate a new reserve_window, hosted at
inode->fallback_reservation_info just for this case, or to simply handle
the NULL inode->reservation_info in get_block() and try to do something
reasonable.

But even doing that doesn't solve the problem where we have just a single
file* writing to the file, and the application is seeking all over the
place, and there are a lot of other open files against the fs. In that
case we will also experience potentially very serious CPU consumption
problems.

Two ways of solving this:

a) Convert that linear search into an O(log(n)) one.

b) Discard the current reservation window if the application seeked away, and

c) only adopt a new reservation window if

   i) the file has just been opened, or

   ii) we have seen "several" logically contiguous allocation attempts.
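
The policy in (b) and (c) can be sketched as a small per-file state machine. This is an illustration only: the struct, the function name and the threshold of 4 standing in for "several" are all assumptions.

```c
#include <stdbool.h>

#define CONTIG_THRESHOLD 4  /* hypothetical value for "several" */

struct file_resv {
    bool just_opened;           /* no allocation attempted yet */
    bool has_window;            /* currently holding a reservation window */
    unsigned long next_lblock;  /* logical block a sequential writer hits next */
    unsigned int contig_run;    /* length of the current contiguous run */
};

/*
 * Called on each allocation attempt at logical block `lblock`.
 * Returns true when the caller should pay for a new reservation window.
 */
static bool should_adopt_window(struct file_resv *r, unsigned long lblock)
{
    if (r->just_opened) {               /* (c)(i): file has just been opened */
        r->just_opened = false;
        r->contig_run = 1;
        r->next_lblock = lblock + 1;
        r->has_window = true;
        return true;
    }

    if (lblock == r->next_lblock) {
        r->contig_run++;
    } else {                            /* (b): application seeked away */
        r->contig_run = 1;
        r->has_window = false;
    }
    r->next_lblock = lblock + 1;

    /* (c)(ii): several logically contiguous attempts without a window */
    if (!r->has_window && r->contig_run >= CONTIG_THRESHOLD) {
        r->has_window = true;
        return true;
    }
    return false;
}
```

An application seeking all over the place never builds up a run, so it never triggers the expensive window search.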


Alternatively, if the application did an lseek, we simply retain the
file*'s current reservation window and start using it for allocations at the
new logical offset. I think that's OK - it's no worse than what we have at
present.

2004-04-22 03:05:35

by Linus Torvalds

Subject: Re: ext3 reservation question.



On Wed, 21 Apr 2004, Badari Pulavarty wrote:
>
> I am worried about the case where multiple threads are writing to
> different parts of the same file, with each thread thrashing the
> reservation window (since each one has its own goal).

Didn't we have a patch two years ago or something floating around with
doing lazy (delayed) block allocation on ext2 - doing the actual
allocation only when writing the thing out? Then you shouldn't have this
problem under any normal load, hopefully.

Or was it just some idle discussion that I remember?

Linus

2004-04-22 03:41:36

by Andrew Morton

Subject: Re: ext3 reservation question.

Linus Torvalds wrote:
>
>
>
> On Wed, 21 Apr 2004, Badari Pulavarty wrote:
> >
> > I am worried about the case where multiple threads are writing to
> > different parts of the same file, with each thread thrashing the
> > reservation window (since each one has its own goal).
>
> Didn't we have a patch two years ago or something floating around with
> doing lazy (delayed) block allocation on ext2 - doing the actual
> allocation only when writing the thing out? Then you shouldn't have this
> problem under any normal load, hopefully.

That would certainly help. I had delayed allocation for ext2 all up and
running in 2.5.7 or thereabouts - most of the complexity is in managing
filesystem space reservations. If you don't care about ENOSPC the VFS at
present "just works".

I do recall deciding that there were fundamental journal-related reasons
why delalloc couldn't be made to work properly on ext3.

ummm.

The code I had at the time would reserve space in the filesystem
corresponding to the worst-case occupancy based on file offset. When we
actually hit ENOSPC in prepare_write(), we force writeout, which results in
those worst-space reservations being collapsed into their _real_ space
usage, which is much less. So writeout reclaims space in the filesystem
and prepare_write() can proceed.
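
A rough userspace model of that accounting, for illustration. The worst-case figure assumes the classic ext2 mapping of 12 direct pointers followed by single, double and triple indirect blocks; the struct and function names are hypothetical and are not taken from the delalloc patch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Worst-case number of new blocks needed to write one data block at
 * logical block `lblock`: the data block plus every indirect block on
 * the path to it (assuming none of them exist yet).
 */
static unsigned int worst_case_blocks(unsigned long lblock,
                                      unsigned long addrs_per_block)
{
    unsigned long n = addrs_per_block;

    if (lblock < 12)
        return 1;           /* direct */
    lblock -= 12;
    if (lblock < n)
        return 2;           /* single indirect */
    lblock -= n;
    if (lblock < n * n)
        return 3;           /* double indirect */
    return 4;               /* triple indirect */
}

struct fs_space {
    unsigned long free_blocks;  /* blocks not yet allocated on disk */
    unsigned long reserved;     /* worst-case blocks promised to dirty pages */
};

/* prepare_write() side: reserve the worst case, or report ENOSPC. */
static int reserve_for_write(struct fs_space *fs, unsigned int worst)
{
    if (fs->reserved + worst > fs->free_blocks)
        return -ENOSPC;
    fs->reserved += worst;
    return 0;
}

/* Writeout side: the worst-case reservation collapses into real usage. */
static void collapse_on_writeout(struct fs_space *fs,
                                 unsigned int worst, unsigned int actual)
{
    assert(actual <= worst && fs->reserved >= worst);
    fs->reserved -= worst;
    fs->free_blocks -= actual;
}
```

When reserve_for_write() fails, forcing writeout of already-dirty pages typically frees more reservation than the blocks it really consumes, so a retry can succeed.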

That worked fine on ext2. But on ext3 we have a transaction open in
prepare_write(), and the forced writeback will cause arbitrary amounts of
unexpected metadata to be pumped into the current transaction, causing the
fs to explode.

At least, I _think_ that was the problem. All is hazy.



Alex Tomas has current patches which do delalloc, but I don't know if they
do all the reservation stuff yet.

We would still face layout problems on SMP - two or more CPUs allocating
blocks in parallel. Could be solved by serialising writeback in some
manner - the fs-writeback.c code does that to some extent already.


2004-04-22 13:42:27

by Chris Mason

Subject: Re: ext3 reservation question.

On Wed, 2004-04-21 at 23:40, Andrew Morton wrote:

> The code I had at the time would reserve space in the filesystem
> corresponding to the worst-case occupancy based on file offset. When we
> actually hit ENOSPC in prepare_write(), we force writeout, which results in
> those worst-space reservations being collapsed into their _real_ space
> usage, which is much less. So writeout reclaims space in the filesystem
> and prepare_write() can proceed.
>
> That worked fine on ext2. But on ext3 we have a transaction open in
> prepare_write(), and the forced writeback will cause arbitrary amounts of
> unexpected metadata to be pumped into the current transaction, causing the
> fs to explode.
>
> At least, I _think_ that was the problem. All is hazy.
>
One possible solution is to allocate holes in the file during
prepare_write/commit_write, logging the metadata as you go. Then during
each commit fill any delayed allocations. You've still got a
potentially unbounded operation for logging the bitmaps, maybe solvable
through creative reservations.

-chris

2004-04-23 05:32:46

by Alex Tomas

Subject: Re: ext3 reservation question.

>>>>> Andrew Morton (AM) writes:

AM> That worked fine on ext2. But on ext3 we have a transaction open in
AM> prepare_write(), and the forced writeback will cause arbitrary amounts of
AM> unexpected metadata to be pumped into the current transaction, causing the
AM> fs to explode.

Why open a transaction for ->prepare_write() at all? As far as I can see, it
doesn't touch any metadata that gets stored on disk.

I've partially implemented the following idea:

->prepare_write() works out whether the blocks being written are holes or
already reserved. If they are holes and haven't been reserved yet, it sets
a flag to say so. Note that ->prepare_write() doesn't look like the right
place to make the reservation, because copy_from_user() in
generic_file_aio_write_nolock() may fail.

->commit_write() looks at that flag and, if it's set, tries to reserve blocks.
If the reservation fails, ->commit_write() returns -ENOSPC; ext3_file_write()
recognizes this, requests flushing and waits for free space.

->invalidatepage() drops the reservation if space for the page is still unallocated.

->writepages() and ->writepage() drop the reservation upon real allocation.
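
The lifecycle above can be modelled in userspace roughly as follows. This is a sketch under assumptions: one block per page, a single global counter of reserved blocks, and hypothetical `da_*` names standing in for the real address_space operations.

```c
#include <errno.h>
#include <stdbool.h>

struct page_model {
    bool is_hole;            /* no block allocated on disk yet */
    bool needs_reservation;  /* flag set by prepare_write */
    bool reserved;           /* a block is reserved for this page */
};

struct fs_model {
    unsigned long free_blocks;      /* unallocated blocks on disk */
    unsigned long reserved_blocks;  /* blocks promised to dirty pages */
};

/* Only note that a reservation is needed; the user copy may still fail. */
static void da_prepare_write(struct page_model *p)
{
    if (p->is_hole && !p->reserved)
        p->needs_reservation = true;
}

/* Make the reservation, or report -ENOSPC so the caller can flush and wait. */
static int da_commit_write(struct fs_model *fs, struct page_model *p)
{
    if (!p->needs_reservation)
        return 0;
    if (fs->reserved_blocks >= fs->free_blocks)
        return -ENOSPC;
    fs->reserved_blocks++;
    p->reserved = true;
    p->needs_reservation = false;
    return 0;
}

/* Page dropped before writeout: give the reservation back. */
static void da_invalidatepage(struct fs_model *fs, struct page_model *p)
{
    if (p->reserved) {
        fs->reserved_blocks--;
        p->reserved = false;
    }
    p->needs_reservation = false;
}

/* Real allocation at writeout: the reservation becomes an allocated block. */
static void da_writepage(struct fs_model *fs, struct page_model *p)
{
    if (p->reserved) {
        fs->reserved_blocks--;
        fs->free_blocks--;
        p->reserved = false;
        p->is_hole = false;
    }
}
```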

I expect data=ordered mode to be very simple to implement: just put the bios
submitted in ->writepages() on a list for the corresponding transaction and
wait for completion of those bios in commit_transaction().

does this all make sense?

thanks, Alex