Sorry, we're seriously, seriously, seriously short on flags in the page
struct and this patch is going to need one heck of a good case for it to be
acceptable.
Even then, we should put a lot of effort into finding some way of avoiding
adding that page flag. One option might be to add a new radix-tree tag,
and defining it as "for filesytem usage". Or use PG_checked (which should
be renamed to to PG_fs_misc) (if that doesn't conflict with ext4's existing
use of PG_checked). Or use !PageMappedToDisk()?
These patches seem to have a number of issues - we should get them properly
commented and properly changelogged then get them on the wire for decent
review before investing too much in them, please.
Andrew Morton wrote:
> Sorry, we're seriously, seriously, seriously short on flags in the page
> struct and this patch is going to need one heck of a good case for it to be
> acceptable.
This was for the delayed allocation patchset, right; and by managing
this at the page level that means we can't do this for block size < page
size, I think...
There are already buffer head flags for delalloc (block to be allocated
on flush) and unwritten (actual block allocated to a file but not yet
written) in the vfs - shouldn't we be looking at using those?
Unless there is a clear path from this patchset to one that supports
blocks < page, I'm hesitant about it... but I know I'm being a bit of an
armchair quarterback here, and I'll try to set aside some time to give
this a better look.
Thanks,
-Eric
On Thu, 15 Feb 2007 09:18:34 -0800 Eric Sandeen <[email protected]> wrote:
> Andrew Morton wrote:
> > Sorry, we're seriously, seriously, seriously short on flags in the page
> > struct and this patch is going to need one heck of a good case for it to be
> > acceptable.
>
> This was for the delayed allocation patchset, right; and by managing
> this at the page level that means we can't do this for block size < page
> size, I think...
If the page is partially mapped to disk then we can still reserve an entire
page's worth of blocks, as long as we unreserve an entire page's worth
once the page gets fully mapped to disk.
> There are already buffer head flags for delalloc (block to be allocated
> on flush) and unwritten (actual block allocated to a file but not yet
> written) in the vfs - shouldn't we be looking at using those?
We could, but we have this no-buffer-head mode which really should be the
preferred method, for 4k pagesize systems, at least.
On Thu, 15 Feb 2007 20:30:43 +0300 Alex Tomas <[email protected]> wrote:
> >>>>> Eric Sandeen (ES) writes:
>
> ES> Andrew Morton wrote:
> >> Sorry, we're seriously, seriously, seriously short on flags in the page
> >> struct and this patch is going to need one heck of a good case for it to be
> >> acceptable.
>
> ES> This was for the delayed allocation patchset, right; and by managing
> ES> this at the page level that means we can't do this for block size <
> ES> page size, I think...
>
> I still think that having PG_booked and special code to handle
> case when blocksize==pagesize is worthwhile -- we save 56 bytes
> on 32bit and 104 bytes on 64bit for every written page.
>
If the page doesn't have buffer-heads, set PG_private and clear page->private
If the page does have buffer_heads, use BH_Delay.
>>>>> Andrew Morton (AM) writes:
AM> If the page doesn't have buffer-heads, set PG_private and clear page->private
AM> If the page does have buffer_heads, use BH_Delay.
I did exactly this way in the first version, but later
got feeling that the community'd discard "ugly hack".
one more question: how much of it you want in VFS?
->get_block(with BH_Delay) can be used to signal
filesystem that no actual allocation is required.
so, aware filesystem can just reserve space. then
->writepages() should walk through the pages like
it does currently, collect contiugous sequences
and again call ->get_block(w/o BH_Delay) with b_size
covering all contiguous pages ...
thanks, Alex
>>>>> Eric Sandeen (ES) writes:
ES> Andrew Morton wrote:
>> Sorry, we're seriously, seriously, seriously short on flags in the page
>> struct and this patch is going to need one heck of a good case for it to be
>> acceptable.
ES> This was for the delayed allocation patchset, right; and by managing
ES> this at the page level that means we can't do this for block size <
ES> page size, I think...
I still think that having PG_booked and special code to handle
case when blocksize==pagesize is worthwhile -- we save 56 bytes
on 32bit and 104 bytes on 64bit for every written page.
thanks, Alex
>>>>> Andrew Morton (AM) writes:
AM> Sorry, we're seriously, seriously, seriously short on flags in the page
AM> struct and this patch is going to need one heck of a good case for it to be
AM> acceptable.
AM> Even then, we should put a lot of effort into finding some way of avoiding
AM> adding that page flag. One option might be to add a new radix-tree tag,
AM> and defining it as "for filesytem usage". Or use PG_checked (which should
AM> be renamed to to PG_fs_misc) (if that doesn't conflict with ext4's existing
AM> use of PG_checked). Or use !PageMappedToDisk()?
there is a difference between being mapped and "booked".
the latter means that page isn't allocated yet, but
space is reserved (including metadata) and we're sure
we'll be able to allocate space when the page is being
flushed.
AM> These patches seem to have a number of issues - we should get them properly
AM> commented and properly changelogged then get them on the wire for decent
AM> review before investing too much in them, please.
I've been reworking the delayed allocation patch to move
part of it into VFS. I'll try to comment things better
this time.
thanks, Alex
On Fri, 16 Feb 2007 00:07:55 +0300
Alex Tomas <[email protected]> wrote:
> >>>>> Andrew Morton (AM) writes:
>
> AM> If the page doesn't have buffer-heads, set PG_private and clear page->private
>
> AM> If the page does have buffer_heads, use BH_Delay.
>
> I did exactly this way in the first version, but later
> got feeling that the community'd discard "ugly hack".
These things can be wrapped in nice helper functions in header files. that
allows us to easily reimplement if a better idea comes along.
And we already do all sorts of horrid things to save space in the page
struct.
> one more question: how much of it you want in VFS?
As much as possible, really. It's probably too late for XFS to reuse any
new infrastructure, but if the infrastructure is well-designed it _should_
be possible to us it in new filesytems, and to convert ext2.
> ->get_block(with BH_Delay) can be used to signal
> filesystem that no actual allocation is required.
> so, aware filesystem can just reserve space. then
> ->writepages() should walk through the pages like
> it does currently, collect contiugous sequences
> and again call ->get_block(w/o BH_Delay) with b_size
> covering all contiguous pages ...
>
That sounds like it'd work, yes.
Except for an address_space which is using delayed allocation, its
->prepare_write wouldn't call get_block at all, so perhaps that isn't
needed.
>>>>> Andrew Morton (AM) writes:
-> get_block(with BH_Delay) can be used to signal
>> filesystem that no actual allocation is required.
>> so, aware filesystem can just reserve space. then
-> writepages() should walk through the pages like
>> it does currently, collect contiugous sequences
>> and again call ->get_block(w/o BH_Delay) with b_size
>> covering all contiguous pages ...
>>
AM> That sounds like it'd work, yes.
AM> Except for an address_space which is using delayed allocation, its
-> prepare_write wouldn't call get_block at all, so perhaps that isn't
AM> needed.
hmm. I thought it has to call get_block() at least to know whether
the block is already allocated. and I was going to reserve space
in prepare_write for which some fs-specific method is needed becase
only fs knows how much metadata it'd need.
thanks, Alex
On Fri, 16 Feb 2007 10:30:39 +0300 Alex Tomas <[email protected]> wrote:
> >>>>> Andrew Morton (AM) writes:
>
> -> get_block(with BH_Delay) can be used to signal
> >> filesystem that no actual allocation is required.
> >> so, aware filesystem can just reserve space. then
> -> writepages() should walk through the pages like
> >> it does currently, collect contiugous sequences
> >> and again call ->get_block(w/o BH_Delay) with b_size
> >> covering all contiguous pages ...
> >>
>
> AM> That sounds like it'd work, yes.
>
> AM> Except for an address_space which is using delayed allocation, its
> -> prepare_write wouldn't call get_block at all, so perhaps that isn't
> AM> needed.
>
> hmm. I thought it has to call get_block() at least to know whether
> the block is already allocated. and I was going to reserve space
> in prepare_write for which some fs-specific method is needed becase
> only fs knows how much metadata it'd need.
Well, one could just assume that the page has no disk mapping and go and
make the space reservation. Things will work out OK when we come to do
writepage().
Or one could do both: call get_block() only if the page was inside i_size.
>>>>> Andrew Morton (AM) writes:
AM> Well, one could just assume that the page has no disk mapping and go and
AM> make the space reservation. Things will work out OK when we come to do
AM> writepage().
AM> Or one could do both: call get_block() only if the page was inside i_size.
well, even so we need to reserve not that block ony, but
also needed metadata (for the worst case). probably this
is work for get_block or some different method? anyway,
we have to call it if the page is being written partial.
thanks, Alex
On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas <[email protected]> wrote:
> >>>>> Andrew Morton (AM) writes:
>
> AM> Well, one could just assume that the page has no disk mapping and go and
> AM> make the space reservation. Things will work out OK when we come to do
> AM> writepage().
>
> AM> Or one could do both: call get_block() only if the page was inside i_size.
>
> well, even so we need to reserve not that block ony, but
> also needed metadata (for the worst case).
Right. Reserve all the blocks for the page and all the metadata for a page
at that file offset. Worst case.
> probably this
> is work for get_block or some different method?
umm, when I did this I think I added a new ->reservepage address_space op
and called that from within the VFS's delalloc_prepare_write().
> anyway,
> we have to call it if the page is being written partial.
Not necessarily. If we're operating in nobh mode that page is brought
uptodate in prepare_write().
On Feb 16, 2007 00:06 -0800, Andrew Morton wrote:
> On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas <[email protected]> wrote:
> > well, even so we need to reserve not that block ony, but
> > also needed metadata (for the worst case).
>
> Right. Reserve all the blocks for the page and all the metadata for a page
> at that file offset. Worst case.
This only really becomes an issue when the filesystem (+reservations)
is close to being full. If we are close enough to worry about running
out of space we can also refuse to do delayed allocation.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Feb 15, 2007 09:18 -0800, Eric Sandeen wrote:
> This was for the delayed allocation patchset, right; and by managing
> this at the page level that means we can't do this for block size < page
> size, I think...
We could always disable delayed allocation in the PAGE_SIZE != blocksize
case. It would be no worse than what we have now, and for IO performance
you need larger blocksize anyways.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Thu, 2007-02-15 at 23:46 -0800, Andrew Morton wrote:
> On Fri, 16 Feb 2007 10:30:39 +0300 Alex Tomas <[email protected]> wrote:
> > hmm. I thought it has to call get_block() at least to know whether
> > the block is already allocated. and I was going to reserve space
> > in prepare_write for which some fs-specific method is needed becase
> > only fs knows how much metadata it'd need.
>
> Well, one could just assume that the page has no disk mapping and go and
> make the space reservation. Things will work out OK when we come to do
> writepage().
>
> Or one could do both: call get_block() only if the page was inside i_size.
Or call get_block() with create = 0. Or replace the create argument
with a flags field that can take either GET_BLOCK_CREATE or
GET_BLOCK_RESERVE.
--
David Kleikamp
IBM Linux Technology Center
On Thu, Feb 15, 2007 at 03:23:52PM -0800, Andrew Morton wrote:
> > one more question: how much of it you want in VFS?
>
> As much as possible, really. It's probably too late for XFS to reuse any
> new infrastructure, but if the infrastructure is well-designed it _should_
> be possible to us it in new filesytems, and to convert ext2.
Actually, in a hallway conversation with David Chinner at the Linux
filesystem/storage workshop this week, it sounded like he was willing
to at least consider changing XFS to use a VFS-based infrastructure.
So when we put together a proposal for how to do this in the XFS
layer, I would suggest cc'ing the linux-fsdevel list.
Regards,
- Ted