2002-04-10 11:21:36

by Andrew Morton

Subject: [prepatch] address_space-based writeback


This is a largish patch which makes some fairly deep changes. It's
currently at the "wow, it worked" stage. Most of it is fairly
mature code, but some conceptual changes were recently made.
Hopefully it'll be in respectable state in a few days, but I'd
like people to take a look.

The idea is: all writeback is against address_spaces. All dirty data
has the dirty bit set against its page. So all dirty data is
accessible by

superblock list
-> superblock dirty inode list
-> inode mapping's dirty page list.
-> page_buffers(page) (maybe)

pdflush threads are used to implement periodic writeback (kupdate) by
descending the data structures described above. address_spaces now
carry a timestamp (when it was first dirtied) to permit the traditional
"write back stuff which is older than thirty seconds" behaviour.

pdflush threads are used for global writeback (bdflush analogue).

New address_space operations are introduced for whole-file writeback
and for VM writeback. The VM writeback function is designed so that
large chunks of data (I grabbed 4 megs out of the air) will be sent
down to the I/O layers in a single operation.

Aside: Remember, although there's a lot of cleanup and uniformity being
introduced here, one goal is to get rid of the whole practice of
chunking data up into tiny pieces, passing them to the request layer,
adding tiny BIOs to them, sorting them, stringing them onto requests,
then feeding them to the device driver to puzzle over. A key objective
is to deal with maximal-sized chunks of data in an efficient manner.
Assemble large scatter/gather arrays directly against pagecache at the
filemap level. No single-buffer I/O. No single page I/O.


New rules are introduced for the relationship between buffers and their
page:

- If a page has one or more dirty buffers, the page is marked dirty.

- If a page is marked dirty, it *may* have dirty buffers.

- If a page is marked dirty but has no buffers, it is entirely dirty.
So if buffers are later attached to that page, they are all set
dirty.

So block_write_full_page will now only write the dirty buffers.
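To illustrate the third rule, the buffer-attach path does something
like this (a sketch - helper names are approximate):

	/* The page was wholly dirty and had no buffers, so every
	 * buffer we attach to it must inherit the dirty bit. */
	if (PageDirty(page)) {
		struct buffer_head *bh = head;
		do {
			set_bit(BH_Dirty, &bh->b_state);
			bh = bh->b_this_page;
		} while (bh != head);
	}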

All this is designed to support file metadata. Metadata is written back
via its blockdevice's address_space. So mark_buffer_dirty sets the
buffer dirty, sets the page dirty, attaches the page to the blockdev's
address_space's dirty_pages, attaches the blockdev's inode to its dummy
superblock's dirty inodes list and there we are. The blockdev mapping
is treated in an identical manner to other address_spaces.
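Conceptually (a sketch - the real code differs in detail):

	void mark_buffer_dirty(struct buffer_head *bh)
	{
		/* first dirtying propagates up: page -> mapping->dirty_pages
		 * -> mapping->host -> host->i_sb->s_dirty */
		if (!test_and_set_bit(BH_Dirty, &bh->b_state))
			__set_page_dirty(bh->b_page);
	}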

A consequence of this is that if someone cleans a page by directly
writing back its buffers with ll_rw_block or submit_bh, the page will
still be marked dirty, but its buffers will be clean. That's fine -
block_write_full_page will perform no I/O. PG_dirty is advisory if
the page has buffers.

We need to be careful not to free buffers against a dirty page,
because when they are reattached, *all* the buffers will be marked
dirty - which, in the case of the blockdev mapping, would corrupt
file data.

Numerous changes in fs/inode.c make it the means by which we perform
most sorts of writeback. I'll be splitting a new file out from
fs/inode.c for this.


As a consequence of all the above, some adjustments were possible at
the buffer layer.

The patch deletes the buffer LRUs, lru_list_lock, write_locked_buffers,
write_some_buffers, wait_for_buffers, sync_buffers, etc. The bdflush
tunables have been removed. All the sync functions operate at the
address_space level only. The kupdate and bdflush functions have been
removed.

The buffer hash has been removed - hash_table_lock, etc. Buffercache
lookups use the page hash and then a walk across the page's buffer list.
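In outline, the new lookup is something like this (a sketch - the exact
helpers differ, and bdev_mapping/blkbits are assumed to be in scope):

	struct page *page;
	struct buffer_head *bh, *head, *found = NULL;

	page = find_get_page(bdev_mapping, block >> (PAGE_CACHE_SHIFT - blkbits));
	if (page) {
		/* real code must also check that buffers are attached */
		bh = head = page_buffers(page);
		do {
			if (bh->b_blocknr == block && bh->b_size == size) {
				atomic_inc(&bh->b_count);
				found = bh;
				break;
			}
			bh = bh->b_this_page;
		} while (bh != head);
		page_cache_release(page);
	}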

Per-inode locking for i_dirty_buffers and i_dirty_data_buffers has been
introduced. That same lock is held across try_to_free_buffers so that
spinlocking may be used to get exclusion against try_to_free_buffers.
This enabled the new __get_hash_table() to be non-blocking, which is what
some filesystems appear to expect.

sync_page_buffers() has been removed. This is because all VM writeback
occurs at the address_space level, and all pages which contain dirty
data are marked dirty. VM throttling occurs purely at the page level -
wait_on_page().

buffer_head.b_inode, unused_list and address_space.i_dirty_data_buffers
have not been removed yet.

The diff (along with another eight patches) is at

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.8-pre3/dallocbase-70-writeback.patch

As I say, it's still a few days away from being presentable. I
definitely need to test a lot of other filesystems because it
did find a few warts in ext2.


-


2002-04-10 11:34:33

by Alexander Viro

Subject: Re: [prepatch] address_space-based writeback



On Wed, 10 Apr 2002, Andrew Morton wrote:

>
> This is a largish patch which makes some fairly deep changes. It's
> currently at the "wow, it worked" stage. Most of it is fairly
> mature code, but some conceptual changes were recently made.
> Hopefully it'll be in respectable state in a few days, but I'd
> like people to take a look.
>
> The idea is: all writeback is against address_spaces. All dirty data
> has the dirty bit set against its page. So all dirty data is
> accessible by
>
> superblock list
> -> superblock dirty inode list
> -> inode mapping's dirty page list.
> -> page_buffers(page) (maybe)

Wait. You are assuming that all address_spaces happen to be ->i_mapping of
some inode. Which is less than obvious - e.g. putting indirect blocks into
a private per-inode address_space is not unreasonable.

What's more, I wonder how well your scheme works with ->i_mapping
pointing to a different inode's ->i_data (CODA et al., file access to block devices).

BTW, CODA-like beasts can have several local filesystems for cache - i.e.
->i_mapping for dirty inodes from the same superblock can ultimately go
to different queues. Again, the same goes for stuff like
dd if=foo of=/dev/floppy - you get a dirty inode of /dev/floppy with ->i_mapping
pointing to the bdev filesystem, and the queue we end up with has nothing in common
with that of the root fs' device.

I'd really like to see where you are going with all that stuff - if you
expect some correspondence between superblocks and devices, you're
walking straight into a major PITA.

2002-04-10 19:29:31

by Jeremy Jackson

Subject: Re: [prepatch] address_space-based writeback

This sounds like a wonderful piece of work.
I'm also inspired by the rmap stuff coming down
the pipe. I wonder if there would be any interference
between the two, or could they leverage each other?

Jeremy

----- Original Message -----
From: "Andrew Morton" <[email protected]>
Sent: Wednesday, April 10, 2002 4:21 AM


>
> This is a largish patch which makes some fairly deep changes. It's
> currently at the "wow, it worked" stage. Most of it is fairly
> mature code, but some conceptual changes were recently made.
> Hopefully it'll be in respectable state in a few days, but I'd
> like people to take a look.
>
> The idea is: all writeback is against address_spaces. All dirty data
> has the dirty bit set against its page. So all dirty data is
> accessible by
(snip)

2002-04-10 20:18:35

by Andrew Morton

Subject: Re: [prepatch] address_space-based writeback

Alexander Viro wrote:
>
> On Wed, 10 Apr 2002, Andrew Morton wrote:
>
> >
> > This is a largish patch which makes some fairly deep changes. It's
> > currently at the "wow, it worked" stage. Most of it is fairly
> > mature code, but some conceptual changes were recently made.
> > Hopefully it'll be in respectable state in a few days, but I'd
> > like people to take a look.
> >
> > The idea is: all writeback is against address_spaces. All dirty data
> > has the dirty bit set against its page. So all dirty data is
> > accessible by
> >
> > superblock list
> > -> superblock dirty inode list
> > -> inode mapping's dirty page list.
> > -> page_buffers(page) (maybe)
>
> Wait.

Hi, Al. Nothing has really changed wrt the things to which you
refer. ie: they would already be a problem. The relationships
between dirty pages, address_spaces, inodes and superblocks
are unchanged, except for one thing: __mark_inode_dirty will
now attach blockdev inodes to the dummy blockdev's dirty
inodes list.

The main thing which is being changed is buffers. The assumption is
that buffers can be hunted down via
superblocklist->superblock->dirty_inode->i_mapping->writeback_mapping,
not via the global LRU.

> You are assuming that all address_spaces happen to be ->i_mapping of
> some inode.

Sorry, the above diagram is not really accurate. The sync/kupdate/bdflush
writeback path is really:

superblock list
-> superblock dirty inode list
->inode->i_mapping->a_ops->writeback_mapping(mapping)

So core kernel does not actually assume that the to-be-written
pages are accessible via inode->i_mapping->dirty_pages.

I believe that the object relationship you're describing is
that the inode->i_mapping points to the main address_space,
and the `host' field of both the main and private address_spaces
points at the same inode? That the inode owns two
address_spaces?

That's OK. When a page is dirtied, the kernel will attach
that page to the private address_space's dirty pages list and
will attach the common inode to its superblock's dirty inodes list.

For writeback, core kernel will perform

	inode->i_mapping->a_ops->writeback_mapping(mapping, nr_pages)

which will hit the inode's main address_space's writeback_mapping()
method. That method will do:

	my_writeback_mapping(mapping, nr_pages)
	{
		generic_writeback_mapping(mapping, nr_pages);
		mapping->host->private_mapping->a_ops->writeback_mapping(
				mapping->host->private_mapping, nr_pages);
	}

> ...
> What's more, I wonder how well your scheme works with ->i_mapping
> pointing to a different inode's ->i_data (CODA et al., file access to block devices).

Presumably, those different inodes have a superblock? In that
case set_page_dirty will mark that inode dirty wrt its own
superblock. set_page_dirty() is currently an optional a_op,
but it's not obvious that there will be a need for that.

The one thing which does worry me a bit is why __mark_inode_dirty
tests for a null ->i_sb. If the inode doesn't have a superblock
then its pages are hidden from the writeback functions.

This is not fatal per se. The pages are still visible to the VM
via the LRU, and presumably the filesystem knows how to sync
its own stuff. But for memory-balancing and throttling purposes,
I'd prefer that the null ->i_sb not be allowed. Where can this
occur?

> BTW, CODA-like beasts can have several local filesystems for cache - i.e.
> ->i_mapping for dirty inodes from the same superblock can ultimately go
> to different queues.

When a page is dirtied, those inodes will be attached to their
->i_sb's dirty inode list; I haven't changed any of that.

> Again, the same goes for stuff like
> dd if=foo of=/dev/floppy - you get a dirty inode of /dev/floppy with ->i_mapping
> pointing to the bdev filesystem, and the queue we end up with has nothing in common
> with that of the root fs' device.

OK. What has changed here is that the resulting mark_buffer_dirty
calls will now set the page dirty, and attach the page to its
mapping's dirty_pages, and attach its mapping's host to
mapping->host->i_sb->s_dirty.

So writeback for the floppy device is no longer via the
buffer LRU. It's via

dummy blockdev superblock
-> blockdev's inode
->i_mapping
->writeback_mapping

> I'd really like to see where you are going with all that stuff - if you
> expect some correspondence between superblocks and devices, you're
> walking straight into a major PITA.

Hopefully, none of that has changed. It's just the null inode->i_sb
which is a potential problem.

-

2002-04-10 20:43:42

by Andrew Morton

Subject: Re: [prepatch] address_space-based writeback

Jeremy Jackson wrote:
>
> This sounds like a wonderful piece of work.
> I'm also inspired by the rmap stuff coming down
> the pipe. I wonder if there would be any interference
> between the two, or could they leverage each other?
>

Well, the theory is that rmap doesn't need to know. It just
calls writepage when it sees dirty pages. That's a minimal
approach. But I'd like the VM to aggressively use the
"write out lots of pages in the same call" APIs, rather
than the current "send lots of individual pages" approach.

For a number of reasons:

- For delalloc filesystems, and for sparse mappings
against "syncalloc" filesystems, disk allocation is
performed at ->writepage time. It's important that
the writes be clustered effectively. Otherwise
the file gets fragmented on-disk.

- There's a reasonable chance that the pages on the
LRU lists get themselves out-of-order as the aging
process proceeds. So calling ->writepage in lru_list
order has the potential to result in fragmented write
patterns, and inefficient I/O.

- If the VM is trying to free pages from, say, ZONE_NORMAL
then it will only perform writeout against pages from
ZONE_NORMAL and ZONE_DMA. But there may be writable pages
from ZONE_HIGHMEM sprinkled amongst those pages within the
file. It would be really bad if we miss out on opportunistically
slotting those other pages into the same disk request.

- The current VM writeback is an enormous code path. For each
tiny little page, we send it off into writepage, pass it
to the filesystem, give it a disk mapping, give it a buffer_head,
submit the buffer_head, attach a tiny BIO to the buffer_head,
submit the BIO, merge that onto a request structure which
contains contiguous BIOs, feed that list-of-single-page-BIOs
to the device driver.

That's rather painful. So the intent is to batch this work
up. Instead, the VM says "write up to four megs from this
page's mapping, including this page".

That request passes through the filesystem and we wrap 64k
or larger BIOs around the pagecache data and put those into
the request queue. For some filesystems the buffer_heads
are ignored altogether. For others, the buffer_head
can be used at "build the big BIO" time to work out how to
segment the BIOs across sub-page boundaries.

The "including this page" requirement of the vm_writeback
method is there because the VM may be trying to free pages
from a specific zone, so it would not be useful if the
filesystem went and submitted I/O for a ton of pages which
are all from the wrong zone. This is a bit of an ugly
back-compatible placeholder to keep the VM happy before
we move on to that part of it.
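For concreteness, the shape of the two new methods is roughly this
(the prototypes are not final):

	struct address_space_operations {
		/* ... existing ops ... */

		/* write back some or all of the mapping's dirty pages */
		int (*writeback_mapping)(struct address_space *mapping,
					 int *nr_pages);

		/* VM pressure: write a large chunk of the page's mapping,
		 * including this page */
		int (*vm_writeback)(struct page *page, int *nr_pages);
	};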

-

2002-04-10 20:53:22

by Alexander Viro

Subject: Re: [prepatch] address_space-based writeback



On Wed, 10 Apr 2002, Andrew Morton wrote:

[I'll answer the rest when I get some sleep]

> The one thing which does worry me a bit is why __mark_inode_dirty
> tests for a null ->i_sb. If the inode doesn't have a superblock
> then its pages are hidden from the writeback functions.
>
> This is not fatal per se. The pages are still visible to the VM
> via the LRU, and presumably the filesystem knows how to sync
> its own stuff. But for memory-balancing and throttling purposes,
> I'd prefer that the null ->i_sb not be allowed. Where can this
> occur?

In rather old kernels. IOW, these checks are atavisms - these days
_all_ inodes have a (constant) ->i_sb. The only way to create an
in-core inode is alloc_inode(superblock), which requires the superblock
to be non-NULL. After that ->i_sb stays unchanged (and non-NULL)
until the struct inode is freed.

2002-04-10 22:12:16

by Jan Harkes

Subject: Re: [prepatch] address_space-based writeback

On Wed, Apr 10, 2002 at 12:16:26PM -0700, Andrew Morton wrote:
> I believe that the object relationship you're describing is
> that the inode->i_mapping points to the main address_space,
> and the `host' field of both the main and private address_spaces
> points at the same inode? That the inode owns two
> address_spaces?

Actually with Coda we've got 2 inodes that have an identical i_mapping.
The Coda inode's i_mapping is set to point to the hosting inode's
i_data. Hmm, I should probably set it to the i_mapping of the host
inode; that way I can run Coda over a Coda(-like) filesystem.

> That's OK. When a page is dirtied, the kernel will attach
> that page to the private address_space's dirty pages list and
> will attach the common inode to its superblock's dirty inodes list.

But Coda has 2 inodes - which one are you connecting to whose superblock?
My guess is that it would be correct to add inode->i_mapping->host to
inode->i_mapping->host->i_sb, which will be the underlying inode in
Coda's case, but host isn't guaranteed to be an inode; it just happens
to be an inode in all existing situations.

> > What's more, I wonder how well your scheme works with ->i_mapping
> > pointing to a different inode's ->i_data (CODA et al., file access to block devices).
>
> Presumably, those different inodes have a superblock? In that
> case set_page_dirty will mark that inode dirty wrt its own
> superblock. set_page_dirty() is currently an optional a_op,
> but it's not obvious that there will be a need for that.

Coda's inodes don't have to get dirtied because we never write them out,
although the associated dirty pages do need to hit the disk eventually :)

Jan

2002-04-10 22:46:41

by Andrew Morton

Subject: Re: [prepatch] address_space-based writeback

Jan Harkes wrote:
>
> On Wed, Apr 10, 2002 at 12:16:26PM -0700, Andrew Morton wrote:
> > I believe that the object relationship you're describing is
> > that the inode->i_mapping points to the main address_space,
> > and the `host' field of both the main and private address_spaces
> > points at the same inode? That the inode owns two
> > address_spaces?
>
> Actually with Coda we've got 2 inodes that have an identical i_mapping.
> The Coda inode's i_mapping is set to point to the hosting inode's
> i_data.

I see. So this is all your fault :)

> ...
>
> But Coda has 2 inodes - which one are you connecting to whose superblock?
> My guess is that it would be correct to add inode->i_mapping->host to
> inode->i_mapping->host->i_sb, which will be the underlying inode in
> Coda's case, but host isn't guaranteed to be an inode; it just happens
> to be an inode in all existing situations.

When a page is marked dirty, the path which is followed
is page->mapping->host->i_sb. So in this case the page will
be attached to its page->mapping->dirty_pages, and
page->mapping->host will be attached to page->mapping->host->i_sb->s_dirty.

This is as it always was - I didn't change any of this.

> > > What's more, I wonder how well your scheme works with ->i_mapping
> > > pointing to a different inode's ->i_data (CODA et al., file access to block devices).
> >
> > Presumably, those different inodes have a superblock? In that
> > case set_page_dirty will mark that inode dirty wrt its own
> > superblock. set_page_dirty() is currently an optional a_op,
> > but it's not obvious that there will be a need for that.
>
> Coda's inodes don't have to get dirtied because we never write them out,
> although the associated dirty pages do need to hit the disk eventually :)
>

Right. Presumably, the pages hit the disk via the hosting inode's
filesystem's mechanics.

And it remains the case that Coda inodes will not be marked DIRTY_PAGES
because set_page_dirty()'s page->mapping->host walk will arrive at
the hosting inode.

-

2002-04-10 22:53:27

by Anton Altaparmakov

Subject: Re: [prepatch] address_space-based writeback

At 22:44 10/04/02, Andrew Morton wrote:
>When a page is marked dirty, the path which is followed
>is page->mapping->host->i_sb. So in this case the page will
>be attached to its page->mapping->dirty_pages, and
>page->mapping->host will be attached to page->mapping->host->i_sb->s_dirty.
>
>This is as it always was - I didn't change any of this.

Um, NTFS uses address spaces for things where ->host is not an inode at all
so doing host->i_sb will give you god knows what but certainly not a super
block!

As long as your patches don't break what is currently possible, I am happy...
But from what you are saying above I have a bad feeling you are somehow
assuming that a mapping's host is an inode...

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2002-04-10 23:02:21

by Jan Harkes

Subject: Re: [prepatch] address_space-based writeback

On Wed, Apr 10, 2002 at 02:44:40PM -0700, Andrew Morton wrote:
> Jan Harkes wrote:
> > But Coda has 2 inodes - which one are you connecting to whose superblock?
> > My guess is that it would be correct to add inode->i_mapping->host to
> > inode->i_mapping->host->i_sb, which will be the underlying inode in
> > Coda's case, but host isn't guaranteed to be an inode; it just happens
> > to be an inode in all existing situations.
>
> When a page is marked dirty, the path which is followed
> is page->mapping->host->i_sb. So in this case the page will
> be attached to its page->mapping->dirty_pages, and
> page->mapping->host will be attached to page->mapping->host->i_sb->s_dirty.
>
> This is as it always was - I didn't change any of this.

As far as I can see, it should work just fine.

> > Coda's inodes don't have to get dirtied because we never write them out,
> > although the associated dirty pages do need to hit the disk eventually :)
>
> Right. Presumably, the pages hit the disk via the hosting inode's
> filesystem's mechanics.
>
> And it remains the case that Coda inodes will not be marked DIRTY_PAGES
> because set_page_dirty()'s page->mapping->host walk will arrive at
> the hosting inode.

Correct, we rely completely on the hosting (inode's) filesystem to
implement the actual file operations. So the hosting inode is the
one that becomes dirty and is pushed to disk by bdflush.

Jan

2002-04-10 23:33:14

by Andrew Morton

Subject: Re: [prepatch] address_space-based writeback

Anton Altaparmakov wrote:
>
> At 22:44 10/04/02, Andrew Morton wrote:
> >When a page is marked dirty, the path which is followed
> >is page->mapping->host->i_sb. So in this case the page will
> >be attached to its page->mapping->dirty_pages, and
> >page->mapping->host will be attached to page->mapping->host->i_sb->s_dirty.
> >
> >This is as it always was - I didn't change any of this.
>
> Um, NTFS uses address spaces for things where ->host is not an inode at all
> so doing host->i_sb will give you god knows what but certainly not a super
> block!

But it's a `struct inode *' :(

What happens when someone runs set_page_dirty against one of
the address_space's pages? I guess that doesn't happen, because
it would explode. Do these address_spaces not support writable
mappings?

I like to think in terms of "top down" and "bottom up".

set_page_dirty is the core "bottom up" function which propagates
dirtiness information from the bottom of the superblock/inode/page
tree up to the top.

writeback is top-down. It goes from the superblock list down
to pages.

The assumption about page->mapping->host being an inode
only occurs in the bottom-up path, at set_page_dirty().

> As long as your patches don't break what is currently possible, I am happy...
> But from what you are saying above I have a bad feeling you are somehow
> assuming that a mapping's host is an inode...

Well, the default implementation of __set_page_dirty() will
make that assumption. (It always has).

But the address_space may implement its own a_ops->set_page_dirty(page),
so you can do whatever you need to do there, yes?

I currently have:

static inline int set_page_dirty(struct page *page)
{
	if (page->mapping) {
		int (*spd)(struct page *, int reserve_page);

		spd = page->mapping->a_ops->set_page_dirty;
		if (spd)
			return (*spd)(page, 1);
	}
	return __set_page_dirty_buffers(page, 1);
}

Where __set_page_dirty_buffers() will dirty the buffers if
they exist. And non-buffer_head-backed filesystems which
use page->private MUST implement set_page_dirty().

The reserve_page stuff is for delayed-allocate, the priority
and timing of which has been pushed waaay back by this. I'm
keeping the reserve_page infrastructure around at present
because of vague thoughts that it may be useful to fix the
data-loss bug which occurs when a shared mapping of a sparse
file has insufficient disk space to satisfy new page instantiations.
Dunno yet.

(Sometime I need to go through and spell out all the new a_ops
methods in all the filesystems, and take out the fall-through-
to-default-handler stuff here, and in do_flushpage() and
try_to_release_page() and others. But not now).


-

2002-04-11 20:21:14

by Linus Torvalds

Subject: Re: [prepatch] address_space-based writeback

In article <[email protected]>,
Anton Altaparmakov <[email protected]> wrote:
>
>Um, NTFS uses address spaces for things where ->host is not an inode at all
>so doing host->i_sb will give you god knows what but certainly not a super
>block!

Then that should be fixed in NTFS.

The original meaning of "->host" was that it could be anything (and it
was a "void *", but the fact is that all the generic VM code etc needed
to know about host things like size, locking etc, so for over a year now
"host" has been a "struct inode", and if you need to have something
else, then that something else has to embed a proper inode.

>As long as your patches don't break what is currently possible, I am happy...
>But from what you are saying above I have a bad feeling you are somehow
>assuming that a mapping's host is an inode...

It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
and notice the

struct inode *host;

part.

Linus

2002-04-11 20:41:43

by Alexander Viro

Subject: Re: [prepatch] address_space-based writeback



On Thu, 11 Apr 2002, Linus Torvalds wrote:

> >As long as your patches don't break what is currently possible, I am happy...
> >But from what you are saying above I have a bad feeling you are somehow
> >assuming that a mapping's host is an inode...
>
> It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
> and notice the
>
> struct inode *host;

True. However, Andrew _is_ assuming that you can get from inode->i_mapping to
&FOOFS_I(inode)->secondary_address_space, which is simply not true.

Consider a filesystem that uses an address_space to store, say, the EAs of an inode.
Now look at device node on such fs. What we have is

inode_1: sits on our fs, has i_mapping pointing to inode_2->i_data and
EA_address_space in fs-private part of inode;

inode_2: block device inode, sits on bdev.

inode_1->i_mapping == &inode_2->i_data
inode_1->i_mapping->host == inode_2
FOOFS_I(inode_1)->EA_address_space.host == inode_1

Looks OK? Now look at Andrew's proposal - he suggests having a
method of inode_1->i_mapping be responsible for writing out
&FOOFS_I(inode_1)->EA_address_space.

See where it breaks? After we've followed ->i_mapping we lose information
about the private parts of the inode. And that's OK - that's precisely what we
want for, say, block devices. There ->i_mapping should be the same
for _all_ inodes with this major:minor. However, that makes "scan all
superblocks, for each of them scan dirty inodes, for each of them do
some work on ->i_mapping" hopeless as a way to reach all address_spaces
in the system.

FWIW, correct solution might be to put dirty address_spaces on a list -
per-superblock or global.
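Roughly (all names invented here; with a global list you'd drop the
per-sb field):

	struct address_space {
		/* ... */
		struct super_block *a_sb;	/* who this mapping belongs to */
		struct list_head a_dirty;	/* on a_sb->s_dirty_mappings */
	};

	/* when the first page of a mapping goes dirty: */
	spin_lock(&mapping_list_lock);
	if (list_empty(&mapping->a_dirty))
		list_add(&mapping->a_dirty, &mapping->a_sb->s_dirty_mappings);
	spin_unlock(&mapping_list_lock);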

2002-04-11 22:29:19

by Andrew Morton

Subject: Re: [prepatch] address_space-based writeback

Alexander Viro wrote:
>
> ...
> FWIW, correct solution might be to put dirty address_spaces on a list -
> per-superblock or global.

Another approach may be to implement address_space.private,
which appears to be what NTFS wants. My initial reaction
to that is fear, because it just makes the graph even more
tangly.

I agree that listing the dirty address_spaces against the
superblock makes sense - really it's what I'm trying to do,
and the intermediate inode is the only means of getting there.

Also, this splitup would clearly separate the concepts
of a dirty-inode and dirty-inode-pages. These seem to be
coupled in a non-obvious way at present.

AFAIK, the current superblock->inode->mapping approach won't
break any existing filesystems, so I'm inclined to follow
that for a while, get all the known problems collected
together and then take another pass at it. Maybe do something
to make inode_lock a bit more conventional as well.


This whole trend toward a flowering of tiny address_spaces
worries me a little from the performance POV, because
writeback likes big gobs of linear data to chew on. With the
global buffer LRU, even though large-scale sorting
opportunities were never implemented, we at least threw
a decent amount of data at the request queue.

With my current approach, all dirty conventional metadata
(the output from mark_buffer_dirty) is written back via
the blockdev's mapping. It becomes a bit of a dumping
ground for the output of legacy filesystems, but it does
offer the opportunity to implement a high-level sort before
sending it to disk. If that's needed.

I guess that as long as the periodic- and memory-pressure
based writeback continues to send a lot of pages down the
pipe we'll still get adequate merging, but it is something
to be borne in mind. At the end of the day we have to deal
with these funny spinning things.


One thing I'm not clear on with the private metadata address_space
concept: how will it handle blocksize less than PAGE_CACHE_SIZE?
The only means we have at present of representing sub-page
segments is the buffer_head. Do we want to generalise the buffer
layer so that it can be applied against private address_spaces?
That wouldn't be a big leap.

Or would the metadata address_spaces send their I/O through the
backing blockdev mapping in some manner?

-

2002-04-11 22:57:39

by Andreas Dilger

Subject: Re: [prepatch] address_space-based writeback

On Apr 11, 2002 14:27 -0700, Andrew Morton wrote:
> One thing I'm not clear on with the private metadata address_space
> concept: how will it handle blocksize less than PAGE_CACHE_SIZE?
> The only means we have at present of representing sub-page
> segments is the buffer_head. Do we want to generalise the buffer
> layer so that it can be applied against private address_spaces?
> That wouldn't be a big leap.

I was going to send you an email on this previously, but I (think I)
didn't in the end...

At one time Linus proposed having an array of dirty bits for a page,
which would allow us to mark only parts of a page dirty (say down to
the sector level). I believe this was in the discussion about moving
the block devices to the page cache around 2.4.10.

While that isn't a huge win in most cases (it costs the same to
write 512 bytes as 4096 bytes to disk because of disk latencies) it may
be more important if/when we ever have larger pages. This also becomes
more important if you are working with a network filesystem where you
have to send all of the dirty data over a much smaller pipe, so sending
512 bytes takes 1/8 as long as sending 4096 bytes.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-04-11 23:10:54

by Christoph Hellwig

Subject: Re: [prepatch] address_space-based writeback

On Thu, Apr 11, 2002 at 04:55:36PM -0600, Andreas Dilger wrote:
> At one time Linus proposed having an array of dirty bits for a page,
> which would allow us to mark only parts of a page dirty (say down to
> the sector level). I believe this was in the discussion about moving
> the block devices to the page cache around 2.4.10.

The early XFS code used to do this for kiobuf-based block I/O, but it
got dropped around 2.4.8 IIRC. The new page->private handling from akpm
which seems to be merged in Linus' BK tree (Linus: please push it to
bkbits.net, thanks :)) can be used to do the same if and only if we don't
need the buffer_heads attached to the page.

2002-04-11 23:22:20

by Anton Altaparmakov

Subject: Re: [prepatch] address_space-based writeback

On Thu, 11 Apr 2002, Andrew Morton wrote:
> Alexander Viro wrote:
> > ...
> > FWIW, correct solution might be to put dirty address_spaces on a list -
> > per-superblock or global.
>
> Another approach may be to implement address_space.private,
> which appears to be what NTFS wants. My initial reaction
> to that is fear, because it just makes the graph even more
> tangly.
>
> I agree that listing the dirty address_spaces against the
> superblock makes sense - really it's what I'm trying to do,
> and the intermediate inode is the only means of getting there.

If you take Al's approach then the intermediate inode becomes irrelevant.
I like the idea of a per-superblock dirty mappings list a lot. In fact I
was planning to do that for ntfs and attach the list to the ntfs_volume
and then have a kntfsd thread walk those every 5 seconds and flush to
disk... (I was also thinking of throwing in barriers and
flush_to_this_barrier_now functionality for journalling purposes but I
was going to leave that for later...)

It would be great if the VFS implemented that in general instead. (-:

> Also, this splitup would clearly separate the concepts
> of a dirty-inode and dirty-inode-pages. These seem to be
> coupled in a non-obvious way at present.

Indeed. That would be very nice.

> AFAIK, the current superblock->inode->mapping approach won't
> break any existing filesystems, so I'm inclined to follow
> that for the while, get all the known problems collected
> together and then take another pass at it. Maybe do something
> to make inode_lock a bit more conventional as well.
>
> This whole trend toward a flowering of tiny address_spaces
> worries me a little from the performance POV,

Why tiny address_spaces? I have seen a 4.5GiB $MFT/$DATA (inode table)
on a 40GiB ntfs partition, and that was a year ago now... That's not what I
call tiny. (-;

The cluster allocation bitmap on a 40GiB partition is 10MiB in size; that
is not huge, but not what I would call tiny either...

> because writeback likes big gobs of linear data to chew on. With the
> global buffer LRU, even though large-scale sorting opportunities were
> never implemented, we at least threw a decent amount of data at the
> request queue.
>
> With my current approach, all dirty conventional metadata
> (the output from mark_buffer_dirty) is written back via
> the blockdev's mapping. It becomes a bit of a dumping
> ground for the output of legacy filesystems, but it does
> offer the opportunity to implement a high-level sort before
> sending it to disk. If that's needed.

Considering modern harddrives have their own intelligent write back and
sorting of write requests, and ever-growing hd buffers, the need for doing
this at the OS level is going to become less and less, would be my guess - I
may be wrong of course...

> I guess that as long as the periodic- and memory-pressure
> based writeback continues to send a lot of pages down the
> pipe we'll still get adequate merging, but it is something
> to be borne in mind. At the end of the day we have to deal
> with these funny spinning things.
>
> One thing I'm not clear on with the private metadata address_space
> concept: how will it handle blocksize less than PAGE_CACHE_SIZE?

The new ntfs uses 512 byte blocksize ONLY (the hd sector size) and remaps
internally between the actual partition cluster size (block size) and the
512 byte block size set in s_blocksize.

This is the only way to support cluster sizes above PAGE_CACHE_SIZE -
NTFS 3.1 can go up to 512kiB clusters (and bigger, but Windows doesn't
allow you to format partitions with bigger cluster sizes AFAIK).

At the moment I am using a buffer head for each 512 byte disk sector which
is very counterproductive to performance.

It would be great to be able to submit variable size "io entities" even
greater than PAGE_CACHE_SIZE (by giving a list of pages, starting offset
in first page and total request size for example) and saying write that to
the device starting at offset xyz. That would suit ntfs perfectly. (-:

> The only means we have at present of representing sub-page
> segments is the buffer_head. Do we want to generalise the buffer
> layer so that it can be applied against private address_spaces?
> That wouldn't be a big leap.

It is already generalised. I use it that way at least at the moment...

> Or would the metadata address_spaces send their I/O through the
> backing blockdev mapping in some manner?

Not sure what you mean...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2002-04-11 23:51:29

by Andrew Morton

Subject: Re: [prepatch] address_space-based writeback

Andreas Dilger wrote:
>
> On Apr 11, 2002 14:27 -0700, Andrew Morton wrote:
> > One thing I'm not clear on with the private metadata address_space
> > concept: how will it handle blocksize less than PAGE_CACHE_SIZE?
> > The only means we have at present of representing sub-page
> > segments is the buffer_head. Do we want to generalise the buffer
> > layer so that it can be applied against private address_spaces?
> > That wouldn't be a big leap.
>
> I was going to send you an email on this previously, but I (think I)
> didn't in the end...
>
> At one time Linus proposed having an array of dirty bits for a page,
> which would allow us to mark only parts of a page dirty (say down to
> the sector level). I believe this was in the discussion about moving
> the block devices to the page cache around 2.4.10.

What Christoph said :)

One problem with having a bitmap at page->private is that
we don't know how big the blocks are. You could have
0x00000003 but you don't know if that's two 1k blocks
or two 2k blocks. Not hard to work around I guess.
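One obvious workaround: steal a few low bits of page->private for the
blocksize, e.g. (a sketch):

	/* low 3 bits: log2(blocksize) - 9; the rest: per-block dirty map.
	 * A 4k page of 512-byte blocks needs only 8 map bits, so it fits. */
	#define PRIV_BLKBITS(priv)	(((priv) & 7UL) + 9)
	#define PRIV_DIRTY_MAP(priv)	((priv) >> 3)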

But you will, I suspect, end up needing additional
information. Whether that part of the page has a
disk mapping. If so, what block is it. Maybe not a
lot more.

In the delalloc no-buffer_head code I'm just sticking a block
number at page->private, to cache the disk mapping, to save
additional get_block calls when the file is overwritten. That
works if all the page's blocks are disk-contiguous. If they're
not, repeated get_block calls are needed. That all works OK.
Even when I wasn't caching get_block results at all I didn't
observe any particular performance problems from the need for
repeated get_block calls.

We could also just stick a dynamically-allocated array
of next-gen buffer_heads at page->private. There's really
no need to allocate each buffer separately. This would
save memory, save CPU and with a bit of pointer arithmetic,
liberate b_this_page and b_data.
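Say, something like this (a sketch - the field set is just a guess):

	/* one allocation at page->private describes every block in the page */
	struct page_buffers {
		int nr;				/* PAGE_CACHE_SIZE / blocksize */
		struct {
			unsigned long state;	/* dirty, uptodate, mapped, ... */
			unsigned long blocknr;	/* disk mapping, if mapped */
		} b[0];
	};

	/* b_data becomes page_address(page) + i * blocksize, and
	 * b_this_page becomes b[(i + 1) % nr] */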


Anyway, I puzzled it out. Private address_spaces will "just work"
at present. There are some wrinkles. Take the example of ext2
indirect blocks. Suppose they are in an address_space which is
contained in the private part of the inode. Presumably, that
address_space's ->host points back at the regular file's inode. But even if
it doesn't, the indirect block's address_space needs to make damn
sure that it passes the main file's inode into the block allocator
so that the indirects are laid out in the right place. The cost
of not placing the first indirect immediately behind block 11 is
30%. Been there, done that :)

Now, if it's done right, that indirect block address_space is
disk-fragmented all over the shop. Each of its blocks are four
megabytes apart. This has the potential to come significantly
unstuck with address_space-based writeback, because the walk
across the file's pages will be in a separate pass from the
walk across the indirects.

In the 2.4-way of doing things, the regular data and the indirects
are nicely sorted on the global buffer LRU.

It is easy to solve this with delayed-allocate. Because here,
you *know* that when an indirect block is allocated, I/O
is underway right now against its neighbouring data blocks.
So you just start I/O against the indirect as soon as you
allocate it.

But for synchronous-allocate filesystems I have a problem. This
could significantly suck for writeout of very large files, when
the request queue is at its current teeny size. Maybe writepage
of regular data at an indirect boundary needs to hunt down its
associated indirect block and start low-level I/O against that.
That will work.

Hopefully when Jens gets the new I/O scheduler in place we can
comfortably extend the request queue to 1024 slots or so.

-

2002-04-12 00:05:31

by Andrew Morton

Subject: Re: [prepatch] address_space-based writeback

Anton Altaparmakov wrote:
>
> ...
> It would be great to be able to submit variable size "io entities" even
> greater than PAGE_CACHE_SIZE (by giving a list of pages, starting offset
> in first page and total request size for example) and saying write that to
> the device starting at offset xyz. That would suit ntfs perfectly. (-:
>

Yes, I'll be implementing that. Writeback will hit the filesystem
via a_ops->writeback_mapping(mapping, nr_pages). The filesystem
will then call in to generic_writeback_mapping(), which will walk
the pages, assemble BIOs and send them off.

The filesystem needs to pass a little state structure into the
generic writeback function. An important part of that is a
semaphore which is held while writeback is locking multiple
pages. To avoid ab/ba deadlocks.
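i.e. something along these lines (a sketch - all the names here are
provisional, and FOOFS_SB is a hypothetical filesystem's sb accessor):

	struct wb_state {
		/* held while the generic code locks multiple pages */
		struct semaphore lock;
	};

	int foofs_writeback_mapping(struct address_space *mapping, int *nr_pages)
	{
		struct wb_state *wbs = &FOOFS_SB(mapping->host->i_sb)->wb_state;

		return generic_writeback_mapping(mapping, wbs, nr_pages);
	}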

The current implementation of this bio-assembler is for no-buffer
(delalloc) filesystems only. It needs to be enhanced (or forked)
to also cope with buffer-backed pages. It will need to peek
inside the buffer-heads to detect clean buffers (maybe. It
definitely needs to skip unmapped ones). When such a buffer
is encountered the BIO will be sent off and a new one will be started.
The code for this is going to be quite horrendous. I suspect
the comment/code ratio will exceed 1.0, which is a bad sign :)

One thing upon which I am undecided: for syncalloc filesystems
like ext2, do we attach buffers at ->readpages() time, or do
we just leave the page bufferless?

That's a hard call. It helps the common case, but in the uncommon
case (we overwrite the file after reading it), we need to run
get_block again.

-

2002-04-12 00:13:23

by Linus Torvalds

Subject: Re: [prepatch] address_space-based writeback



On Thu, 11 Apr 2002, Andrew Morton wrote:
>
> One problem with having a bitmap at page->private is that
> we don't know how big the blocks are. You could have
> 0x00000003 but you don't know if that's two 1k blocks
> or two 2k blocks. Not hard to work around I guess.

I think the more fundamental problem is that we want to have a generic
mechanism, and for other filesystems the writeback data is a lot different
from "this sector is dirty".

That, in the end, convinced me that there's no point to trying to keep
per-sector dirty data around in a generic format - it just wasn't generic
enough.

So now it's up to the writeback mechanism itself to keep track of which
part of the page is dirty, be it with the NFS kind of "nfs_page"
structures, or with things like "struct buffer_head", and I certainly no
longer have any plans at all to try to keep anything like a "dirty bitmap"
in the generic data structures.

Linus

2002-04-12 01:15:01

by Anton Altaparmakov

Subject: Re: [prepatch] address_space-based writeback

At 21:20 11/04/02, Linus Torvalds wrote:
>In article <[email protected]>,
>Anton Altaparmakov <[email protected]> wrote:
> >
> >Um, NTFS uses address spaces for things where ->host is not an inode at all
> >so doing host->i_sb will give you god knows what but certainly not a super
> >block!
>
>Then that should be fixed in NTFS.
>
>The original meaning of "->host" was that it could be anything (and it
>was a "void *", but the fact is that all the generic VM code etc needed
>to know about host things like size, locking etc, so for over a year now
>"host" has been a "struct inode", and if you need to have something
>else, then that something else has to embed a proper inode.
>
> >As long as your patches don't break what is currently possible, I am happy...
> >But from what you are saying above I have a bad feeling you are somehow
> >assuming that a mapping's host is an inode...
>
>It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
>and notice the
>
> struct inode *host;
>
>part.

Yes I know that. Why not extend address spaces beyond being just inode
caches? An address space with its mapping is a very useful generic data
cache object.

Requiring a whole struct inode seems silly. I suspect (without in-depth
knowledge of the VM yet...) the actual number of fields in the struct inode
being used by the generic VM code is rather small... They could be split
away from the inode and moved into struct address_space instead if that is
where they actually belong...

Just a very basic example: Inode 0 on ntfs is a file named $MFT. It
contains two data parts: one containing the actual on-disk inode metadata
and one containing the inode allocation bitmap. At the moment I have placed
the inode metadata in the "normal" inode address_space mapping and of
course ->host points back to inode 0. But for the inode allocation bitmap I
set ->host to the current ntfs_volume structure instead and I am at present
using a special readpage function to access this address space. Making
->host point back to the original inode 0 makes no sense as i_size for
example would be the size of the other address space - the inode data one.
(And faking an inode is not fun, then I would have to keep track of fake
inodes, etc, and the kernel already restricts us to only 32-bits worth of
inodes while ntfs uses s64 for that... and inode locking then becomes even
more interesting than it already is...)

The mere fact that the VM requires to know the size of the data in an
address space just indicates to me that struct address_space ought to
contain a size field in it.

So while something should be fixed I think it is the kernel and not ntfs. (-;

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2002-04-12 01:38:35

by Linus Torvalds

Subject: Re: [prepatch] address_space-based writeback

In article <[email protected]>,
Anton Altaparmakov <[email protected]> wrote:
>At 21:20 11/04/02, Linus Torvalds wrote:
>>
>>It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
>>and notice the
>>
>> struct inode *host;
>>
>>part.
>
>Yes I know that. Why not extend address spaces beyond being just inode
>caches? An address space with its mapping is a very useful generic data
>cache object.

Sure it is. And at worst you'll just have to have a dummy inode to
attach to it.

Could we split up just the important fields and create a new level of
abstraction? Sure. But since there are no major users, and since it
would make the common case less readable, there simply isn't any point.

Over-abstraction is a real problem - it makes the code less obvious, and
if there is no real gain from it then over-abstracting things is a bad
thing.

>Just a very basic example: Inode 0 on ntfs is a file named $MFT. It
>contains two data parts: one containing the actual on-disk inode metadata
>and one containing the inode allocation bitmap. At the moment I have placed
>the inode metadata in the "normal" inode address_space mapping and of
>course ->host points back to inode 0. But for the inode allocation bitmap I
>set ->host to the current ntfs_volume structure instead and I am at present
>using a special readpage function to access this address space.

Why not just allocate a separate per-volume "bitmap" inode? It makes
sense, and means that you can then use all the generic routines to let
it be automatically written back to disk etc.

Also, don't you get the "ntfs_volume" simply through "inode->i_sb"? It
looks like you're just trying to save off the same piece of information
that you already have in another place. Redundancy isn't a virtue.

>The mere fact that the VM requires to know the size of the data in an
>address space just indicates to me that struct address_space ought to
>contain a size field in it.

It doesn't "require" it - in the sense that you can certainly use
address spaces without using those routines that assume that it has an
inode. It just means that you don't get the advantage of the
higher-level concepts.

>So while something should be fixed I think it is the kernel and not ntfs. (-;

There are a few fundamental concepts in UNIX, and one of them is
"everything is a file".

And like it or not, the Linux concept of a file _requires_ an inode (it
also requires a dentry to describe the path to that inode, and it
requires the "struct file" itself).

The notion of "struct inode" as the lowest level of generic IO entity is
very basic in the whole design of Linux. If you try to avoid it, you're
most likely doing something wrong. You think your special case is
somehow independent and more important than one of the basic building
blocks of the whole system.

Linus

2002-04-12 04:22:46

by Bill Davidsen

Subject: Re: [prepatch] address_space-based writeback

On Fri, 12 Apr 2002, Anton Altaparmakov wrote:

> Considering modern harddrives have their own intelligent write back and
> sorting of write requests, and ever-growing hd buffers, the need for doing
> this at the OS level is going to become less and less, would be my guess - I
> may be wrong of course...

Clearly this is true. However, while the benefits of o/s effort are down
on top end hardware, one of the strengths of Linux is that it runs well on
small, or embedded, or old, or totally obsolete hardware. So there is a PR
and social benefit from the efforts already invested in making the code
use the hardware carefully.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-04-12 07:57:29

by Anton Altaparmakov

Subject: Re: [prepatch] address_space-based writeback

At 02:37 12/04/02, Linus Torvalds wrote:
>In article <[email protected]>,
>Anton Altaparmakov <[email protected]> wrote:
> >At 21:20 11/04/02, Linus Torvalds wrote:
> >>
> >>It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
> >>and notice the
> >>
> >> struct inode *host;
> >>
> >>part.
> >
> >Yes I know that. Why not extend address spaces beyond being just inode
> >caches? An address space with its mapping is a very useful generic data
> >cache object.
>
>Sure it is. And at worst you'll just have to have a dummy inode to
>attach to it.
>
>Could we split up just the important fields and create a new level of
>abstraction? Sure. But since there are no major users, and since it
>would make the common case less readable, there simply isn't any point.

Ok, fair enough. While ntfs would not be the only fs to benefit, major fs
like ext2/3 don't have named streams, EAs, bitmap attributes, and other
data-holding constructs except the traditional "file contents data stream", so
they probably don't have the need for a new level of abstraction.

But that then leads to the question which fields in the dummy inode must be
initialized. My biggest concern is whether i_ino needs to be unique or
whether it can be copied from the real inode (or perhaps even better
whether I can just set it to a value like minus one to signify to ntfs
"treat as special").

ntfs will need to distinguish the dummies from the real inodes and not only
that, it will need to know which ntfs attribute they signify. Admittedly
that part is trivial as the ntfs specific part of the inode can contain a
NI_Dummy flag in the ->state bits and that can then change the meaning of
the remainder of the ntfs_inode structure.

A locking problem also arises. To write out a real inode, you need to lock
down all the dummy inodes associated with it and vice versa, so you get
into ab/ba deadlock country. Further, dirty inodes become interesting, what
happens if the real inode is evicted from ram but the attached dummies are
still around? Probably would lead to dereferencing random memory. )-: So
one would need to extend ntfs_clear_inode to kill off all dummy inodes
belonging to a real inode when the real inode is evicted. And that again
brings along the problem that some of the dummies may be dirty and/or
locked, so potential deadlocks again...

>Over-abstraction is a real problem - it makes the code less obvious, and
>if there is no real gain from it then over-abstracting things is a bad
>thing.

Fair comment but the current VM isn't exactly obvious either. </rant about
lack of vm documentation> Understanding readpage and helpers was easy but
the write code paths, which I am trying to get my head around, are not
funny once you get into the details...

> >Just a very basic example: Inode 0 on ntfs is a file named $MFT. It
> >contains two data parts: one containing the actual on-disk inode metadata
> >and one containing the inode allocation bitmap. At the moment I have placed
> >the inode metadata in the "normal" inode address_space mapping and of
> >course ->host points back to inode 0. But for the inode allocation bitmap I
> >set ->host to the current ntfs_volume structure instead and I am at present
> >using a special readpage function to access this address space.
>
>Why not just allocate a separate per-volume "bitmap" inode? It makes
>sense, and means that you can then use all the generic routines to let
>it be automatically written back to disk etc.

Yes, I could do. I guess I can grab one from fs/inode.c::new_inode(). Do I
need to provide unique i_ino though?

>Also, don't you get the "ntfs_volume" simply through "inode->i_sb"? It
>looks like you're just trying to save off the same piece of information
>that you already have in another place. Redundancy isn't a virtue.

Yes, I can get it from there but I didn't want to use the original inode as
->host as that would have had potential for breaking things in odd ways in
the VM. Using "ntfs_volume" was more likely to give more obvious "BANG"
type of errors instead... (-;

> >The mere fact that the VM requires to know the size of the data in an
> >address space just indicates to me that struct address_space ought to
> >contain a size field in it.
>
>It doesn't "require" it - in the sense that you can certainly use
>address spaces without using those routines that assume that it has an
>inode. It just means that you don't get the advantage of the
>higher-level concepts.

Indeed. And that in turn means I have to duplicate a lot of functionality
from the core kernel in ntfs which is always a Bad Thing.

For file access I can use block_read_full_page with only ONE line changed
and that is the setting of the async io completion handler - NTFS requires
its own as it has 3 different versions of i_size which need to be
interpreted after the read from device in order to determine if any parts
of the buffer head need to be zeroed out after the read (and there are 4
i_size versions for compressed files). I don't see any way to do this at an
earlier stage than io completion as some partial buffers need to be zeroed
AFTER the io from disk has completed.

If there were a way to provide my custom handler to block_read_full_page, or
at least if end_buffer_is_async allowed a hook to be placed which would
get run for each buffer head it is called with, that would decrease the
amount of code duplication quite a lot... But I guess this falls under the
"there are no major users category"... )-:

> >So while something should be fixed I think it is the kernel and not
> ntfs. (-;
>
>There are a few fundamental concepts in UNIX, and one of them is
>"everything is a file".
>
>And like it or not, the Linux concept of a file _requires_ an inode (it
>also requires a dentry to describe the path to that inode, and it
>requires the "struct file" itself).
>
>The notion of "struct inode" as the lowest level of generic IO entity is
>very basic in the whole design of Linux. If you try to avoid it, you're
>most likely doing something wrong. You think your special case is
>somehow independent and more important than one of the basic building
>blocks of the whole system.

Thanks for the reminder. I have obviously been digging around Windows/NTFS
for too long and had distanced myself from the concept. )-:

Yet this really raises the question of how to define the concept of a file.
I am quite happy with files being the I/O entity in ntfs. It is just that
each file in ntfs can contain loads of different data-holding attributes
which are all worth placing in address spaces. Granted, a dummy inode could
be set up for each of those, which just means a lot of wasted RAM, but ntfs
is not that important so I have to take the penalty there. But if I also
need unique inode numbers in those dummy inodes then the overhead becomes
very, very high...

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2002-04-15 08:48:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

An update to these patches is at

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.8/

All work at this time is on

dallocbase-10-page_alloc_fail.patch
dallocbase-35-pageprivate_fixes.patch
dallocbase-55-ext2_dir.patch
dallocbase-60-page_accounting.patch
ratcache-pf_memalloc.patch
mempool-node.patch
dallocbase-70-writeback.patch
ttd (my current 2.5 things-to-do-list, just for fun)

none of the patches beyond these have even been tested in a week...

The new buffer<->page relationship rules seem to be working
out OK. In a six-hour stress test on a quad Xeon with
1k blocksize ext3 in ordered-data mode there was one failure:
a block in a file came up with wrong data. There appears
to be a race in ext3, or the patch, or 2.5, or something somewhere.
Still hunting this one.

It is all relatively stable now. ramdisk, loop, reiserfs, ext2,
ext3-ordered, ext3-journalled, JFS and vfat have been tested.

minixfs and sysvfs are broken with these patches. They rely
on preservation of directory data outside i_size. Will fix.

-

2002-04-27 15:53:50

by Jan Harkes

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

On Fri, Apr 12, 2002 at 08:57:17AM +0100, Anton Altaparmakov wrote:
> Yet this really raises the question of how to define the concept of a file.
> I am quite happy with files being the I/O entity in ntfs. It is just that
> each file in ntfs can contain loads of different data-holding attributes
> which are all worth placing in address spaces. Granted, a dummy inode could
> be set up for each of those, which just means a lot of wasted RAM, but ntfs
> is not that important so I have to take the penalty there. But if I also
> need unique inode numbers in those dummy inodes then the overhead becomes
> very, very high...

You could have all additional IO streams use the same inode number and
use iget4. Several inodes can have the same i_ino and the additional
argument would be a stream identifier that selects the correct 'IO
identity'.
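
Roughly like this (a sketch; ntfs_attr, NTFS_I() and the stream field are
invented names, just to show the shape):

typedef struct {
        unsigned long mft_no;   /* the on-disk inode number */
        int stream;             /* which attribute / IO stream */
} ntfs_attr;

/* Tells iget4() whether an in-core inode that already matches on
 * i_ino is also the right stream. */
static int ntfs_find_actor(struct inode *inode, unsigned long ino,
                void *opaque)
{
        ntfs_attr *na = (ntfs_attr *)opaque;

        return NTFS_I(inode)->stream == na->stream;
}

static struct inode *ntfs_attr_iget(struct super_block *sb,
                unsigned long mft_no, int stream)
{
        ntfs_attr na = { mft_no, stream };

        /* Several in-core inodes may share i_ino == mft_no; the find
         * actor selects the one for this stream, or a new inode is
         * allocated if none matches. */
        return iget4(sb, mft_no, ntfs_find_actor, &na);
}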

Jan

2002-04-28 03:04:01

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

At 16:53 27/04/02, Jan Harkes wrote:
>On Fri, Apr 12, 2002 at 08:57:17AM +0100, Anton Altaparmakov wrote:
> > Yet this really raises the question of how to define the concept of a
> > file. I am quite happy with files being the I/O entity in ntfs. It is just
> > that each file in ntfs can contain loads of different data-holding
> > attributes which are all worth placing in address spaces. Granted, a dummy
> > inode could be set up for each of those, which just means a lot of wasted
> > RAM, but ntfs is not that important so I have to take the penalty there.
> > But if I also need unique inode numbers in those dummy inodes then the
> > overhead becomes very, very high...
>
>You could have all additional IO streams use the same inode number and
>use iget4. Several inodes can have the same i_ino and the additional
>argument would be a stream identifier that selects the correct 'IO
>identity'.

Great idea! I quickly looked into the implementation details, and using
iget4/read_inode2 perfectly reconciles my idea of using an address space
mapping for each ntfs attribute with the kernel's requirement of using
inodes as the I/O entity, by allowing a clean and unique mapping between
multiple inodes with the same inode number and their attributes and
address spaces.

I need to work out exactly how to do it but I will definitely go that way.
That will make everything nice and clean and get rid of the existing
kludges of passing around other types of objects instead of struct file *
to my various readpage functions. Also I will be able to have fewer
readpage functions... (-:

Thanks for the suggestion!

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-04-29 09:03:53

by Nikita Danilov

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

Anton Altaparmakov writes:
> At 16:53 27/04/02, Jan Harkes wrote:
> >On Fri, Apr 12, 2002 at 08:57:17AM +0100, Anton Altaparmakov wrote:
> > > Yet this really raises the question of how to define the concept of a
> > > file. I am quite happy with files being the I/O entity in ntfs. It is
> > > just that each file in ntfs can contain loads of different data-holding
> > > attributes which are all worth placing in address spaces. Granted, a
> > > dummy inode could be set up for each of those, which just means a lot of
> > > wasted RAM, but ntfs is not that important so I have to take the penalty
> > > there. But if I also need unique inode numbers in those dummy inodes then
> > > the overhead becomes very, very high...
> >
> >You could have all additional IO streams use the same inode number and
> >use iget4. Several inodes can have the same i_ino and the additional
> >argument would be a stream identifier that selects the correct 'IO
> >identity'.
>
> Great idea! I quickly looked into the implementation details, and using
> iget4/read_inode2 perfectly reconciles my idea of using an address space
> mapping for each ntfs attribute with the kernel's requirement of using
> inodes as the I/O entity, by allowing a clean and unique mapping between
> multiple inodes with the same inode number and their attributes and
> address spaces.
>

Please note that ->read_inode2() is a reiserfs-specific hack. Adding more
users for it would make it permanent. The only reason for ->read_inode2's
existence was that iget() was called by code external to the file system;
knfsd used to do this, but now it can call ->fh_to_dentry() instead. As
iget() is never called outside of the file system, you can set the ntfs
->read_inode to a no-op and write your own function ntfs_iget(...) to
be called from ntfs_lookup() and ntfs_fill_super().

ntfs_iget() calls iget() (->read_inode is a no-op, hence iget() doesn't
access the disk) and, if a new inode was allocated, reads data from the
disk and initializes the inode, etc.

I guess coda_iget() is an example of this.
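
Sketched (ntfs_read_inode_internal(), NTFS_I() and the initialized flag
are invented names, only to show the shape of it):

/* ->read_inode for ntfs: deliberately empty, real initialization is
 * done by ntfs_iget() below, after iget() returns. */
static void ntfs_read_inode_noop(struct inode *inode)
{
}

struct inode *ntfs_iget(struct super_block *sb, unsigned long ino)
{
        struct inode *inode = iget(sb, ino);

        if (!inode)
                return NULL;
        /* A freshly allocated inode has only been through the no-op
         * ->read_inode, so the fs-private flag is still clear. */
        if (!NTFS_I(inode)->initialized) {
                ntfs_read_inode_internal(inode); /* read the MFT record */
                NTFS_I(inode)->initialized = 1;
        }
        return inode;
}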

> I need to work out exactly how to do it but I will definitely go that way.
> That will make everything nice and clean and get rid of the existing
> kludges of passing around other types of objects instead of struct file *
> to my various readpage functions. Also I will be able to have fewer
> readpage functions... (-:
>
> Thanks for the suggestion!
>
> Best regards,
>
> Anton
>

Nikita.

2002-04-29 11:10:17

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

Hi,

I am cc:-ing Al Viro; perhaps Al could comment on the approach, as this
would affect the future of read_inode2 in the VFS?

At 10:03 29/04/02, Nikita Danilov wrote:
>Anton Altaparmakov writes:
> > At 16:53 27/04/02, Jan Harkes wrote:
> > >You could have all additional IO streams use the same inode number and
> > >use iget4. Several inodes can have the same i_ino and the additional
> > >argument would be a stream identifier that selects the correct 'IO
> > >identity'.
> >
> > Great idea! I quickly looked into the implementation details, and using
> > iget4/read_inode2 perfectly reconciles my idea of using an address space
> > mapping for each ntfs attribute with the kernel's requirement of using
> > inodes as the I/O entity, by allowing a clean and unique mapping between
> > multiple inodes with the same inode number and their attributes and
> > address spaces.
>
>Please note that ->read_inode2() is a reiserfs-specific hack. Adding more
>users for it would make it permanent. The only reason for ->read_inode2's
>existence was that iget() was called by code external to the file system;
>knfsd used to do this, but now it can call ->fh_to_dentry() instead. As
>iget() is never called outside of the file system, you can set the ntfs
>->read_inode to a no-op and write your own function ntfs_iget(...) to
>be called from ntfs_lookup() and ntfs_fill_super().
>
>ntfs_iget() calls iget() (->read_inode is a no-op, hence iget() doesn't
>access the disk) and, if a new inode was allocated, reads data from the
>disk and initializes the inode, etc.
>
>I guess coda_iget() is an example of this.

This will not work AFAICS.

coda_iget() -> iget4() -> get_new_inode(), which calls ->read_inode or
->read_inode2, and then unlocks the inode and wakes up the waiting tasks.

If ->read_inode and ->read_inode2 are NULL as you suggest for NTFS it means
that as soon as ntfs_iget() has called iget4() there will be an
uninitialized yet unlocked inode in memory which is guaranteed to cause
NTFS to oops... (And any other fs using this approach.)

Before the inode is unlocked it MUST be initialized. And the only way to do
this in the framework of the current VFS is to use ->read_inode or
->read_inode2.
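
To spell it out, here is the problem spot, condensed from
fs/inode.c:get_new_inode() (abridged, with error handling and hash
details omitted):

        inode->i_state = I_LOCK;
        spin_unlock(&inode_lock);

        /* If both methods are no-ops, nothing initializes the inode
         * here... */
        if (sb->s_op->read_inode2)
                sb->s_op->read_inode2(inode, opaque);
        else
                sb->s_op->read_inode(inode);

        /* ...but it is unlocked and the waiters are released anyway: */
        inode->i_state &= ~I_LOCK;
        wake_up(&inode->i_wait);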

Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-04-29 11:59:48

by Nikita Danilov

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

Anton Altaparmakov writes:
> Hi,
>
> I am cc:-ing Al Viro; perhaps Al could comment on the approach, as this
> would affect the future of read_inode2 in the VFS?
>
> At 10:03 29/04/02, Nikita Danilov wrote:
> >Anton Altaparmakov writes:
> > > At 16:53 27/04/02, Jan Harkes wrote:
> > > >You could have all additional IO streams use the same inode number and
> > > >use iget4. Several inodes can have the same i_ino and the additional
> > > >argument would be a stream identifier that selects the correct 'IO
> > > >identity'.
> > >
> > > Great idea! I quickly looked into the implementation details, and
> > > using iget4/read_inode2 perfectly reconciles my idea of using an address
> > > space mapping for each ntfs attribute with the kernel's requirement of
> > > using inodes as the I/O entity, by allowing a clean and unique mapping
> > > between multiple inodes with the same inode number and their attributes
> > > and address spaces.
> >
> >Please note that ->read_inode2() is a reiserfs-specific hack. Adding more
> >users for it would make it permanent. The only reason for ->read_inode2's
> >existence was that iget() was called by code external to the file system;
> >knfsd used to do this, but now it can call ->fh_to_dentry() instead. As
> >iget() is never called outside of the file system, you can set the ntfs
> >->read_inode to a no-op and write your own function ntfs_iget(...) to
> >be called from ntfs_lookup() and ntfs_fill_super().
> >
> >ntfs_iget() calls iget() (->read_inode is a no-op, hence iget() doesn't
> >access the disk) and, if a new inode was allocated, reads data from the
> >disk and initializes the inode, etc.
> >
> >I guess coda_iget() is an example of this.
>
> This will not work AFAICS.
>
> coda_iget() -> iget4() -> get_new_inode(), which calls ->read_inode or
> ->read_inode2, and then unlocks the inode and wakes up the waiting tasks.
>
> If ->read_inode and ->read_inode2 are NULL as you suggest for NTFS it means
> that as soon as ntfs_iget() has called iget4() there will be an
> uninitialized yet unlocked inode in memory which is guaranteed to cause
> NTFS to oops... (And any other fs using this approach.)

I see. While this can be worked around by adding a flag that is set after
inode initialization, it would quickly become ugly.

>
> Before the inode is unlocked it MUST be initialized. And the only way
> to do this in the framework of the current VFS is to use ->read_inode
> or ->read_inode2.
>
> Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?
>

->read_inode2 is a hack. And especially so is having both ->read_inode
and ->read_inode2. The iget() interface was based on the assumption that
inodes can be located (and identified) by inode number. That is not so, at
least for reiserfs, and ->read_inode2 works around this by passing a
"cookie" with information sufficient for the file system to locate the
inode.

I am concerned that (ab)using this cookie and ->read_inode2 to bypass the
rigid iget() is not the right way to go. What about the VFS exporting a
function that checks the hash table, creates a new inode if it is not
there, and returns it still locked? This way each file system would be
able to locate and load inodes in a way it likes, without
encoding/decoding information in the cookie.
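
Something like this (all names invented for illustration):

/* Check the inode hash for (sb, ino); if it is not there, allocate a
 * new inode, hash it, and return it still locked (I_LOCK set) and not
 * yet initialized.  An existing inode is returned fully set up, as
 * from iget(). */
struct inode *get_locked_inode(struct super_block *sb, unsigned long ino);

/* A file system would then do all reading and initialization itself: */
static struct inode *ntfs_iget(struct super_block *sb, unsigned long ino)
{
        struct inode *inode = get_locked_inode(sb, ino);

        if (inode && (inode->i_state & I_LOCK)) {
                ntfs_read_inode_internal(inode); /* fs-specific disk read */
                inode->i_state &= ~I_LOCK;       /* now safe to use */
                wake_up(&inode->i_wait);
        }
        return inode;
}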

> Best regards,
>
> Anton
>

Nikita.

>
> --

2002-04-29 12:35:30

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

At 12:59 29/04/02, Nikita Danilov wrote:
>[snip]
> > >Please note that ->read_inode2() is a reiserfs-specific hack. Adding
> > >more users for it would make it permanent. The only reason for
> > >->read_inode2's existence was that iget() was called by code external to
> > >the file system; knfsd used to do this, but now it can call
> > >->fh_to_dentry() instead. As iget() is never called outside of the file
> > >system, you can set the ntfs ->read_inode to a no-op and write your own
> > >function ntfs_iget(...) to be called from ntfs_lookup() and
> > >ntfs_fill_super().
> > >
> > >ntfs_iget() calls iget() (->read_inode is a no-op, hence iget() doesn't
> > >access the disk) and, if a new inode was allocated, reads data from the
> > >disk and initializes the inode, etc.
> > >
> > >I guess coda_iget() is an example of this.
> >
> > This will not work AFAICS.
> >
> > coda_iget() -> iget4() -> get_new_inode(), which calls ->read_inode or
> > ->read_inode2, and then unlocks the inode and wakes up the waiting tasks.
> >
> > If ->read_inode and ->read_inode2 are NULL as you suggest for NTFS it
> means
> > that as soon as ntfs_iget() has called iget4() there will be an
> > uninitialized yet unlocked inode in memory which is guaranteed to cause
> > NTFS to oops... (And any other fs using this approach.)
>
>I see. While this can be worked around by adding flag set up after inode
>initialization, this would become ugly shortly.
>
> > Before the inode is unlocked it MUST be initialized. And the only way
> > to do this in the framework of the current VFS is to use ->read_inode
> > or ->read_inode2.
> >
> > Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?
>
>->read_inode2 is a hack.
>
>And especially so is having both ->read_inode and ->read_inode2. The
>iget() interface was based on the assumption that inodes can be located
>(and identified) by inode number. That is not so, at least for reiserfs,
>and ->read_inode2 works around this by passing a "cookie" with
>information sufficient for the file system to locate the inode.
>
>I am concerned that (ab)using this cookie and ->read_inode2 to bypass the
>rigid iget() is not the right way to go.

Perhaps. But it works now and can easily be modified later if a better
approach comes along.

>What about the VFS exporting a function that checks the hash table,
>creates a new inode if it is not there, and returns it still locked? This
>way each file system would be able to locate and load inodes in a way it
>likes, without encoding/decoding information in the cookie.

Yes, that would work fine. But it would still need the iget4-like find
actor and opaque pointer to do the "check the hash table" part...

Basically, one could just split iget4 into two functions: iget, which calls
->read_inode (with read_inode2 removed), and iget4, which calls neither
->read_inode nor ->read_inode2 and returns a locked inode instead.

That would make it fs-specific.

If we wanted to make it generic we would need a special method in the
operations. But wait, we already have read_inode2! So perhaps it isn't as
much of a hack after all...

If you wanted to get rid of the hackish nature of the beast, one could just
remove ->read_inode and rename ->read_inode2 to ->read_inode, then change
all file systems to accept the extra, unused cookie parameter in their
->read_inode declarations...
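
The conversion would be mechanical; a sketch for ext2 (the wrapper name
is arbitrary, and ext2_read_inode() stands for the current method):

/* In struct super_operations the pair */
        void (*read_inode)(struct inode *);
        void (*read_inode2)(struct inode *, void *);
/* collapses into the single method */
        void (*read_inode)(struct inode *, void *opaque);

/* and a file system which needs no cookie simply ignores it: */
static void ext2_read_inode2(struct inode *inode, void *opaque)
{
        /* opaque unused: ext2 locates everything from inode->i_ino. */
        ext2_read_inode(inode);         /* the existing method body */
}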

If that would be an acceptable approach, I would be happy to do a patch to
convert all in-kernel file systems in 2.5.x.

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-04-29 13:01:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

On Mon, Apr 29, 2002 at 01:34:02PM +0100, Anton Altaparmakov wrote:
> Basically, one could just split iget4 into two functions: iget, which
> calls ->read_inode (with read_inode2 removed), and iget4, which calls
> neither ->read_inode nor ->read_inode2 and returns a locked inode instead.
>
> That would make it fs-specific.
>
> If we wanted to make it generic we would need a special method in the
> operations. But wait, we already have read_inode2! So perhaps it isn't as
> much of a hack after all...
>
> If you wanted to get rid of the hackish nature of the beast, one could just
> remove ->read_inode and rename ->read_inode2 to ->read_inode, then change
> all file systems to accept the extra, unused cookie parameter in their
> ->read_inode declarations...
>
> If that would be an acceptable approach, I would be happy to do a patch to
> convert all in-kernel file systems in 2.5.x.

Please take a look at icreate & surrounding code in the 2.5 XFS tree.
The code needs some cleanup, but I think the API is exactly what we want.

Christoph

2002-04-30 12:34:17

by Denis Vlasenko

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

On 29 April 2002 09:59, Nikita Danilov wrote:
> Anton Altaparmakov writes:
> > Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?
>
> ->read_inode2 is a hack. And especially so is having both ->read_inode
> and ->read_inode2. The iget() interface was based on the assumption that
> inodes can be located (and identified) by inode number. That is not so, at
> least for reiserfs, and ->read_inode2 works around this by passing a
> "cookie" with information sufficient for the file system to locate the
> inode.

Why do we have to stick to the concept of inode *numbers*?
Because there are inode numbers in traditional Unix filesystems?

What about reiserfs? NTFS? Even plain old FAT has trouble simulating
inode numbers for zero-length files.

Why? Because inode numbers (or the lack of them) are an fs implementation
detail which has unfortunately leaked into the Linux VFS API.

Or maybe I am just stupid.
--
vda

2002-04-30 13:13:58

by john slee

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

[ cc list trimmed ]

On Tue, Apr 30, 2002 at 03:19:17PM -0200, Denis Vlasenko wrote:
> Why do we have to stick to the concept of inode *numbers*?
> Because there are inode numbers in traditional Unix filesystems?

probably because there is software out there relying on them being
numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
as appropriate. i can't think of a use for such a construct other than
preserving hardlinks in archives (does tar do this?) but i'm sure there
are others
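
fwiw the check itself is trivial; a complete userland version:

#include <sys/stat.h>

/* two paths name the same file iff they agree on both the device
 * and the inode number */
static int same_file(const char *a, const char *b)
{
        struct stat sa, sb;

        if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
                return 0;
        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}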

like much of unix it's been there forever and has become such a natural
concept in people's heads that to change it now seems unthinkable.

much like the missing e in creat().

j.

--
R N G G "Well, there it goes again... And we just sit
I G G G here without opposable thumbs." -- gary larson

2002-04-30 13:24:06

by Billy O'Connor

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback


On Tue, Apr 30, 2002 at 03:19:17PM -0200, Denis Vlasenko wrote:
> Why do we have to stick to the concept of inode *numbers*?
> Because there are inode numbers in traditional Unix filesystems?

> like much of unix it's been there forever and has become such a natural
> concept in people's heads that to change it now seems unthinkable.
>
> much like the missing e in creat().

Wasn't that the one thing Ken Thompson said he would do differently
if he had it to do all over (Unix)?

2002-04-30 13:35:09

by James Lewis Nance

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

On Tue, Apr 30, 2002 at 08:24:26AM -0500, Billy O'Connor wrote:

> much like the missing e in creat().
>
> Wasn't that the one thing Ken Thompson said he would do differently
> if he had it to do all over(unix)?

Actually that was the only thing he said he would do differently, which
puts a little different spin on that statement :-)

Jim

2002-04-30 13:40:49

by Keith Owens

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

On Tue, 30 Apr 2002 23:15:23 +1000,
john slee <[email protected]> wrote:
>probably because there is software out there relying on them being
>numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
>as appropriate. i can't think of a use for such a construct other than
>preserving hardlinks in archives (does tar do this?) but i'm sure there
>are others

Any program that tries to preserve or detect hard links. cp, mv (files
a and b are the same file). tar, cpio, rsync -H, du, etc.

The assumption that inode numbers are unique within a mount point is
one of the reasons that NFS export does not cross mount points by
default. man exports, look for 'nohide'.

2002-04-30 16:13:01

by Peter Wächtler

[permalink] [raw]
Subject: Re: [prepatch] address_space-based writeback

john slee wrote:
> [ cc list trimmed ]
>
> On Tue, Apr 30, 2002 at 03:19:17PM -0200, Denis Vlasenko wrote:
>
>>Why do we have to stick to the concept of inode *numbers*?
>>Because there are inode numbers in traditional Unix filesystems?
>>
>
> probably because there is software out there relying on them being
> numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
> as appropriate. i can't think of a use for such a construct other than
> preserving hardlinks in archives (does tar do this?) but i'm sure there
> are others
>
> like much of unix it's been there forever and has become such a natural
> concept in people's heads that to change it now seems unthinkable.
>
> much like the missing e in creat().
>

No. Not supplying inode numbers would break unix semantics.
The kernel (and the binary loader) depends on a unique key:

major:minor device number + inode number

Otherwise, how would you decide whether a shared object is the same?
Checksumming? ;-)

And what would be different with character strings instead of numbers,
apart from more complexity?