2002-09-03 21:44:07

by Peter T. Breuer

Subject: (fwd) Re: [RFC] mount flag "direct"

Sorry I'm getting behind with the mail. Meetings, and I'm flying
tomorrow.

> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > If it doesn't cause the data to be read twice, then it ought to, and
> > I'll fix it (given half a clue as extra pay ..:-)
>
> writing then reading the same file may cause it to be read from the disk,
> but reading /foo/bar then reading /foo/bar again will not cause two reads
> of all data.

Hmm. I just did a quick check on 2.5.31, and to me it looks as though
two consecutive reads BOTH drop through to the driver. Yes, I am
certain - I've repeated the experiment 4 times reading the same 400K,
and each time the block driver registers 400 read requests.

> some filesystems go to a lot of work to organize the metadata, in
> particular in memory, to access things more efficiently; you will have to
> go into each filesystem and modify them not to do this.

Well, I'll have to divert them. Is there not some trick that can be
used? A bitmap mmapped to the device, if that's not nonsensical in
kernel space?

> in addition you will have lots of potential races as one system reads a
> block of data, modifies it, then writes it while the other system does the

Uh, I am confident that there can be no races with respect to data
writes provided I manage to make the VFS operations atomic via
appropriate shared locking. What one has to get rid of is cached
metadata state. I'm open to suggestions.

> yes this is stuff that could be added to all filesystems, but will the
filesystem maintainers let you do this major surgery to their systems?

Depends how a patch looks, I guess.

> for example the XFS and JFS teams are going to a lot of effort to maintain
> their systems to be compatible with other OSes, they probably won't

Yes.

> appreciate all the extra conditionals that you will need to put in to
> do all of this.

I don't see any conditionals. These will be methods, i.e. redirects.
Anyway, what's an if test between friends :-).

> even for ext2 there are people (including linus I believe) that are saying
> that major new features should not be added to ext2, but to a new

I agree! But I wouldn't see adding VFS ops to replace FS-specific
code as a new feature, rather a consolidation of common codes.

> filesystem forked off of ext2 (ext3 for example or a fork of it).

Peter


2002-09-03 22:14:53

by Anton Altaparmakov

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

At 22:48 03/09/02, Peter T. Breuer wrote:
>What one has to get rid of is cached metadata state. I'm open to suggestions.

This is crazy. I don't think you understand what that actually implies. I
will give you a real world example below.

> > even for ext2 there are people (including linus I believe) that are saying
> > that major new features should not be added to ext2, but to a new
>
>I agree! But I wouldn't see adding VFS ops to replace FS-specific
>code as a new feature, rather a consolidation of common codes.

But you are not consolidating common code. There is no common code. Each fs
performs allocations in completely different ways and according to
completely different strategies because of the different methods of how the
metadata is stored/organized.

And here is the promised example: what we need to do in order to read one
single byte from the middle of a large file on an NTFS volume assuming that
no metadata whatsoever is cached:

- read boot sector to find start of inode table
- read start of inode table, decompress mapping pairs array into memory (oh
shit, we are caching something already! but how else will you make sense of
compressed metadata?!?) in order to find out where on disk the inodes are
stored
- unfortunately the inode we want is not described by the start of the
inode table, drat, using the above metadata & decompressed metadata, we
determine where the next piece of the inode table is described
- we read next piece of inode table, decompress the mapping pairs array,
and finally we find where on disk the inode is located (yey!)
- we read the inode from disk
- we decompress the mapping pairs array telling us where the inode's data
is stored
- unfortunately the data byte we want is not described by the base inode of
the file, so we need to read an extent inode, but shit, we aren't caching
anything!!!, so we don't know where the inode is stored on disk, so we have
to repeat _the_whole_ procedure described above just to be able to read the
extent inode
- having repeated all the above we read the extent inode, decompress the
mapping pairs array and finally locate where on disk the data we want is
located
- we read 1 byte from disk from the determined location
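The walk above can be condensed into a toy counter, just to show how the reads multiply. This is purely illustrative: the step names mirror the walk, not real NTFS on-disk layout, and every class and method name below is invented.

```python
# Toy model of the uncached lookup chain described above. Illustrative
# only: nothing here reflects real NTFS structures.

class ToyVolume:
    def __init__(self):
        self.disk_reads = 0              # nothing cached: every step hits disk

    def read_block(self, what):
        self.disk_reads += 1             # each fetch is a real seek + read

    def locate_inode(self, which):
        # with no cache, finding any inode restarts from the boot sector
        self.read_block("boot sector")
        self.read_block("inode table start")   # decompress mapping pairs
        self.read_block("next table piece")    # target not in the first piece
        self.read_block(which + " inode")

    def read_one_byte(self):
        self.locate_inode("base")              # full walk to the base inode
        self.read_block("base mapping pairs")  # byte not mapped here...
        self.locate_inode("extent")            # ...so repeat the whole walk
        self.read_block("extent mapping pairs")
        self.read_block("data block")          # at last, the byte itself
        return self.disk_reads

print(ToyVolume().read_one_byte())   # 11 reads, each a separate seek
```

Eleven widely scattered reads for a single byte, before any locking is even considered.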

Ok, so to do a 1 byte read, we just had to perform over 10-20 reads from
very different disk locations (we are talking several seconds _just_ in
seek times, never mind read times!), we had to allocate a lot of memory
buffers to store metadata which we have read from disk temporarily, as well
as a lot of memory in order to be able to decompress the mapping pair
arrays which tell us the logical to physical block mapping.

I am completely serious, we are talking at least hundreds of milliseconds
possibly even several seconds to read that single byte.

What was that about 50GiB/sec performance again...?

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-03 22:38:31

by Peter T. Breuer

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Anton Altaparmakov wrote:"
[snip detailed description for which I am extremely grateful]
> Ok, so to do a 1 byte read, we just had to perform over 10-20 reads from
> very different disk locations (we are talking several seconds _just_ in
> seek times, never mind read times!), we had to allocate a lot of memory
> buffers to store metadata which we have read from disk temporarily, as well
> as a lot of memory in order to be able to decompress the mapping pair
> arrays which tell us the logical to physical block mapping.
>
> I am completely serious, we are talking at least hundreds of milliseconds
> possibly even several seconds to read that single byte.
>
> What was that about 50GiB/sec performance again...?

Let's maintain a single bit in the superblock that says whether any
directory structure or whatever else we're worried about has been
altered (ecch, well, it has to be a timestamp, never mind ..). Before
every read we check this "bit" ondisk. If it's not set, we happily dive
for our data where we expect to find it. Otherwise we go through the
rigmarole you describe.
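A sketch of that check, in Python for brevity: the superblock "bit" becomes a generation stamp that every reader compares before trusting its cache. SharedDisk and Node are invented names, and the on-disk read is simulated by a plain field access.

```python
# Sketch of the superblock generation-stamp idea; all names invented.

class SharedDisk:
    def __init__(self):
        self.generation = 0          # bumped by any node changing metadata

class Node:
    def __init__(self, disk):
        self.disk = disk
        self.seen = 0                # generation our cache was built against
        self.cache = {}              # cached lookup results

    def read(self, path):
        if self.disk.generation != self.seen:   # the cheap ondisk check
            self.cache.clear()                  # metadata moved: forget all
            self.seen = self.disk.generation
        if path not in self.cache:              # the expensive rigmarole
            self.cache[path] = "blocks-of-" + path
        return self.cache[path]

disk = SharedDisk()
node = Node(disk)
node.read("/data/run1")      # slow lookup, then cached
disk.generation += 1         # some other node altered directory structure
node.read("/data/run1")      # stamp mismatch: cache dropped, full re-lookup
```

Racing writers to the stamp are harmless (the most recent wins); the hard part, as discussed below, is the read side.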

Maybe our programs aren't going to do unexpected things with the file
structures. Maybe our file systems satisfy assumptions like not
moving existing data ondisk to make room for other data. I'd be willing
to only consider such systems as sane enough to work with in a
distributed shared environment.

Can we improve the single-bit approach? Yes. The FS is a tree. When
we make a change in it we can set a bit everywhere above the change,
all the way to the root. When we observe the root bit changed, we
can begin to retrace the path to our data, but abandon the retrace
when the bit-trail we are following down towards our data turns cold.
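The bit-trail might look something like this sketch (pure illustration; Dir and the helper names are made up): a writer sets the bit on every directory up to the root, and a reader re-walks from the root only while the trail stays hot.

```python
# Sketch of the per-directory change bit. A writer sets a bit on every
# node from the change up to the root; a reader revalidates downwards
# and stops as soon as the trail goes cold. Names are invented.

class Dir:
    def __init__(self, name, children=()):
        self.name, self.changed = name, False
        self.children = {c.name: c for c in children}

def mark_changed(path_of_dirs):
    for d in path_of_dirs:          # change at the leaf: set bits up to root
        d.changed = True

def must_revalidate(root, names):
    """Return the directories whose cached view must be re-read from disk."""
    stale, node = [], root
    for name in names:
        if not node.changed:
            return stale            # trail went cold: cache below is fine
        stale.append(node)
        node.changed = False        # clear the bit as we revalidate
        node = node.children[name]
    return stale
```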

No?

Peter

2002-09-03 22:55:30

by Xavier Bestel

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Wed 04/09/2002 at 00:42, Peter T. Breuer wrote:

> Let's maintain a single bit in the superblock that says whether any
> directory structure or whatever else we're worried about has been
> altered (ecch, well, it has to be a timestamp, never mind ..). Before
> every read we check this "bit" ondisk. If it's not set, we happily dive
> for our data where we expect to find it. Otherwise we go through the
> rigmarole you describe.

Won't work. You would need an atomic read-and-write operation for that
(read previous timestamp and write a special timestamp meaning
"currently writing this block"), and you don't have that.

Xav


2002-09-03 23:09:55

by David Lang

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> > in addition you will have lots of potential races as one system reads a
> > block of data, modifies it, then writes it while the other system does the
>
> Uh, I am confident that there can be no races with respect to data
> writes provided I manage to make the VFS operations atomic via
> appropriate shared locking. What one has to get rid of is cached
> metadata state. I'm open to suggestions.

well you will have to change the filesystems to attempt your new atomic
VFS operations; today the kernel ext2 driver can just acquire a lock,
diddle with whatever it wants, and write the result. It doesn't do
anything that the VFS will see as atomic.

David Lang

2002-09-03 23:27:02

by David Lang

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

but now you are changing the on-disk format of each filesystem.

good luck in getting microsoft to change NTFS to allow you to do this (let
alone the others like XFS and JFS that are maintaining compatability with
other operating systems)

David Lang

On Wed, 4 Sep 2002, Peter T. Breuer wrote:

> Date: Wed, 4 Sep 2002 00:42:47 +0200 (MET DST)
> From: Peter T. Breuer <[email protected]>
> To: Anton Altaparmakov <[email protected]>
> Cc: Peter T. Breuer <[email protected]>, [email protected],
> [email protected]
> Subject: Re: (fwd) Re: [RFC] mount flag "direct"
>
> "Anton Altaparmakov wrote:"
> [snip detailed description for which I am extremely grateful]
> > Ok, so to do a 1 byte read, we just had to perform over 10-20 reads from
> > very different disk locations (we are talking several seconds _just_ in
> > seek times, never mind read times!), we had to allocate a lot of memory
> > buffers to store metadata which we have read from disk temporarily, as well
> > as a lot of memory in order to be able to decompress the mapping pair
> > arrays which tell us the logical to physical block mapping.
> >
> > I am completely serious, we are talking at least hundreds of milliseconds
> > possibly even several seconds to read that single byte.
> >
> > What was that about 50GiB/sec performance again...?
>
> Let's maintain a single bit in the superblock that says whether any
> directory structure or whatever else we're worried about has been
> altered (ecch, well, it has to be a timestamp, never mind ..). Before
> every read we check this "bit" ondisk. If it's not set, we happily dive
> for our data where we expect to find it. Otherwise we go through the
> rigmarole you describe.
>
> Maybe our programs aren't going to do unexpected things with the file
> structures. Maybe our file systems satisfy assumptions like not
> moving existing data ondisk to make room for other data. I'd be willing
> to only consider such systems as sane enough to work with in a
> distributed shared environment.
>
> Can we improve the single-bit approach? Yes. The FS is a tree. When
> we make a change in it we can set a bit everywhere above the change,
> all the way to the root. When we observe the root bit changed, we
> can begin to retrace the path to our data, but abandon the retrace
> when the bit-trail we are following down towards our data turns cold.
>
> No?
>
> Peter
>

2002-09-03 23:39:52

by Peter T. Breuer

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"A month of sundays ago Xavier Bestel wrote:"
> On Wed 04/09/2002 at 00:42, Peter T. Breuer wrote:
>
> > Let's maintain a single bit in the superblock that says whether any
> > directory structure or whatever else we're worried about has been
> > altered (ecch, well, it has to be a timestamp, never mind ..). Before
> > every read we check this "bit" ondisk. If it's not set, we happily dive
> > for our data where we expect to find it. Otherwise we go through the
> > rigmarole you describe.
>
> Won't work. You would need an atomic read-and-write operation for that

I'm proposing (elsewhere) that I be allowed to generate special block-layer
requests from VFS, which act as "tags" to impose order on other requests
at the shared disk resource. But ...

> (read previous timestamp and write a special timestamp meaning
> "currently writing this block"), and you don't have that.

I believe we only have to write it when we change metadata, and
read it when we are about to use cached metadata. Racing to write it
doesn't matter, since the most recent wins, which is what we want.

Umm .. there is a bad race to read. We can read, think nothing has
changed, then read and find things shifted out from under. We need
to take a FS global lock when reading! Umm. No. We need to take
a global lock against changing metadata when reading, not against
arbitrary changes. And that can be done by issuing a tag request.

It's late.

Peter

2002-09-04 04:22:46

by Kevin O'Connor

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Tue, Sep 03, 2002 at 11:19:21PM +0100, Anton Altaparmakov wrote:
> At 22:48 03/09/02, Peter T. Breuer wrote:
> >What one has to get rid of is cached metadata state. I'm open to suggestions.
>
> This is crazy. I don't think you understand what that actually implies. I
> will give you a real world example below.

It is crazy, but probably achievable. One could wrap all VFS ops (read,
write, rename, unlink, etc) to do:

distributed_lock()
mount()
op()
umount()
distributed_unlock()
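That wrap can be pictured with a deliberately naive sketch like the following; distributed_lock and friends are stand-ins rather than real kernel interfaces, and a plain threading.Lock plays the part of the distributed lock.

```python
# Naive sketch of wrapping every VFS op in lock/mount/op/umount/unlock.
# The callables are stand-ins from the pseudocode, not real interfaces.
from contextlib import contextmanager
import threading

@contextmanager
def exclusive_op(lock, mount, umount):
    with lock:            # distributed_lock(): cluster-wide exclusion
        mount()           # fresh view of the disk, no stale cached state
        try:
            yield         # the single wrapped VFS op runs here
        finally:
            umount()      # write back and drop everything we cached

# usage: every read/write/rename/unlink pays a full mount cycle
events = []
with exclusive_op(threading.Lock(),
                  lambda: events.append("mount"),
                  lambda: events.append("umount")):
    events.append("op")
print(events)   # ['mount', 'op', 'umount']
```

Correct, but every operation pays the full mount/umount cost, which is the crux of the performance objection.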

[...]
> I am completely serious, we are talking at least hundreds of milliseconds
> possibly even several seconds to read that single byte.
>
> What was that about 50GiB/sec performance again...?

Well, you'll probably get 50B/sec from it..

-Kevin

--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------

2002-09-04 07:02:25

by Alexander Viro

Subject: Re: (fwd) Re: [RFC] mount flag "direct"



On Wed, 4 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago Xavier Bestel wrote:"
> > On Wed 04/09/2002 at 00:42, Peter T. Breuer wrote:
> >
> > > Let's maintain a single bit in the superblock that says whether any
> > > directory structure or whatever else we're worried about has been
> > > altered (ecch, well, it has to be a timestamp, never mind ..). Before
> > > every read we check this "bit" ondisk. If it's not set, we happily dive
> > > for our data where we expect to find it. Otherwise we go through the
> > > rigmarole you describe.
> >
> > Won't work. You would need an atomic read-and-write operation for that
>
> I'm proposing (elsewhere) that I be allowed to generate special block-layer
> requests from VFS, which act as "tags" to impose order on other requests
> at the shared disk resource. But ...

Get. Real.

VFS has no sodding notion of blocks, let alone the ordering needed for
a particular filesystem.

As soon as you are starting to talk about "superblocks" (on-disk ones, that
is) - you can forget about generic layers.

As far as I'm concerned, the feature in question is fairly pointless for
the things it can be used for and hopeless for the things you want to
get. Ergo, it's vetoed.

There is more to coherency and preserving fs structure than "don't cache
<something>". And issues involved here clearly belong to filesystem -
generic code simply has not enough information _and_ the nature of the
solution deeply depends on fs in question. IOW, your grand idea is
hopeless. End of discussion.

2002-09-04 08:57:39

by Peter T. Breuer

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Alexander Viro wrote:"
> On Wed, 4 Sep 2002, Peter T. Breuer wrote:
> > I'm proposing (elsewhere) that I be allowed to generate special block-layer
> > requests from VFS, which act as "tags" to impose order on other requests
> > at the shared disk resource. But ...
>
> Get. Real.

Ok.

> VFS has no sodding idea of notion of blocks, let alone ordering needed for
> a particular filesystem.

True, and so what? Do you have any other irrelevancies on your mind?
Speak now ...

(it's US who can use VFS to implement atomicity wrt VFS operations,
by allowing a special block request to be sent at the start of
each VFS operation and another at the end, and allowing those special
requests to be interpreted on the resource as "give me the lock now" and
"release the lock now" respectively - maintaining ordering will do the
rest. We just want to avoid interleaving during this sequence)
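The parenthetical above can be rendered as a toy request queue (entirely schematic: no real block-layer interfaces, and it assumes the requests queued while the lock is held carry no nested LOCK tags of their own):

```python
# Toy of the "tag request" scheme: LOCK/UNLOCK markers travel down the
# same ordered queue as the reads/writes they bracket, and the shared
# resource refuses to interleave other nodes' requests in between.

def service(requests):
    """requests: (node, kind) pairs in arrival order at the shared disk."""
    served, waiting, owner = [], [], None
    for node, kind in requests:
        if owner not in (None, node):
            waiting.append((node, kind))  # another node holds the lock
        elif kind == "LOCK":
            owner = node                  # tag at the start of a VFS op
        elif kind == "UNLOCK":
            owner = None                  # tag at the end of the op
            served.extend(waiting)        # naive: replay what queued up
            waiting = []
        else:
            served.append((node, kind))
    return served

print(service([(1, "LOCK"), (1, "write A"), (2, "write X"),
               (1, "UNLOCK"), (2, "write Y")]))
# node 2's write cannot interleave inside node 1's bracketed op
```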

> As soon as you are starting to talk about "superblocks" (on-disk ones, that
> is) - you can forget about generic layers.

Fine. Let's not speak then.

> As far as I'm concerned, the feature in question is fairly pointless for

Well, "as far as I'm concerned", would you mind producing technical
options instead of (opinionated nonrational emotional idiotic
nonsensical) _judgments_? Options can be evaluated. Judgments cannot.

Or are you just going to go on an insult binge and bring up your hamburger?

> the things it can be used for and hopeless for the things you want to
> get. Ergo, it's vetoed.

Ergo WHAT'S vetoed? Your big toe?

> There is more to coherency and preserving fs structure than "don't cache

Sure. So what? What's wrong with a O_DIRDIRECT flag that makes all
opens retrace the path from the root fs _on disk_ instead of from the
directory cache?

I suggest that changing FS structure is an operation that is so
relatively rare in the projected environment (in which gigabytes of
/data/ are streaming through every second) that you can make them as
expensive as you like and nobody will notice. Your frothing at the
mouth about it isn't going to change that. Moreover, _opening_
a file is a rare operation too, relative to all that data thruput.

So go ahead, worry about it.

> <something>". And issues involved here clearly belong to filesystem -
> generic code simply has not enough information _and_ the nature of the
> solution deeply depends on fs in question. IOW, your grand idea is
> hopeless. End of discussion.

Anyone ever tell you that you have an insight problem? You are
evaluating the situation using the wrong objective functions as a
metric. Don't use your metric, use the one appropriate to the
situation. Nobody could care less how long it takes to open a file
or do a mkdir, and even if they did care it would take exactly as long
as it does on my 486 right now, which doesn't scare the pants off me.

What we/I want is a simple way to put whatever FS we want on a shared
remote resource. It doesn't matter if you think it's going to be slow
in some aspects, it'll be fast enough, because those aspects merely
have to be correct, not fast.


Peter

2002-09-04 11:00:37

by Anton Altaparmakov

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Wed, 4 Sep 2002, Peter T. Breuer wrote:
> I suggest that changing FS structure is an operation that is so
> relatively rare in the projected environment (in which gigabytes of
> /data/ are streaming through every second) that you can make them as
> expensive as you like and nobody will notice. Your frothing at the
> mouth about it isn't going to change that. Moreover, _opening_
> a file is a rare operation too, relative to all that data thruput.

Sorry, but this really shows your lack of understanding of how a file
system works. Every time you write a single byte(!!!) to a file, this
involves modifying fs structures. Even if you do writes in 1MiB chunks,
what happens is that all the writes are broken down into buffer head sized
portions for the purposes of mapping them to disk (this is optimized by
the get_blocks interface but still it means that every time get_blocks is
involved you have to do a full lookup of the logical block and map it to
an on disk block). For reading while you are not modifying fs structures
you still need to read and parse them for each get_blocks call.

This in turn means that each call to get_blocks within the direct_IO code
paths will result in a full block lookup in the filesystem driver.

I explained in a previous post how incredibly expensive that is.

So even though you are streaming GiB of /data/ you will end up streaming
TiB of /metadata/ for each GiB of /data/. Is that so difficult to
understand?
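As a back-of-envelope, with made-up but modest numbers: a 1 MiB write split into 512-byte buffer-head portions needs one get_blocks lookup per portion, and with nothing cached every lookup is a multi-read metadata walk.

```python
# Back-of-envelope only; reads_per_lookup is a guess at the low end of
# the uncached metadata walk described earlier in the thread.
chunk = 1 * 1024 * 1024           # one 1 MiB write
portion = 512                     # buffer-head-sized pieces
lookups = chunk // portion        # get_blocks calls needed
reads_per_lookup = 10             # uncached metadata reads per lookup
print(lookups, lookups * reads_per_lookup)   # 2048 lookups, 20480 reads
```

Tens of thousands of extra metadata reads to move one megabyte of data.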

Unless you allow the FS to cache the /metadata/ you have already lost all
your performance and you will never be able to stream at the speeds you
require.

So far you are completely ignoring my comments. Is that because you see
they are true and cannot come up with a counter argument?

> Anyone ever tell you that you have an insight problem? You are
> evaluating the situation using the wrong objective functions as a
> metric. Don't use your metric, use the one appropriate to the
> situation. Nobody could care less how long it takes to open a file
> or do a mkdir, and even if they did care it would take exactly as long
> as it does on my 486 right now, which doesn't scare the pants off me.

We do care about such things a lot! What you are saying is true in your
extremely specialised scientific application. In normal usage patterns
file creation, etc, are crucially important to be fast. For example file
servers, email servers, etc create/delete huge amounts of files per
second.

> What we/I want is a simple way to put whatever FS we want on a shared
> remote resource. It doesn't matter if you think it's going to be slow
> in some aspects, it'll be fast enough, because those aspects merely
> have to be correct, not fast.

Well normal users care about fast, sorry. Nobody will agree to making the
generic kernel cater for your specialised application which will be used
on a few systems on the planet when you would penalize 99.99999% of Linux
users with your solution.

The only viable solution which can enter the generic kernel is to
implement what you suggest at the FS level, not the VFS/block layer
levels.

Of course if you intend to maintain your solution outside the kernel as a
patch of some description until the end of time then that's fine. It is
irrelevant what solution you choose and no one will complain as you are not
trying to force it onto anyone else.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-04 11:35:12

by Peter T. Breuer

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"A month of sundays ago Anton Altaparmakov wrote:"
> On Wed, 4 Sep 2002, Peter T. Breuer wrote:
> > I suggest that changing FS structure is an operation that is so
> > relatively rare in the projected environment (in which gigabytes of
> > /data/ are streaming through every second) that you can make them as
> > expensive as you like and nobody will notice. Your frothing at the
> > mouth about it isn't going to change that. Moreover, _opening_
> > a file is a rare operation too, relative to all that data thruput.
>
> Sorry, but this really shows your lack of understanding of how a file
> system works. Every time you write a single byte(!!!) to a file, this
> involves modifying fs structures. Even if you do writes in 1MiB chunks,

OK .. in what way?

> what happens is that all the writes are broken down into buffer head sized
> portions for the purposes of mapping them to disk (this is optimized by

Well, they'll be broken down, yes, quite probably.

> the get_blocks interface but still it means that every time get_blocks is

You mean that the buffer cache is looked up? But we know that we have
disabled caching on this device ... well, carry on anyway (it does
no harm to write to an already dirty buffer).

> involved you have to do a full lookup of the logical block and map it to
> an on disk block). For reading while you are not modifying fs structures
> you still need to read and parse them for each get_blocks call.

I'm not following you. It seems to me that you are discussing the
process of getting a buffer to write to prior to letting it age and
float down to the device driver via the block layers. But we have
disabled caching so we know that get_blocks will deliver a new
temporary buffer or block or something or at any rate do what it
should do ... anyway, I mumble.... what you are saying is that you
think we look up a logical block number and get a physical block
number and possibly a buffer associated with it. Well, maybe. So?

> This in turn means that each call to get_blocks within the direct_IO code
> paths will result in a full block lookup in the filesystem driver.

Uh, I'm not sure I understand any of this. You are saying something
about logical/physical that I don't follow or don't know. In
direct IO, one of the things that I believe happens is that writes are
serialized, in order to maintain semantics when we go RWRW (we don't
want it to work like WRWR or other), so that overlaps cannot happen.
We should never be in the situation of having a dirty (or
cached) buffer that we rewrite before it goes to disk (again).

Is that relevant?

> I explained in a previous post how incredibly expensive that is.

Well, I simply don't follow you here. Can you unmuddle my understanding
of what you are saying about logical/physical?


> So even though you are streaming GiB of /data/ you will end up streaming
> TiB of /metadata/ for each GiB of /data/. Is that so difficult to
> understand?

Yep. Can you be concrete about this metadata? I'll see if I can work
it out .. what I think you must be saying is that when we write to
a file, we write to a process address space, and that has to be
translated into a physical block number on disk. Well, but I imagine
that the translation was settled at the point that the inode was first
obtained, and now that we have the inode, data that goes to it gets
a physical address from the data in the inode. We can even be using an
inode that is unconnected to the FS, and we will still be writing away
to disk, and nobody else will be using that space, because the space
is marked as occupied in the bit map.

I see no continuous lookup of metadata.

> Unless you allow the FS to cache the /metadata/ you have already lost all

What metadata? I only see the inode, and that's not cached in any
meaningful sense (i.e. not available to other things) but simply
"held" in kmem storage and pointed to.

> your performance and you will never be able to stream at the speeds you
> require.

Can you be even MORE specific about what this metadata is? Maybe I'll
get it if you are very very specific.

> So far you are completely ignoring my comments. Is that because you see
> they are true and cannot come up with a counter argument?

No, I know of no comments that I have deliberately ignored. Bear in
mind that I have a plane to catch at 5.30 and an article to finish
before then :-).

> > situation. Nobody could care less how long it takes to open a file
> > or do a mkdir, and even if they did care it would take exactly as long
> > as it does on my 486 right now, which doesn't scare the pants off me.
>
> We do care about such things a lot! What you are saying is true in your
> extremely specialised scientific application. In normal usage patterns

Well, there are a lot of such scientific applications, and they are
taking up a good slice of the computing budgets, and a huge number of
machines ... so I don't think you can ignore them in a meaningful way
:-).

> file creation, etc, are crucially important to be fast. For example file
> servers, email servers, etc create/delete huge amounts of files per
> second.

Yep.


> > What we/I want is a simple way to put whatever FS we want on a shared
> > remote resource. It doesn't matter if you think it's going to be slow
> > in some aspects, it'll be fast enough, because those aspects merely
> > have to be correct, not fast.
>
> Well normal users care about fast, sorry. Nobody will agree to making the

Normal users should see no difference, because they won't turn on
O_DIRDIRECT, or whatever, and it should make no difference to them
that the dir cache /can/ be turned off. That should be a goal, anyway.
Can it be done? I think so, at least if it's restricted to directory
lookup in the first instance, but I would like your very concrete
example of what cached metadata is changed when I do a data write.

> generic kernel cater for your specialised application which will be used
> on a few systems on the planet when you would penalize 99.99999% of Linux
> users with your solution.

Where is the penalty?

> The only viable solution which can enter the generic kernel is to
> implement what you suggest at the FS level, not the VFS/block layer
> levels.

Why? Having VFS ops available does not mean you are obliged to use
them! And using them means swapping them in via a pointer indirection,
not always testing a flag.

> Of course if you intend to maintain your solution outside the kernel as a
> patch of some description until the end of time then that's fine. It is
> irrelevant what solution you choose and noone will complain as you are not
> trying to force it onto anyone else.

I think you are imagining implementations that are simply not the way
I imagine them. Can you be specific about the exact metadata that is
changed when a data write is done? That will help me decide.

Thanks!

Peter

2002-09-04 12:45:10

by Ragnar Kjørstad

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Wed, Sep 04, 2002 at 11:02:07AM +0200, Peter T. Breuer wrote:
> > There is more to coherency and preserving fs structure than "don't cache
>
> Sure. So what? What's wrong with a O_DIRDIRECT flag that makes all
> opens retrace the path from the root fs _on disk_ instead of from the
> directory cache?

Did you read Anton's post about this?

> I suggest that changing FS structure is an operation that is so
> relatively rare in the projected environment (in which gigabytes of
> /data/ are streaming through every second) that you can make them as
> expensive as you like and nobody will notice.

Why do you want a filesystem if you're not going to use any filesystem
operations? If all you want to do is to split your shared device into
multiple (static) logical units, use a logical volume manager.

If you _do_ need a filesystem, use something like gfs. Have you looked
at it at all?



--
Ragnar Kjørstad
Big Storage

2002-09-04 12:50:29

by Peter T. Breuer

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"A month of sundays ago Ragnar Kjørstad wrote:"
> On Wed, Sep 04, 2002 at 11:02:07AM +0200, Peter T. Breuer wrote:
> > > There is more to coherency and preserving fs structure than "don't cache
> >
> > Sure. So what? What's wrong with a O_DIRDIRECT flag that makes all
> > opens retrace the path from the root fs _on disk_ instead of from the
> > directory cache?
>
> Did you read Anton's post about this?

Yep. My reply is out there.

> > I suggest that changing FS structure is an operation that is so
> > relatively rare in the projected environment (in which gigabytes of
> > /data/ are streaming through every second) that you can make them as
> > expensive as you like and nobody will notice.
>
> Why do you want a filesystem if you're not going to use any filesystem
> operations? If all you want to do is to split your shared device into

But I said we DO want to use FS operations - just nowhere near as often
as we want to treat the data streaming through the file system (i.e.
"strawman"), so the speed of metadata operations on the FS is apparently
not an issue.

> multiple (static) logical units, use a logical volume manager.

My experiments with the current lvm have convinced me that it is a
horror show with no way sometimes of rediscovering its own consistent
state even on one node. I'd personally prefer there to be no lvm, right
now!

> If you _do_ need a filesystem, use something like gfs. Have you looked
> at it at all?

The point is not to choose a file system, but to be able to use
whichever one is preferable _at the time_. This is important.
Different FSs have different properties, and if one is 10% faster than
another for a different data load, then the faster FS can be put
in and 10% more data can be collected in the time slot allocated (and
these time slots cost "the earth" :-).

Peter

2002-09-04 13:18:07

by Ragnar Kjørstad

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Wed, Sep 04, 2002 at 02:54:55PM +0200, Peter T. Breuer wrote:
> > > Sure. So what? What's wrong with a O_DIRDIRECT flag that makes all
> > > opens retrace the path from the root fs _on disk_ instead of from the
> > > directory cache?
> >
> > Did you read Anton's post about this?
>
> Yep. My reply is out there.

I think you missed one. Anton explained in detail what NTFS would have
to do to write a single byte if _nothing_ was cached on the client. I
think it failed to mention that the whole operation would have to be
executed with a lock - so much for distributed operation.

> > Why do you want a filesystem if you're not going to use any filesystem
> > operations? If all you want to do is to split your shared device into
>
> But I said we DO want to use FS operations - just nowhere near as often
> as we want to treat the data streaming through the file system (i.e.
> "strawman"), so the speed of metadata operations on the FS is apparently
> not an issue.

Remember that append causes metadata updates, so the only thing you can
do without worrying about the speed of metadata updates is read/rewrite.
(assuming you hack the filesystems to turn off timestamp updates)
And even those operations are highly dependent on "metadata operations"
- they need metadata to know where to read/rewrite. Again, read Anton's
post about this.

> > multiple (static) logical units, use a logical volume manager.
>
> My experiments with the current lvm have convinced me that it is a
> horror show with no way sometimes of rediscovering its own consistent
> state even on one node. I'd personally prefer there to be no lvm, right
> now!

Currently there is no volume-manager with cluster support on linux
(unless Veritas has one?), but both Sistina and IBM are working on it
for LVM2 and EVMS - I'm sure patches will be accepted to speed up the
process.

> > If you _do_ need a filesystem, use something like gfs. Have you looked
> > at it at all?
>
> The point is not to choose a file system, but to be able to use
> whichever one is preferable _at the time_. This is important.
> Different FSs have different properties, and if one is 10% faster than
> another for a different data load, then the faster FS can be put
> in and 10% more data can be collected in the time slot allocated (and
> these time slots cost "the earth" :-).

Are you referring to the fact that some filesystems handle certain
workloads better than others? E.g. reiserfs is really fast at
manipulating directory structures, adding new files or removing old
ones? But didn't you say that all you cared about was read and rewrite?
That all other filesystem operations were so rare that nobody would
notice?

In addition to this being technically impossible (to create a _working_
solution), I think the motivation is seriously flawed. The "solution"
you're proposing wouldn't be suitable for any viable problem.



--
Ragnar Kjørstad
Big Storage

2002-09-04 14:08:37

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Wed, 4 Sep 2002, Peter T. Breuer wrote:
> "A month of sundays ago Anton Altaparmakov wrote:"
> > On Wed, 4 Sep 2002, Peter T. Breuer wrote:
> > > I suggest that changing FS structure is an operation that is so
> > > relatively rare in the projected environment (in which gigabytes of
> > > /data/ are streaming through every second) that you can make them as
> > > expensive as you like and nobody will notice. Your frothing at the
> > > mouth about it isn't going to change that. Moreover, _opening_
> > > a file is a rare operation too, relative to all that data thruput.
> >
> > Sorry but this really shows your lack of understanding of how a file
> > system works. Every time you write a single byte(!!!) to a file, this
> > involves modifying fs structures. Even if you do writes in 1MiB chunks,
>
> OK .. in what way?

Did you read my post, which you can look up at the URL below?

http://marc.theaimsgroup.com/?l=linux-kernel&m=103109165717636&w=2

That explains what a single byte write to an uncached ntfs volume entails.
(I failed to mention there that you would actually need to read the block
first, modify the one byte, and then write it back, but if you write
blocksize-based chunks at once the read-modify-write falls away.
And as someone else pointed out, I failed to mention that the entire
operation would have to keep the entire fs locked.)

If it still isn't clear let me know and I will attempt to explain again in
simpler terms...

> > what happens is that all the writes are broken down into buffer head sized
> > portions for the purposes of mapping them to disk (this is optimized by
>
> Well, they'll be broken down, yes, quite probably.

Not probably. That is how the VFS + FS work! The VFS talks to the FS
drivers and tells them "need to do i/o on the file starting at offset X
into the file for blocksize bytes [this is what I call logical block, i.e.
a block within a file]" and the FS replies "the logical block you
requested is located on physical disk /dev/hda1 at offset Y on the disk
and has the size Z [this is the physical block I am talking about]".

This communication traditionally happens for each blocksize chunk of a
read/write request. The get_blocks() interface optimizes this so that the
size Z above can be bigger than the blocksize, which tells the VFS that if
it wants to read or write the next adjacent block it is located adjacent
to the physical block being returned, i.e. the file data is physically
contiguous on disk. But this is just an optimization, so you can ignore the
get_blocks() complication for the time being; just understand that for
each read/write request, no matter what its size, the VFS breaks the
request into tiny pieces of blocksize bytes (blocksize is dependent on the
FS, e.g. NTFS always uses blocksize = 512 bytes, so all i/o happens in
chunks of 512 bytes).
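
The breakup Anton describes can be sketched in user-space C. This is an illustrative model, not the real kernel interface: `fs_get_block`, the `mapping` struct, and the trivial contiguous layout are all made-up stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the breakup described above: the VFS asks the
 * FS to map every blocksize piece of a request separately. */

#define BLOCKSIZE 512                  /* NTFS always uses 512-byte blocks */

struct mapping { long phys_block; };   /* the FS's answer: disk location */

/* Stand-in for the FS driver's get_block(): a trivial contiguous layout
 * starting at disk block 1000, purely for illustration. */
static void fs_get_block(long logical, struct mapping *m)
{
    m->phys_block = 1000 + logical;
}

/* Issue one read/write of `count` bytes at byte `offset` (count > 0)
 * and return how many times the FS had to be consulted. */
static int do_request(size_t offset, size_t count)
{
    int calls = 0;
    struct mapping m;
    size_t first = offset / BLOCKSIZE;
    size_t last  = (offset + count - 1) / BLOCKSIZE;

    for (size_t b = first; b <= last; b++) {
        fs_get_block((long)b, &m);   /* one full lookup per 512-byte piece */
        calls++;
    }
    return calls;
}
```

A 1 MiB request costs 2048 separate mappings; with nothing cached, each of those becomes a full on-disk metadata walk.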

> > the get_blocks interface but still it means that every time get_blocks is
>
> You mean that the buffer cache is looked up? But we know that we have
> disabled caching on this device ... well, carry on anyway (it does
> no harm to write to an already dirty buffer).

No, the buffer cache doesn't exist any more! There is no such thing in the
current kernels. However, a buffer_head is used to describe a disk block,
this is the "language" used by the VFS to talk to the FS drivers. In
particular the VFS says: "here is a buffer_head, I want this buffer_head
to be made to describe where on disk is located the data from file blah,
offset into file X, size of data blocksize" and the FS looks inside the
inode structure (this is cached, but under your scheme it would no longer
be cached!), and searches an array (this is again cached but under your
scheme it wouldn't be!) which maps between logical blocks and physical
blocks, i.e. it says the first 512 bytes in the file are located on disk
block X, the second 512 bytes are on disk block Y, etc. (the
implementations of such things are completely FS dependent; in NTFS each
element of the array looks like this: s64 file block number X, s64 disk
block number Y, s64 size of this block L, then the next element describes
file block number X+L, disk block number Z, size of block M, where X, Y,
Z, L, and M are arbitrary s64 numbers). So armed with the knowledge where
the block is on disk, the FS driver maps the buffer head, i.e. it sets
three fields in the buffer_head structure: the device, the offset into the
device, the size of the block, and it also sets the BH_Mapped flag. Then
the FS gives the buffer_head back to the VFS and the VFS now knows where
to read or write to.

Ok, this hopefully explains to you what logical (file) and physical (disk)
blocks are and how the conversion between the two is done.
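
A rough user-space sketch of that mapping dance, with an NTFS-style runlist. The struct fields and names below are simplified stand-ins for illustration, not the real `struct buffer_head` or the real NTFS runlist format.

```c
#include <assert.h>

/* Illustrative sketch of the "language" described above: the VFS hands
 * the FS a buffer_head-like struct and the FS fills in device, offset
 * and size, then sets the mapped flag. */

struct bh {
    const char *dev;       /* which device the block lives on */
    long long   blocknr;   /* offset into that device, in blocks */
    unsigned    size;      /* size of the block in bytes */
    int         mapped;    /* stands in for the BH_Mapped flag */
};

/* One element of an NTFS-style runlist: file block X maps to disk
 * block Y for L consecutive blocks. */
struct run { long long file_block, disk_block, len; };

/* FS-side mapping: search the runlist for the logical block and fill in
 * the buffer_head. Returns 0 on success, -1 if the block isn't found. */
static int fs_map_bh(struct bh *bh, long long logical,
                     const struct run *rl, int nruns)
{
    for (int i = 0; i < nruns; i++) {
        if (logical >= rl[i].file_block &&
            logical <  rl[i].file_block + rl[i].len) {
            bh->dev     = "/dev/hda1";   /* illustrative device name */
            bh->blocknr = rl[i].disk_block + (logical - rl[i].file_block);
            bh->size    = 512;
            bh->mapped  = 1;
            return 0;
        }
    }
    return -1;
}
```

Under Anton's no-cache scheme, the runlist handed to `fs_map_bh` would have to be re-read and decompressed from disk on every single call.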

> > involved you have to do a full lookup of the logical block and map it to
> > an on disk block). For reading while you are not modifying fs structures
> > you still need to read and parse them for each get_blocks call.
>
> I'm not following you. It seems to me that you are discussing the
> process of getting a buffer to write to prior to letting it age and
> float down to the device driver via the block layers. But we have

No, the buffer_head ONLY describes where the data is on disk, nothing
more. But the VFS needs to know where that data is on disk in order to be
able to read/write to the disk.

> disabled caching so we know that get_blocks will deliver a new
> temporary buffer or block or something or at any rate do what it
> should do ... anyway, I mumble.... what you are saying is that you
> think we look up a logical block number and get a physical block
> number and possibly a buffer associated with it. Well, maybe.

Yes (minus the buffer associated with it).

> So?

*sigh* Doing the conversion from a logical block number to a physical
block number is VERY VERY VERY slow!!!

On NTFS, the table that converts from one to the other is stored in the
inode structures on disk in a _compressed_ form, i.e. you have to
uncompress this metadata into temporary memory buffers in order to be able
to do the conversion. Normally we do this ONCE and then cache the result
in the in-memory inode structure.

But you want to not cache the inodes (i.e. the metadata), which means that
for every single 512 byte block being read or written (remember no matter
how big your write/read is, the VFS will break it up into 512 byte pieces
for NTFS), NTFS will have to lock the file system, read the inode from
disk (to do this it will have to go through a lot of reads in order to
find the inode on disk first!!! - read the post I referenced by URL above
for a description), decompress the array to allow conversion from logical
to physical block, then enter the correct values in the buffer_head
structure, throw away all the memory buffers we just allocated to store
the inode and the decompressed array, etc, then return the mapped
buffer_head structure to the VFS and the fs lock is dropped.

Then when the VFS wants the next 512 bytes, the whole above thing happens
again, from scratch! But the lock was dropped, and no caching, so we have
to repeat everything we just did, the identical stuff in fact, just to
return the next on disk block!
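
A back-of-envelope model of that repeated work. The per-step counts are assumed round numbers for illustration, not measurements: locating and reading the inode is charged 4 reads, decompressing the runlist 1 pass.

```c
#include <assert.h>

/* Toy cost model: metadata operations needed to move `bytes` of file
 * data. Cached: the inode work happens once per file. Uncached: it
 * repeats from scratch for every 512-byte block, exactly as described
 * above. INODE_READS and DECOMPRESS_OPS are assumptions. */

#define BLOCKSIZE      512
#define INODE_READS    4    /* assumed reads/seeks to locate+load the inode */
#define DECOMPRESS_OPS 1    /* assumed runlist decompression passes */

static long metadata_ops(long bytes, int cached)
{
    long blocks = bytes / BLOCKSIZE;
    if (cached)
        return INODE_READS + DECOMPRESS_OPS;           /* done once */
    return (INODE_READS + DECOMPRESS_OPS) * blocks;    /* done per block */
}
```

For a single 1 MiB transfer the cached case does 5 metadata operations; the uncached case does 10240, and the ratio only gets worse as files grow.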

Do you see now what I mean?

This is completely impractical. It would be so slow and would wear out the
hard drive sectors so much that your hard disks would break in a few
months/years, developing bad blocks in the most commonly accessed places
like the boot sector and the first inode describing where the remaining
inodes are. If you have an unpatched IBM Deskstar drive for example you
can kiss it goodbye within a few months for sure.

Your approach could actually be considered as maliciously destroying end
user hardware... It's a trojan! (-;

> > This in turn means that each call to get_blocks within the direct_IO code
> > paths will result in a full block lookup in the filesystem driver.
>
> > I explained in a previous post how incredibly expensive that is.
>
> Well, I simply don't follow you here. Can you unmuddle my understanding
> of what you are saying about logical/physical?

Has it become clearer now?

> > So even though you are streaming GiB of /data/ you will end up streaming
> > TiB of /metadata/ for each GiB of /data/. Is that so difficult to
> > understand?
>
> Yep. Can you be concrete about this metadata? I'll see if I can work
> it out .. what I think you must be saying is that when we write to
> a file, we write to a process address space, and that has to be
> translated into a physical block number on disk. Well, but I imagine
> that the translation was settled at the point that the inode was first
> obtained, and now that we have the inode, data that goes to it gets
> a physical address from the data in the inode. We can even be using an
> inode that is unconnected to the FS, and we will still be writing away
> to disk, and nobody else will be using that space, because the space
> is marked as occupied in the bit map.
>
> I see no continuous lookup of metadata.

Clear now?

> > Unless you allow the FS to cache the /metadata/ you have already lost all
>
> What metadata? I only see the inode, and that's not cached in any
> meaningful sense (i.e. not available to other things) but simply
> "held" in kmem storage and pointed to.

Oh it is cached. In NTFS we keep the inode in memory at all times as long
as the file is active (well it could be sent to swap or evicted from
memory under memory pressure but apart from that it is always cached). And
we also keep the metadata describing the mapping between logical and
physical blocks attached to the struct inode, too, until the struct inode
is disposed of from the inode cache.

> > your performance and you will never be able to stream at the speeds you
> > require.
>
> Can you be even MORE specific about what this metadata is? Maybe I'll
> get it if you are very very specific.

I hope I have been specific enough above. If not let me know...

> I think you are imagining implementation that are simply not the way
> I imagine them. Can you be specific about the exact metadata that is
> changed when a data write is done? That will help me decide.

Quite possible. I hope I have managed to explain things above but if not
let me know and I will try again. (-;

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-05 03:54:01

by Alexander Viro

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"



On Thu, 5 Sep 2002, Edgar Toernig wrote:

> "Peter T. Breuer" wrote:
> >
> > The point is not to choose a file system, but to be able to use
> > whichever one is preferable _at the time_. This is important.
>
> Heck, then take regular nfs. It works with any/most filesystems
> and does all you want. This discussion has become so silly...

Indeed.

BTW, speaking of NFS... Peter, you _do_ realize that stuff done in
generic layers must make sense for that animal, not only for ext2/FAT/etc.?
Now check how many things you were talking about do make sense for network
filesystems. That's right, none.

You want to modify filesystem code - go ahead and do it. You want
to make filesystem-independent modifications - VFS is the right place, but
they'd better _be_ filesystem-independent.

Making VFS aware of clustering (i.e. allowing filesystems to
notice when we are trying to grab parent for some operation) _DOES_ make
sense and at some point we will have to go through existing filesystems
that attempt something of that kind and see what kind of interface would
make sense. Wanking about grand half-arsed schemes and plugging holes
pointed out to you with duct-tape and lots of handwaving, OTOH...

2002-09-05 03:32:13

by Edgar Toernig

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Peter T. Breuer" wrote:
>
> The point is not to choose a file system, but to be able to use
> whichever one is preferable _at the time_. This is important.

Heck, then take regular nfs. It works with any/most filesystems
and does all you want. This discussion has become so silly...

Ciao, ET.

2002-09-05 08:25:10

by Helge Hafting

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Peter T. Breuer" wrote:

> > Sorry but this really shows your lack of understanding of how a file
> > system works. Every time you write a single byte(!!!) to a file, this
> > involves modifying fs structures. Even if you do writes in 1MiB chunks,
>
> OK .. in what way?
>
> > what happens is that all the writes are broken down into buffer head sized
> > portions for the purposes of mapping them to disk (this is optimized by
>
> Well, they'll be broken down, yes, quite probably.
>
> > the get_blocks interface but still it means that every time get_blocks is
>
> You mean that the buffer cache is looked up? But we know that we have
> disabled caching on this device ... well, carry on anyway (it does
> no harm to write to an already dirty buffer).
>
> > involved you have to do a full lookup of the logical block and map it to
> > an on disk block). For reading while you are not modifying fs structures
> > you still need to read and parse them for each get_blocks call.
>
> I'm not following you. It seems to me that you are discussing the
> process of getting a buffer to write to prior to letting it age and
> float down to the device driver via the block layers. But we have
> disabled caching so we know that get_blocks will deliver a new
> temporary buffer or block or something or at any rate do what it
> should do ... anyway, I mumble.... what you are saying is that you
> think we look up a logical block number and get a physical block
> number and possibly a buffer associated with it. Well, maybe. So?
>
Some very basic definitions that anyone working on a
disk-based filesystem should know:

Logical block number is the block's number in the file. Example:
"this is the 67th kilobyte of this file"

Physical block number is something completely different, it
is the block's location on disk. Example:
"this block is the 2676676th block on the disk drive, counting
from the start of the drive"

So, what happens when you read gigabytes of data
to/from your files? Something like this:

1.
Your app asks for the first 10M or so of the file. (You're
shuffling _big_ chunks around, possibly over the net, aren't you?)

2.
The fs however is using a smaller blocksize, such as 4k. So your
big request is broken down into a bunch of requests for the
first 4k block, the second 4k block and so on up to the
2560th 4k block. So far, everything happens fast no matter
what kind of fs, or even your no-cache scheme.

3.
The reason for this breakup is that the file might be fragmented
on disk. No one can know that until it is checked, so
each block is checked to see where it is. The kernel
tries to be smart and merges requests whenever the
logical blocks in the file happen to be consecutive
on the disk. They usually are, for performance reasons, but
not always. Each logical block requested is looked up
to see where it actually is on the disk, i.e. we're
looking up the physical positions.

4.
The list of physical positions (or physical blocks)
are passed to thbe disk drivers and the read
actually happens. Then the os lets your app
know that the read finished successfully and
is ready for whatever you want to do with it.
Then your app goes on to read another 10M or so,
repeat from (2) until the entire file is processed.
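
Step (3) can be sketched like this. The `phys_of()` layout (contiguous, with one fragmentation point after the 5th block) is made up for illustration, not a real filesystem:

```c
#include <assert.h>

/* Sketch of step (3): look up each logical block's physical position
 * and merge adjacent lookups into one disk request whenever the
 * physical blocks turn out to be consecutive on disk. */

/* Illustrative physical layout: blocks 0-4 are contiguous at 100+,
 * then the file is fragmented and continues at 900+. */
static long phys_of(long logical)
{
    return logical < 5 ? 100 + logical : 900 + logical;
}

/* Walk the logical blocks in order, starting a new disk request every
 * time the next physical block is not adjacent to the previous one;
 * return how many merged requests result. */
static int count_requests(long nblocks)
{
    int reqs = 0;
    long prev = -2;              /* sentinel: adjacent to nothing */

    for (long b = 0; b < nblocks; b++) {
        long p = phys_of(b);
        if (p != prev + 1)
            reqs++;              /* discontiguity: start a new request */
        prev = p;
    }
    return reqs;
}
```

Every iteration of that loop is one metadata lookup; merging only saves disk requests, not lookups, which is why the lookups themselves must be cheap (i.e. cached).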

Number (3) is where the no-cache operation breaks
down completely. We have to look up where on the _disk_
_every_ small block of the file exists. The information
about where the blocks are is called "metadata".
So we need to look at metadata for every little 4k block.
That isn't a problem usually, because the metadata is
small and is normally cached entirely, even for a large
file. So we can look up "where block 1 is on disk, where block 2
is on disk..." by looking at a little table in memory.

But wait - you don't allow caching, because some other node
may have messed with this file!
So we have to read metadata *again* for each little
block before we are able to read it, because you don't let
us cache it in order to look up the next block.

Now metadata is small, but the disk has to seek
to get it. Seeking is a very expensive operation; having
to seek for every block read will reduce performance to
perhaps 1% of streaming, or even less.
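
The arithmetic behind that estimate, with assumed numbers (some number of seeks per 4k block, ~10 ms per seek, counting only seek time):

```c
#include <assert.h>

/* Illustrative throughput model for the "perhaps 1%" claim above:
 * effective data rate when every 4 KiB block pays for its own
 * metadata seeks. seeks_per_block and seek_ms are assumptions. */
static double mb_per_sec(double seeks_per_block, double seek_ms)
{
    double block_mb = 4.0 / 1024.0;                       /* 4 KiB block */
    double secs_per_block = seeks_per_block * seek_ms / 1000.0;
    return block_mb / secs_per_block;
}
```

With 4 seeks per block at 10 ms each this comes to under 0.1 MB/s, i.e. well under 1% of even a modest 50 MB/s streaming rate.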

And where on disk is the metadata for this file?
That information is cached too, but you
disallow caching. Remember, the cache isn't just for
_file data_, it is for _metadata_ too. And you can't
say "let's cache metadata only", because then
you need cache coherency anyway.

And as Anton Altaparmakov pointed out, finding the metadata
for a file with _nothing_ in cache requires several
reads and seeks. You end up with 4 or more seeks and
small reads for every block of your big file.

With this scheme, you need thousands of pcs just to match
the performance of a _single_ pc with caching, because
the performance of _one_ of your nodes will be so low.



> > This in turn means that each call to get_blocks within the direct_IO code
> > paths will result in a full block lookup in the filesystem driver.
>
> Uh, I'm not sure I understand any of this. You are saying something
> about logical/physical that I don't follow or don't know. In

Take a course in fs design, or read some books at least. Discussion
is pointless when you don't know the basics - it is like discussing
advanced math with someone who is eager and full of ideas but
knows only the single-digit numbers.
You can probably come up with something interesting _when_ you
have learned how filesystems actually work - in detail.
If you don't bother you'll end up a "failed visionary".

> direct IO, one of the things that I believe happens is that writes are
> serialized, in order to maintain semantics when we go RWRW (we don't
> want it to work like WRWR or other), so that overlaps cannot happen.
> We should never be in the situation of having a dirty (or
> cached) buffer that we rewrite before it goes to disk (again).
>
> Is that relevant?
Both irrelevant and wrong. Direct io isn't serialized, but
it doesn't matter.


> > I explained in a previous post how incredibly expensive that is.
>
> Well, I simply don't follow you here. Can you unmuddle my understanding
> of what you are saying about logical/physical?
>
> > So even though you are streaming GiB of /data/ you will end up streaming
> > TiB of /metadata/ for each GiB of /data/. Is that so difficult to
> > understand?
>
> Yep. Can you be concrete about this metadata? I'll see if I can work
> it out ..
Explained above. Metadata is the information about where on disk
the file blocks are. And the filename. And the directory
structure. The info about how big the file is and who
is allowed to read/write it. And all of it is stored on
the disk. The inode is metadata, and contains the disk
addresses for other metadata for that file. But
where is the inode located? That information is
in the directory entry, looked up from the filename.

> what I think you must be saying is that when we write to
> a file, we write to a process address space, and that has to be
> translated into a physical block number on disk.
Sort of correct if you're writing to a mmapped file.

> Well, but I imagine
> that the translation was settled at the point that the inode was first
> obtained,
Plain wrong. _Each block_ needs this translation, because the
file might be fragmented. Oh, and don't forget that _you_
aren't allowed to keep an inode around - unless you
do cache coherency for inodes. Another node
in your cluster might delete the file after all,
destroying the inode! So _you_ have to re-read the inode
for every little block read or written. We don't; we
keep the inode in the icache. But you say "no cache,
cache coherency is too hard!"

> and now that we have the inode, data that goes to it gets
> a physical address from the data in the inode. We can even be using an
> inode that is unconnected to the FS, and we will still be writing away

What is an inode "unconnected to the FS"? Please explain.
An inode is a part of fs metadata - by definition!

> to disk, and nobody else will be using that space, because the space
> is marked as occupied in the bit map.
How do you know, without a cache-coherent bitmap? A normal fs knows,
because the bitmap it looks up is cached. You have
to re-read it every time. And where on disk is the bitmap?
That too is something you have to check for every block read.
After all, someone else may have created a new file
or even deleted yours.

> I see no continuous lookup of metadata.
The inode and bitmap you speak of are both "metadata". Anything that
isn't your file but describes things like where the file blocks
are, which blocks are free, or how the fs is organized is metadata.

It is all on disk, it is normally cached, and it all requires
cache coherency in a cluster.

This is why I suggested a new fs: you may then organize the
metadata such that different nodes _can_ cache metadata
for _different_ files over time. Or even for different parts
of the same file.
>
> > Unless you allow the FS to cache the /metadata/ you have already lost all
>
> What metadata? I only see the inode, and that's not cached in any
> meaningful sense (i.e. not available to other things) but simply
> "held" in kmem storage and pointed to.
That _is_ caching. But what if another node deletes or changes
the file you're working on? That leaves you with an invalid
inode. Not a problem on smp, where all the processors access
the same memory. One cpu can change the inode and then
it is changed for the others as well. A cluster doesn't
work that way, because the others can't access the memory
with the inode directly.

>
> > your performance and you will never be able to stream at the speeds you
> > require.
>
> Can you be even MORE specific about what this metadata is? Maybe I'll
> get it if you are very very specific.
See above!
> > So far you are completely ignoring my comments. Is that because you see
> > they are true and cannot come up with a counter argument?
>
> Normal users should see no difference, because they won't turn on
> O_DIRDIRECT, or whatever, and it should make no difference to them
it is O_DIRECT
> that the dir cache /can/ be turned off. That should be a goal, anyway.
> Can it be done? I think so, at least if it's restricted to directory
> lookup in the first instance, but I would like your very concrete
> example of what cached metadata is changed when I do a data write.
>
What cached metadata is changed when you write?
First, the file size. Second, the access time. Third, the free block
bitmap, because you use up free blocks when writing.
Then there is the file's block layout, the information about where
on the disk your data blocks go. Every block must be written
_somewhere_ on disk; exactly where is recorded in the file's
metadata pointed to by the inode.
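
A toy sketch of those four updates. The field names, the 4k block size and the tiny bitmap are illustrative, not a real on-disk format:

```c
#include <assert.h>
#include <string.h>

/* Toy model of one appended data block. Every line marked 1-4 below is
 * metadata a no-cache node would have to push to (and later re-read
 * from) the disk instead of touching it in memory. */

#define MAXBLK 64

struct toy_inode {
    long size;             /* file size */
    long mtime;            /* access/modification time */
    long blocks[MAXBLK];   /* block layout: where each data block went */
    int  nblocks;
};

static unsigned char bitmap[MAXBLK];   /* free-block bitmap, 0 = free */

/* Append one 4k data block to the file; returns the disk block used,
 * or -1 if the disk is full. */
static int append_block(struct toy_inode *ino, long now)
{
    int b;
    for (b = 0; b < MAXBLK && bitmap[b]; b++)
        ;                                 /* find a free block */
    if (b == MAXBLK)
        return -1;
    bitmap[b] = 1;                        /* 1. free-block bitmap changes */
    ino->blocks[ino->nblocks++] = b;      /* 2. block layout changes */
    ino->size += 4096;                    /* 3. file size changes */
    ino->mtime = now;                     /* 4. timestamp changes */
    return b;
}
```

Note that even a pure append touches four distinct pieces of metadata, all of which another node could be reading concurrently.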

> > generic kernel cater for your specialised application which will be used
> > on a few systems on the planet when you would penalize 99.99999% of Linux
> > users with your solution.
>
> Where is the penalty?
Turning off the cache is a big penalty. It makes even _your own_
use slow. Remember, the cache isn't only about file blocks,
it is also about metadata that is consulted all the time.

And remember that the majority of linux users aren't running your
special app. They are running editors, compilers
and web servers, which all benefit greatly from caching. Turning
off caching for those uses _will_ make ordinary
operations take a thousand times as long - or worse.

Suggesting to remove caching is like suggesting to remove
the engines from cars - most car problems are with the
engine, and it would reduce pollution too. The suggestion may
seem smart to someone who has no idea what the engine
is for. To everybody else it seems crazy.

> > The only viable solution which can enter the generic kernel is to
> > implement what you suggest at the FS level, not the VFS/block layer
> > levels.
>
> Why? Having VFS ops available does not mean you are obliged to use
> them! And using them means swapping them in via a pointer indirection,
> not testing a flag always.
>
> > Of course if you intend to maintain your solution outside the kernel as a
> > patch of some description until the end of time then that's fine. It is
> > irrelevant what solution you choose and noone will complain as you are not
> > trying to force it onto anyone else.
>
> I think you are imagining implementation that are simply not the way
> I imagine them. Can you be specific about the exact metadata that is
> changed when a data write is done? That will help me decide.

Again, see above. And it isn't merely about what metadata _changes_.
It is also about what metadata needs to be _looked at_. You
can't even _read_ a single block from a file without looking up
where on disk that block is. You can't try to design
filesystems while overlooking elementary stuff like that.
And people are probably tired of explaining simple stuff. Read
a good book on fs design, or write code and impress us.

Helge Hafting

2002-09-05 08:46:50

by David Lang

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

another problem with trying to do special things with the metadata is that
the lower layers don't have any idea what is a metadata block on disk and
what is a user data block on disk; the only thing that can tell the
difference is the filesystem that knows what it is using the block for
(and depending on how the filesystem was created, blocks on disk could have
different uses - just think about the simple option of changing the number
of inodes allocated: blocks that are inodes in one case are data in the
other)

if making things distributed were as simple as disabling caching, then SMP
programming would be simple: disable your caches and let things go,
right???

it doesn't work that easily, you need to have cooperation between
processors.

version one is to make a single lock that you grab before you do any
'special' work to prevent processors clobbering each others memory, in
linux this was the Big Kernel Lock.

it worked (sort of) but barely scaled to two CPUs

making things scale better than that requires careful analysis of the
locking that is needed and careful implementation of the locks; this is
where the linux kernel is now.

take the example above and replace processors with systems and memory with
disk blocks and you have the problem of distributed filesystems.

just like you can't take a single threaded process, wave a magic wand
(upgrading the underlying OS) and make it use multiple processors, you
can't just take a filesystem, upgrade the VFS/block layer and have it
share the same drive from multiple systems; in both cases it takes drastic
modifications that are frequently as bad or worse than writing a new
version from scratch, and in addition the new version is usually less
efficient if run in a single CPU/system environment.

David Lang

On Thu, 5 Sep 2002, Helge Hafting
wrote:

> Date: Thu, 05 Sep 2002 10:30:11 +0200
> From: Helge Hafting <[email protected]aitel.hist.no>
> To: [email protected], [email protected]
> Subject: Re: (fwd) Re: [RFC] mount flag "direct"
>
> "Peter T. Breuer" wrote:
>
> > > Sorry but this really shows you lack of understanding for how a file
> > > system works. Every time you write a single byte(!!!) to a file, this
> > > involves modifying fs structures. Even if you do writes in 1MiB chunks,
> >
> > OK .. in what way?
> >
> > > what happens is that all the writes are broken down into buffer head sized
> > > portions for the purposes of mapping them to disk (this is optimized by
> >
> > Well, they'll be broken down, yes, quite probably.
> >
> > > the get_blocks interface but still it means that every time get_blocks is
> >
> > You mean that the buffer cache is looked up? But we know that we have
> > disabled caching on this device ... well, carry on anyway (it does
> > no harm to write to an already dirty buffer).
> >
> > > involved you have to do a full lookup of the logical block and map it to
> > > an on disk block). For reading while you are not modifying fs structures
> > > you still need to read and parse them for each get_blocks call.
> >
> > I'm not following you. It seems to me that you are discussing the
> > process of getting a buffer to write to prior to letting it age and
> > float down to the device driver via the block layers. But we have
> > disabled caching so we know that get_blocks will deliver a new
> > temporary buffer or block or something or at any rate do what it
> > should do ... anyway, I mumble.... what you are saying is that you
> > think we look up a logical block number and get a physical block
> > number and possibly a buffer associated with it. Well, maybe. So?
> >
> Some very basic definitions that anyone working on a
> disk-based filesystem should know:
>
> Logical block number is the block's number in the file. Example:
> "this is the 67th kilobyte of this file"
>
> Physical block number is something completely different, it
> is the block's location on disk. Example:
> "this block is the 2676676th block on the disk drive, counting
> from the start of the drive"
>
> So, what happens when you move gigabytes of data
> to/from your files? Something like this:
>
> 1.
> Your app asks for the first 10M or so of the file. (You're
> shuffling _big_ chunks around, possibly over the net, aren't you?)
>
> 2.
> The fs however is using a smaller blocksize, such as 4k. So your
> big request is broken down into a bunch of requests for the
> first 4k block, the second 4k block and so on up to the
> 2560th 4k block. So far, everything happens fast no matter
> what kind of fs, or even your no-cache scheme.
>
> 3.
> The reason for this breakup is that the file might be fragmented
> on disk. No one can know that until it is checked, so
> each block is checked to see where it is. The kernel
> tries to be smart and merges requests whenever the
> logical blocks in the file happen to be consecutive
> on the disk. They usually are, for performance reasons, but
> not always. Each logical block requested is looked up
> to see where it actually is on the disk, i.e. we're
> looking up the physical positions.
>
> 4.
> The list of physical positions (or physical blocks)
> is passed to the disk drivers and the read
> actually happens. Then the os lets your app
> know that the read finished successfully and
> is ready for whatever you want to do with it.
> Then your app goes on to read another 10M or so,
> repeat from (2) until the entire file is processed.
>
> Number (3) is where the no-cache operation breaks
> down completely. We have to look up where on the _disk_
> _every_ small block of the file exists. The information
> about where the blocks are is called "metadata".
> So we need to look at metadata for every little 4k block.
> That isn't a problem usually, because the metadata is
> small and is normally cached entirely, even for a large
> file. So we can look up "where block 1 is on disk, where block 2
> is on disk..." by looking at a little table in memory.
>
> But wait - you don't allow caching, because some other node
> may have messed with this file!
> So we have to read metadata *again* for each little
> block before we are able to read it, for you don't let
> us cache that in order to look up the next block.
>
> Now metadata is small, but the disk has to seek
> to get it. Seeking is a very expensive operation; having
> to seek for every block read will reduce performance to
> perhaps 1% of streaming, or even less.
>
> And where on disk is the metadata for this file?
> That information is cached too, but you
> disallow caching. Remember, the cache isn't just for
> _file data_, it is for _metadata_ too. And you can't
> say "let's cache metadata only" because then
> you need cache coherency anyway.
>
> And as Anton Altaparmakov pointed out, finding the metadata
> for a file with _nothing_ in cache requires several
> reads and seeks. You end up with 4 or more seeks and
> small reads for every block of your big file.
>
> With this scheme, you need thousands of pcs just to match
> the performance of a _single_ pc with caching, because
> the performance of _one_ of your nodes will be so low.
>
>
>
> > > This in turn means that each call to get_blocks within the direct_IO code
> > > paths will result in a full block lookup in the filesystem driver.
> >
> > Uh, I'm not sure I understand any of this. You are saying something
> > about logical/physical that I don't follow or don't know. In
>
> Take a course in fs design, or read some books at least. Discussion
> is pointless when you don't know the basics - it is like discussing
> advanced math with someone who is eager and full of ideas but
> knows the single-digit numbers only.
> You can probably come up with something interesting _when_ you
> have learned how filesystems actually work - in detail.
> If you don't bother you'll end up a "failed visionary".
>
> > direct IO, one of the things that I believe happens is that writes are
> > serialized, in order to maintain semantics when we go RWRW (we don't
> > want it to work like WRWR or other), so that overlaps cannot happen.
> > We should never be in the situation of having a dirty (or
> > cached) buffer that we rewrite before it goes to disk (again).
> >
> > Is that relevant?
> Both irrelevant and wrong. Direct io isn't serialized, but
> it doesn't matter.
>
>
> > > I explained in a previous post how incredibly expensive that is.
> >
> > Well, I simply don't follow you here. Can you unmuddle my understanding
> > of what you are saying about logical/physical?
> >
> > > So even though you are streaming GiB of /data/ you will end up streaming
> > > TiB of /metadata/ for each GiB of /data/. Is that so difficult to
> > > understand?
> >
> > Yep. Can you be concrete about this metadata? I'lll see if I can work
> > it out ..
> Explained above. Metadata is the information about where on disk
> the file blocks are. And the filename. And the directory
> structure. The info about how big the file is and who
> is allowed to read/write it. And all of it is stored on
> the disk. The inode is metadata, and contains the disk
> addresses for other metadata for that file. But
> where is the inode located? That information is
> in the directory entry, looked up from the filename.
>
> > what I think you must be saying is that when we write to
> > a file, we write to a process address space, and that has to be
> > translated into a physical block number on disk.
> Sort of correct if you're writing to a mmapped file.
>
> > Well, but I imagine
> > that the translation was settled at the point that the inode was first
> > obtained,
> Plain wrong. _Each block_ needs this translation, because the
> file might be fragmented. Oh, and don't forget that _you_
> aren't allowed to keep an inode around - unless you
> do cache coherency for inodes. Another node
> in your cluster might delete the file after all,
> destroying the inode! So _you_ have to re-read the inode
> for every little block read or written. We don't, we
> keep the inode in the icache. But you say "no cache,
> cache coherency is too hard!"
>
> > and now that we have the inode, data that goes to it gets
> > a physical address from the data in the inode. We can even be using an
> > inode that is unconnected to the FS, and we will still be writing away
>
> What is an inode "unconnected to the FS"? Please explain.
> An inode is a part of fs metadata - per definition!
>
> > to disk, and nobody else will be using that space, because the space
> > is marked as occupied in the bit map.
> How do you know, without a cache-coherent bitmap? A normal fs knows,
> because the bitmap it looks up is cached. You have
> to re-read it every time. And where on disk is the bitmap?
> That too is something you have to check for every block read.
> After all, someone else may have created a new file
> or even deleted yours.
>
> > I see no continuous lookup of metadata.
> The inode and bitmap you speak of are both "metadata". Anything that
> isn't your file but describes things like where the file blocks
> are, which blocks are free, or how the fs is organized is metadata.
>
> It is all on disk, it is normally cached, and it all requires
> cache coherency in a cluster.
>
> This is why I suggested a new fs, you may then organize the
> metadata such that different nodes _can_ cache metadata
> for _different_ files over time. Or even for different parts
> of the same file.
> >
> > > Unless you allow the FS to cache the /metadata/ you have already lost all
> >
> > What metadata? I only see the inode, and that's not cached in any
> > meaningful sense (i.e. not available to other things) but simply
> > "held" in kmem storage and pointed to.
> That _is_ caching. But what if another node deletes or changes
> the file you're working on? That leaves you with an invalid
> inode. Not a problem on smp, where all the processors access
> the same memory. One cpu can change the inode and then
> it is changed for the others as well. A cluster doesn't
> work that way because the others can't access the memory
> with the inode directly.
>
> >
> > > your performance and you will never be able to stream at the speads you
> > > require.
> >
> > Can you be even MORE specific about what this metadata is? Maybe I'll
> > get it if you are very very specific.
> See above!
> > > So far you are completely ignoring my comments. Is that because you see
> > > they are true and cannot come up with a counter argument?
> >
> > Normal users should see no difference, because they won't turn on
> > O_DIRDIRECT, or whatever, and it should make no difference to them
> it is O_DIRECT
> > that the dir cache /can/ be turned off. That should be a goal, anyway.
> > Can it be done? I think so, at least if it's restricted to directory
> > lookup in the first instance, but I would like your very concrete
> > example of what cached metadata is changed when I do a data write.
> >
> What cached metadata is changed when you write?
> First, the file size. Second, the access time. Third, the free-block
> bitmap, because you use up free blocks when writing.
> Then there is the file's block layout, the information about where
> on the disk your data blocks go. Every block must be written
> _somewhere_ on disk, exactly where is recorded in the file's
> metadata pointed to by the inode.
>
> > > generic kernel cater for your specialised application which will be used
> > > on a few systems on the planet when you would penalize 99.99999% of Linux
> > > users with your solution.
> >
> > Where is the penalty?
> Turning off the cache is a big penalty. It makes even _your own_
> use slow. Remember, the cache isn't only about file blocks,
> it is also about metadata that is consulted all the time.
>
> And remember that the majority of linux users aren't running your
> special app. They are running editors, compilers
> and web servers, which all benefit greatly from caching. Turning
> off caching for those uses _will_ make ordinary
> operations take a thousand times as long - or worse.
>
> Suggesting to remove caching is like suggesting to remove
> the engines from cars - most car problems are with the
> engine, and it would reduce pollution too. The suggestion may
> seem smart to someone who has no idea what the engine
> is for. For everybody else it seems crazy.
>
> > > The only viable solution which can enter the generic kernel is to
> > > implement what you suggest at the FS level, not the VFS/block layer
> > > levels.
> >
> > Why? Having VFS ops available does not mean you are obliged to use
> > them! And using them means swapping them in via a pointer indirection,
> > not always testing a flag.
> >
> > > Of course if you intend to maintain your solution outside the kernel as a
> > > patch of some description until the end of time then that's fine. It is
> > > irrelevant what solution you choose and noone will complain as you are not
> > > trying to force it onto anyone else.
> >
> > I think you are imagining implementations that are simply not the way
> > I imagine them. Can you be specific about the exact metadata that is
> > changed when a data write is done? That will help me decide.
>
> Again, see above. And it isn't merely about what metadata _changes_.
> It is also about what metadata needs to be _looked at_. You
> can't even _read_ a single block from a file without looking up
> where on disk that block is. You can't try to design
> filesystems while overlooking elementary stuff like that.
> And people are probably tired of explaining simple stuff. Read
> a good book on fs design, or write code and impress us.
>
> Helge Hafting
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2002-09-05 14:20:33

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

Hi .. I'm currently in an internet cafe in Nice, France, watching the
rain come down, so forgive me if I don't do you justice. I find this
reply excellent for my purposes. Thank you.

"Helge Hafting wrote:"
> "Peter T. Breuer" wrote:
> > I'm not following you. It seems to me that you are discussing the
> > process of getting a buffer to write to prior to letting it age and
> > float down to the device driver via the block layers. But we have
...

> Some very basic definitions that anyone working on a
> disk-based filesystem should know:

Thanks.

> Logical block number is the block's number in the file. Example:
> "this is the 67th kilobyte of this file"

Aha. OK.

> Physical block number is something completely different, it
> is the block's location on disk. Example:

Yes.

> 1.
> Your app asks for the first 10M or so of the file. (You're
> shuffling _big_ chunks around, possibly over the net, aren't you?)

Yes.

> 2.
> The fs however is using a smaller blocksize, such as 4k. So your
> big request is broken down into a bunch of requests for the
> first 4k block, the second 4k block and so on up to the
> 2560th 4k block. So far, everything happens fast no matter
> what kind of fs, or even your no-cache scheme.

Fine, but where is the log/phys translation done? I presume that the
actual inode contains sufficient info to do the translation, because
the inode has a physical location on disk, and it is also associated
with a file, and what we do is generally start from the inode and trace
down to where the inode says the logical block should be, and then look
it up. During this time the inode location on disk must be locked
(with a read lock). I can do that. If you let me have "tag
requests" in the block layers and let me generate them in the VFS
layers. Yes, I agree, I have to know where the inode is on disk
in order to generate the block request, but the FS will know,
and I just want it to tell VFS .. well, too much detail.
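The logical-to-physical translation under discussion can be sketched with an invented flat block map standing in for the inode's block pointers (a real fs keeps this in the inode plus indirect blocks; the array here is only for illustration):

```c
#define FS_BLOCK 4096L

/* Hypothetical per-file block map: entry i holds the physical (on-disk)
 * block number of logical block i of the file.  Note the file is
 * fragmented: logical blocks 3..4 and 5 live far from 0..2. */
static const long block_map[] = { 900, 901, 902, 4417, 4418, 77 };

/* Translate an offset within the file to a byte position on the disk:
 * this is the lookup that must happen for every block when nothing
 * is cached. */
long file_offset_to_disk(long offset)
{
    long logical  = offset / FS_BLOCK;     /* which block of the file */
    long physical = block_map[logical];    /* where that block is on disk */
    return physical * FS_BLOCK + offset % FS_BLOCK;
}
```

Reading the whole 6-block file means six such lookups; the question in the thread is whether each one also costs a disk read of the metadata that backs `block_map`.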

> 3.
> The reason for this breakup is that the file might be fragmented
> on disk. No one can know that until it is checked, so

Yes, indeed. I agree.

> each block is checked to see where it is. The kernel
> tries to be smart and merges requests whenever the
> logical blocks in the file happens to be consecutive

That's fine. This is the elevator alg. working in the block layers.
There's no problem there. What I want to know is more about the
lookup procedure for the inode, but already this much info is enough
to confirm the picture I have. One needs to lock the inode position
on disk.
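The merging of physically consecutive blocks that the elevator performs can be sketched like this (the layout array is invented, as in the hypothetical block map above):

```c
/* Invented physical layout of six consecutive logical blocks:
 * three contiguous, then two contiguous, then one stray. */
static const long phys[] = { 900, 901, 902, 4417, 4418, 77 };

/* Count how many separate disk requests remain after coalescing runs of
 * physically contiguous blocks -- the merging the block layer does
 * before requests reach the driver. */
int merged_requests(const long *map, int nblocks)
{
    int requests = nblocks ? 1 : 0;
    for (int i = 1; i < nblocks; i++)
        if (map[i] != map[i - 1] + 1)
            requests++;          /* a discontiguity starts a new request */
    return requests;
}
```

For this layout the 6-block read becomes 3 disk requests; a fully contiguous file would become 1, which is why fragmentation matters so much to the per-block lookup cost.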

> 4.
> The list of physical positions (or physical blocks)
> is passed to the disk drivers and the read

> Number (3) is where the no-cache operation breaks
> down completely. We have to look up where on the _disk_
> _every_ small block of the file exists. The information

Yes. But this is not a "complete breakdown" but a requirement to
(a) lock bits of disk not associated with data while we look at them,
and (b) relook again next time.

Frankly, a single FS lock works at this point. That is, a single
location on disk (say the sb) that can be locked by some layer.
It can be finer, but there's no need to. The problem is avoiding
caching the metadata lookup that results, so that next time we look it
up again.


> about where the blocks are is called "metadata".
> So we need to look at metadata for every little 4k block.

No .. I don't see that. Not every block has some unique metadata
associated with it, surely? I thought that normally inodes
were "direct", that is, pointing at contiguous lumps? Sure,
sometimes some other lookups might be required, but often? No.

Especially if you imagine that 99.99% of the ops on the file system will
be rewriting or rereading files. Just a "metadata updated" flag on the
FS might be useful to avoid looking everything up again every time,
but I really would like to see how much overhead there IS first.
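A back-of-the-envelope estimate of that overhead, using rough 2002-era disk figures (the numbers are assumptions for illustration, not measurements):

```c
#define SEEK_MS     10.0    /* assumed average seek + rotational delay */
#define STREAM_MBS  40.0    /* assumed sequential transfer rate, MB/s */
#define BLOCK_B     4096.0

/* Effective throughput if every 4k block read costs one metadata seek,
 * as in the worst case Helge describes. */
double seek_bound_mbs(void)
{
    double transfer_ms = BLOCK_B / (STREAM_MBS * 1e6) * 1e3;  /* ~0.1 ms */
    double block_ms    = SEEK_MS + transfer_ms;               /* ~10.1 ms */
    return (BLOCK_B / 1e6) / (block_ms / 1e3);                /* MB/s */
}
```

That comes out around 0.4 MB/s against a 40 MB/s streaming rate, i.e. roughly the 1% figure quoted earlier in the thread; whether the extra lookups really cost a seek each is exactly what needs measuring.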

> That isn't a problem usually, because the metadata is
> small and is normally cached entirely, even for a large
> file. So we can look up "where block 1 is on disk, where block 2
> is on disk..." by looking at a little table in memory.

Well, or on disk :-)

> But wait - you don't allow caching, because some other node

Thanks, but I know. That's right - we have to look up the inode again.
That's precisely what I want to do.

> may have messed with this file!

May have, but probably won't. To have messed with it it must
have messed with the inode, and we can ask the driver if anyone
but us has written to that spot (the inode), and if not, not
reread it. That's just an idea, but really, I would prefer to reread
it. Data reads and writes will generally be in lumps the order of
0.25M. An extra 4K on that is latency, not appreciable extra data.
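Whether the extra 4K reread really is just latency can be put in numbers (same assumed disk figures as elsewhere in the thread; a sketch, not a measurement):

```c
#define SEEK_MS    10.0     /* assumed seek + rotational delay, ms */
#define RATE_KBMS  40.0     /* assumed 40 MB/s, i.e. 40 KB per ms */

/* Fraction of total time spent on one extra 4k inode reread per 256k
 * data transfer.  If the server holds the inode block in memory (the
 * argument above), the reread costs no disk seek; if it goes to the
 * platter, it costs a full seek. */
double reread_overhead(int inode_cached_on_server)
{
    double data_ms  = SEEK_MS + 256.0 / RATE_KBMS;            /* ~16.4 ms */
    double inode_ms = 4.0 / RATE_KBMS
                    + (inode_cached_on_server ? 0.0 : SEEK_MS);
    return inode_ms / (data_ms + inode_ms);
}
```

With the inode served from the server's memory the overhead is well under 1%, which supports the "latency, not data" claim; if the reread forced a real disk seek it would be more like a third of the total time, which is Helge's point.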

> So we have to read metadata *again* for each little
> block before we are able to read it, for you don't let

No, not each little block. I disagree totally. Remember that
we get the lock on the inode area at the start of the operation. Nobody
can mess with it while we do the sequence of small reads or writes
(which will be merged into large reads or writes). That was the
whole point of my request for extra "tag requests" in the block
layer. I want to be able to request a region lock, by whatever
mechanism.

> us cache that in order to look up the next block.

At this point the argument has gone wrong, so I can't follow ...

> Now metadata is small, but the disk has to seek
> to get it. Seeking is a very expensive operation; having
> to seek for every block read will reduce performance to
> perhaps 1% of streaming, or even less.

Well, you forget that we had to seek there anyway, and that the
_server_ kernel has cached the result (I am only saying that the
_client_ kernels need O_*DIRECT), and that therefore there is
no ondisk seek on the real device, just a "memory seek" in the
guardian of the real device.

> And where on disk is the metadata for this file?
> That information is cached too, but you
> disallow caching. Remember, the cache isn't just for

You are confusing yourself with syntax instead of semantics, I think ..
I can't relate this account to anything that really happens right now.
The words "no caching" only apply to the kernel for which the
real device is remote. On the kernel for which it is local, of course
there is block-level caching (though not FS level caching, I hope!).

> And as Anton Altaparmakov pointed out, finding the metadata
> for a file with _nothing_ in cache requires several
> reads and seeks. You end up with 4 or more seeks and

This is all way off beam .. sorry. I'd like to know why people
imagine in this way .. it doesn't correspond to the real state
of play.

> small reads for every block of your big file.
>
> With this scheme, you need thousands of pcs just to match

No.

> the performance of a _single_ pc with caching, because
> the performance of _one_ of your nodes will be so low.

> > Uh, I'm not sure I understand any of this. You are saying something
> > about logical/physical that I don't follow or don't know. In
>
> Take a course in fs design, or read some books at least. Discussion

Well, tell me what you mean by "logical". Telling me you mean "offset
within file" is enough explanation for me! I really wish you would
stop patronising - you are confusing unfamiliarity with YOUR jargon
with unfamiliarity with the logic and the science involved. Try again.

> is pointless when you don't know the basics - it is like discussing
> advanced math with someone who is eager and full of ideas but
> knows the single-digit numbers only.

No, it isn't. You way overestimate yourself, I am afraid.

> You can probably come up with something interesting _when_ you
> have learned how filesystems actually work - in detail.

I don't need to. Sorry.

> If you don't bother you'll end up a "failed visionary".

I am not trying to be a visionary. I am an extremely practical person.
I want your input so I can DO something. When I've tried it, we can
speak further.

> > Is that relevant?
> Both irrelevant and wrong. Direct io isn't serialized, but

It appears to be. I have measured the serialization that takes
place on reads and writes through direct_io at the block level in
2.5.31, and as far as I can tell, a new write request does not go
out before the "previous" one has come back in. Shrug.

> it doesn't matter.
>
Peter

2002-09-05 15:57:20

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] intent-based lookup (was: mount flag "direct")

On Sep 04, 2002 23:58 -0400, Alexander Viro wrote:
> Making VFS aware of clustering (i.e. allowing filesystems to
> notice when we are trying to grab parent for some operation) _DOES_ make
> sense and at some point we will have to go through existing filesystems
> that attempt something of that kind and see what kind of interface would
> make sense. Wanking about grand half-arsed schemes and plugging holes
> pointed to you with duct-tape and lots of handwaving, OTOH...

Al, have you looked at Peter's intent-based lookup code yet? Trond
also expressed interest in something like this for NFS.

It essentially passes a bitflag with the intent of the lookup
(e.g. create, mkdir, getattr, etc.) so that you can optionally complete
the lookup and do the operation in one go. The VFS patch to add
intent-based lookup (not for inclusion, just discussion) is below.

Cheers, Andreas
============================
--- lum-pristine/include/linux/dcache.h Thu Nov 22 14:46:18 2001
+++ lum/include/linux/dcache.h Mon Aug 12 00:02:29 2002
@@ -6,6 +6,33 @@
#include <asm/atomic.h>
#include <linux/mount.h>

+#define IT_OPEN (1)
+#define IT_CREAT (1<<1)
+#define IT_MKDIR (1<<2)
+#define IT_LINK (1<<3)
+#define IT_SYMLINK (1<<4)
+#define IT_UNLINK (1<<5)
+#define IT_RMDIR (1<<6)
+#define IT_RENAME (1<<7)
+#define IT_RENAME2 (1<<8)
+#define IT_READDIR (1<<9)
+#define IT_GETATTR (1<<10)
+#define IT_SETATTR (1<<11)
+#define IT_READLINK (1<<12)
+#define IT_MKNOD (1<<13)
+#define IT_LOOKUP (1<<14)
+
+struct lookup_intent {
+ int it_op;
+ int it_mode;
+ int it_disposition;
+ int it_status;
+ struct iattr *it_iattr;
+ __u64 it_lock_handle[2];
+ int it_lock_mode;
+ void *it_data;
+};
+
/*
* linux/include/linux/dcache.h
*
@@ -78,6 +106,7 @@
unsigned long d_time; /* used by d_revalidate */
struct dentry_operations *d_op;
struct super_block * d_sb; /* The root of the dentry tree */
+ struct lookup_intent *d_it;
unsigned long d_vfs_flags;
void * d_fsdata; /* fs-specific data */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
@@ -91,6 +119,8 @@
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+ int (*d_revalidate2)(struct dentry *, int, struct lookup_intent *);
+ void (*d_intent_release)(struct dentry *);
};

/* the dentry parameter passed to d_hash and d_compare is the parent
--- lum-pristine/include/linux/fs.h Mon Aug 12 11:02:53 2002
+++ lum/include/linux/fs.h Mon Aug 12 11:48:38 2002
@@ -536,6 +536,7 @@

/* needed for tty driver, and maybe others */
void *private_data;
+ struct lookup_intent *f_intent;

/* preallocated helper kiobuf to speedup O_DIRECT */
struct kiobuf *f_iobuf;
@@ -779,7 +780,9 @@
extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
-extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir, struct dentry *new_dentry,
+ struct lookup_intent *it);

/*
* File types
@@ -840,6 +843,7 @@
struct inode_operations {
int (*create) (struct inode *,struct dentry *,int);
struct dentry * (*lookup) (struct inode *,struct dentry *);
+ struct dentry * (*lookup2) (struct inode *,struct dentry *, struct lookup_intent *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
@@ -986,7 +990,7 @@
extern struct vfsmount *kern_mount(struct file_system_type *);
extern int may_umount(struct vfsmount *);
extern long do_mount(char *, char *, char *, unsigned long, void *);
-
+struct vfsmount *do_kern_mount(char *type, int flags, char *name, void *data);
#define kern_umount mntput

extern int vfs_statfs(struct super_block *, struct statfs *);
@@ -1307,6 +1311,7 @@
extern loff_t default_llseek(struct file *file, loff_t offset, int origin);

extern int FASTCALL(__user_walk(const char *, unsigned, struct nameidata *));
+extern int FASTCALL(__user_walk_it(const char *, unsigned, struct nameidata *, struct lookup_intent *it));
extern int FASTCALL(path_init(const char *, unsigned, struct nameidata *));
extern int FASTCALL(path_walk(const char *, struct nameidata *));
extern int FASTCALL(link_path_walk(const char *, struct nameidata *));
@@ -1317,6 +1322,8 @@
extern struct dentry * lookup_hash(struct qstr *, struct dentry *);
#define user_path_walk(name,nd) __user_walk(name, LOOKUP_FOLLOW|LOOKUP_POSITIVE, nd)
#define user_path_walk_link(name,nd) __user_walk(name, LOOKUP_POSITIVE, nd)
+#define user_path_walk_it(name,nd,it) __user_walk_it(name, LOOKUP_FOLLOW|LOOKUP_POSITIVE, nd, it)
+#define user_path_walk_link_it(name,nd,it) __user_walk_it(name, LOOKUP_POSITIVE, nd, it)

extern void iput(struct inode *);
extern void force_delete(struct inode *);
--- lum-pristine/fs/dcache.c Mon Feb 25 14:38:08 2002
+++ lum/fs/dcache.c Thu Aug 1 18:07:35 2002
@@ -617,6 +617,7 @@
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
dentry->d_mounted = 0;
+ dentry->d_it = NULL;
INIT_LIST_HEAD(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
--- lum-pristine/fs/nfsd/vfs.c Fri Dec 21 12:41:55 2001
+++ lum/fs/nfsd/vfs.c Thu Aug 1 18:07:35 2002
@@ -1285,7 +1285,7 @@
err = nfserr_perm;
} else
#endif
- err = vfs_rename(fdir, odentry, tdir, ndentry);
+ err = vfs_rename(fdir, odentry, tdir, ndentry, NULL);
if (!err && EX_ISSYNC(tfhp->fh_export)) {
nfsd_sync_dir(tdentry);
nfsd_sync_dir(fdentry);
--- lum-pristine/fs/namei.c Mon Feb 25 14:38:09 2002
+++ lum/fs/namei.c Mon Aug 12 11:47:56 2002
@@ -94,6 +94,14 @@
* XEmacs seems to be relying on it...
*/

+void intent_release(struct dentry *de)
+{
+ if (de->d_op && de->d_op->d_intent_release)
+ de->d_op->d_intent_release(de);
+ de->d_it = NULL;
+}
+
+
/* In order to reduce some races, while at the same time doing additional
* checking and hopefully speeding things up, we copy filenames to the
* kernel data space before using them..
@@ -260,10 +268,19 @@
* Internal lookup() using the new generic dcache.
* SMP-safe
*/
-static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, int flags)
+static struct dentry *cached_lookup(struct dentry *parent, struct qstr *name,
+ int flags, struct lookup_intent *it)
{
struct dentry * dentry = d_lookup(parent, name);

+ if (dentry && dentry->d_op && dentry->d_op->d_revalidate2) {
+ if (!dentry->d_op->d_revalidate2(dentry, flags, it) &&
+ !d_invalidate(dentry)) {
+ dput(dentry);
+ dentry = NULL;
+ }
+ return dentry;
+ } else
if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
if (!dentry->d_op->d_revalidate(dentry, flags) && !d_invalidate(dentry)) {
dput(dentry);
@@ -281,7 +298,8 @@
* make sure that nobody added the entry to the dcache in the meantime..
* SMP-safe
*/
-static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, int flags)
+static struct dentry *real_lookup(struct dentry *parent, struct qstr *name,
+ int flags, struct lookup_intent *it)
{
struct dentry * result;
struct inode *dir = parent->d_inode;
@@ -300,6 +318,9 @@
result = ERR_PTR(-ENOMEM);
if (dentry) {
lock_kernel();
- result = dir->i_op->lookup(dir, dentry);
+ if (dir->i_op->lookup2)
+ result = dir->i_op->lookup2(dir, dentry, it);
+ else
+ result = dir->i_op->lookup(dir, dentry);
unlock_kernel();
if (result)
@@ -321,6 +342,12 @@
dput(result);
result = ERR_PTR(-ENOENT);
}
+ } else if (result->d_op && result->d_op->d_revalidate2) {
+ if (!result->d_op->d_revalidate2(result, flags, it) &&
+ !d_invalidate(result)) {
+ dput(result);
+ result = ERR_PTR(-ENOENT);
+ }
}
return result;
}
@@ -445,7 +472,8 @@
*
* We expect 'base' to be positive and a directory.
*/
-int link_path_walk(const char * name, struct nameidata *nd)
+int link_path_walk_it(const char *name, struct nameidata *nd,
+ struct lookup_intent *it)
{
struct dentry *dentry;
struct inode *inode;
@@ -518,9 +546,9 @@
break;
}
/* This does the actual lookups.. */
- dentry = cached_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
+ dentry = cached_lookup(nd->dentry, &this, LOOKUP_CONTINUE, NULL);
if (!dentry) {
- dentry = real_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
+ dentry = real_lookup(nd->dentry, &this, LOOKUP_CONTINUE, NULL);
err = PTR_ERR(dentry);
if (IS_ERR(dentry))
break;
@@ -554,7 +582,7 @@
nd->dentry = dentry;
}
err = -ENOTDIR;
- if (!inode->i_op->lookup)
+ if (!inode->i_op->lookup && !inode->i_op->lookup2)
break;
continue;
/* here ends the main loop */
@@ -581,9 +609,9 @@
if (err < 0)
break;
}
- dentry = cached_lookup(nd->dentry, &this, 0);
+ dentry = cached_lookup(nd->dentry, &this, 0, it);
if (!dentry) {
- dentry = real_lookup(nd->dentry, &this, 0);
+ dentry = real_lookup(nd->dentry, &this, 0, it);
err = PTR_ERR(dentry);
if (IS_ERR(dentry))
break;
@@ -607,7 +635,8 @@
goto no_inode;
if (lookup_flags & LOOKUP_DIRECTORY) {
err = -ENOTDIR;
- if (!inode->i_op || !inode->i_op->lookup)
+ if (!inode->i_op || (!inode->i_op->lookup &&
+ !inode->i_op->lookup2))
break;
}
goto return_base;
@@ -626,6 +655,7 @@
else if (this.len == 2 && this.name[1] == '.')
nd->last_type = LAST_DOTDOT;
return_base:
+ nd->dentry->d_it = it;
return 0;
out_dput:
dput(dentry);
@@ -633,15 +663,29 @@
}
path_release(nd);
return_err:
+ if (!err)
+ nd->dentry->d_it = it;
return err;
}

+int link_path_walk(const char * name, struct nameidata *nd)
+{
+ return link_path_walk_it(name, nd, NULL);
+}
+
+int path_walk_it(const char * name, struct nameidata *nd, struct lookup_intent *it)
+{
+ current->total_link_count = 0;
+ return link_path_walk_it(name, nd, it);
+}
+
int path_walk(const char * name, struct nameidata *nd)
{
current->total_link_count = 0;
- return link_path_walk(name, nd);
+ return link_path_walk_it(name, nd, NULL);
}

+
/* SMP-safe */
/* returns 1 if everything is done */
static int __emul_lookup_dentry(const char *name, struct nameidata *nd)
@@ -742,7 +786,8 @@
* needs parent already locked. Doesn't follow mounts.
* SMP-safe.
*/
-struct dentry * lookup_hash(struct qstr *name, struct dentry * base)
+struct dentry * lookup_hash_it(struct qstr *name, struct dentry * base,
+ struct lookup_intent *it)
{
struct dentry * dentry;
struct inode *inode;
@@ -765,13 +810,16 @@
goto out;
}

- dentry = cached_lookup(base, name, 0);
+ dentry = cached_lookup(base, name, 0, it);
if (!dentry) {
struct dentry *new = d_alloc(base, name);
dentry = ERR_PTR(-ENOMEM);
if (!new)
goto out;
lock_kernel();
+ if (inode->i_op->lookup2)
+ dentry = inode->i_op->lookup2(inode, new, it);
+ else
dentry = inode->i_op->lookup(inode, new);
unlock_kernel();
if (!dentry)
@@ -783,6 +831,12 @@
return dentry;
}

+struct dentry * lookup_hash(struct qstr *name, struct dentry * base)
+{
+ return lookup_hash_it(name, base, NULL);
+}
+
+
/* SMP-safe */
struct dentry * lookup_one_len(const char * name, struct dentry * base, int len)
{
@@ -804,7 +858,7 @@
}
this.hash = end_name_hash(hash);

- return lookup_hash(&this, base);
+ return lookup_hash_it(&this, base, NULL);
access:
return ERR_PTR(-EACCES);
}
@@ -836,6 +890,23 @@
return err;
}

+int __user_walk_it(const char *name, unsigned flags, struct nameidata *nd,
+ struct lookup_intent *it)
+{
+ char *tmp;
+ int err;
+
+ tmp = getname(name);
+ err = PTR_ERR(tmp);
+ if (!IS_ERR(tmp)) {
+ err = 0;
+ if (path_init(tmp, flags, nd))
+ err = path_walk_it(tmp, nd, it);
+ putname(tmp);
+ }
+ return err;
+}
+
/*
* It's inline, so penalty for filesystems that don't use sticky bit is
* minimal.
@@ -970,7 +1041,8 @@
* for symlinks (where the permissions are checked later).
* SMP-safe
*/
-int open_namei(const char * pathname, int flag, int mode, struct nameidata *nd)
+int open_namei_it(const char *pathname, int flag, int mode,
+ struct nameidata *nd, struct lookup_intent *it)
{
int acc_mode, error = 0;
struct inode *inode;
@@ -984,17 +1056,22 @@
* The simplest case - just a plain lookup.
*/
if (!(flag & O_CREAT)) {
if (path_init(pathname, lookup_flags(flag), nd))
- error = path_walk(pathname, nd);
+ error = path_walk_it(pathname, nd, it);
if (error)
return error;
dentry = nd->dentry;
+ dentry->d_it = it;
goto ok;
}

/*
* Create - we need to know the parent.
*/
+ if (it) {
+ it->it_mode = mode;
+ it->it_op |= IT_CREAT;
+ }
if (path_init(pathname, LOOKUP_PARENT, nd))
error = path_walk(pathname, nd);
if (error)
@@ -1011,7 +1089,7 @@

dir = nd->dentry;
down(&dir->d_inode->i_sem);
- dentry = lookup_hash(&nd->last, nd->dentry);
+ dentry = lookup_hash_it(&nd->last, nd->dentry, it);

do_last:
error = PTR_ERR(dentry);
@@ -1020,6 +1098,8 @@
goto exit;
}

+ dentry->d_it = it;
+ dentry->d_it->it_mode = mode;
/* Negative dentry, just create the file */
if (!dentry->d_inode) {
error = vfs_create(dir->d_inode, dentry,
@@ -1139,8 +1219,10 @@
return 0;

exit_dput:
+ intent_release(dentry);
dput(dentry);
exit:
+ intent_release(nd->dentry);
path_release(nd);
return error;

@@ -1160,6 +1242,8 @@
*/
UPDATE_ATIME(dentry->d_inode);
error = dentry->d_inode->i_op->follow_link(dentry, nd);
+ if (error)
+ intent_release(dentry);
dput(dentry);
if (error)
return error;
@@ -1181,13 +1265,20 @@
}
dir = nd->dentry;
down(&dir->d_inode->i_sem);
- dentry = lookup_hash(&nd->last, nd->dentry);
+ dentry = lookup_hash_it(&nd->last, nd->dentry, NULL);
putname(nd->last.name);
goto do_last;
}

+int open_namei(const char *pathname, int flag, int mode, struct nameidata *nd)
+{
+ return open_namei_it(pathname, flag, mode, nd, NULL);
+}
+
+
/* SMP-safe */
-static struct dentry *lookup_create(struct nameidata *nd, int is_dir)
+static struct dentry *lookup_create(struct nameidata *nd, int is_dir,
+ struct lookup_intent *it)
{
struct dentry *dentry;

@@ -1195,7 +1286,7 @@
dentry = ERR_PTR(-EEXIST);
if (nd->last_type != LAST_NORM)
goto fail;
- dentry = lookup_hash(&nd->last, nd->dentry);
+ dentry = lookup_hash_it(&nd->last, nd->dentry, it);
if (IS_ERR(dentry))
goto fail;
if (!is_dir && nd->last.name[nd->last.len] && !dentry->d_inode)
@@ -1241,6 +1332,7 @@
char * tmp;
struct dentry * dentry;
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_MKNOD, .it_mode = mode };

if (S_ISDIR(mode))
return -EPERM;
@@ -1252,11 +1344,12 @@
error = path_walk(tmp, &nd);
if (error)
goto out;
- dentry = lookup_create(&nd, 0);
+ dentry = lookup_create(&nd, 0, &it);
error = PTR_ERR(dentry);

mode &= ~current->fs->umask;
if (!IS_ERR(dentry)) {
+ dentry->d_it = &it;
switch (mode & S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(nd.dentry->d_inode,dentry,mode);
@@ -1270,6 +1363,7 @@
default:
error = -EINVAL;
}
+ intent_release(dentry);
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
@@ -1310,6 +1404,7 @@
{
int error = 0;
char * tmp;
+ struct lookup_intent it = { .it_op = IT_MKDIR, .it_mode = mode };

tmp = getname(pathname);
error = PTR_ERR(tmp);
@@ -1321,11 +1416,13 @@
error = path_walk(tmp, &nd);
if (error)
goto out;
- dentry = lookup_create(&nd, 1);
+ dentry = lookup_create(&nd, 1, &it);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
+ dentry->d_it = &it;
error = vfs_mkdir(nd.dentry->d_inode, dentry,
mode & ~current->fs->umask);
+ intent_release(dentry);
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
@@ -1407,6 +1504,7 @@
char * name;
struct dentry *dentry;
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_RMDIR };

name = getname(pathname);
if(IS_ERR(name))
@@ -1429,10 +1527,12 @@
goto exit1;
}
down(&nd.dentry->d_inode->i_sem);
- dentry = lookup_hash(&nd.last, nd.dentry);
+ dentry = lookup_hash_it(&nd.last, nd.dentry, &it);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
+ dentry->d_it = &it;
error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ intent_release(dentry);
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
@@ -1476,6 +1576,7 @@
char * name;
struct dentry *dentry;
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_UNLINK };

name = getname(pathname);
if(IS_ERR(name))
@@ -1489,14 +1590,16 @@
if (nd.last_type != LAST_NORM)
goto exit1;
down(&nd.dentry->d_inode->i_sem);
- dentry = lookup_hash(&nd.last, nd.dentry);
+ dentry = lookup_hash_it(&nd.last, nd.dentry, &it);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
+ dentry->d_it = &it;
/* Why not before? Because we want correct error value */
if (nd.last.name[nd.last.len])
goto slashes;
error = vfs_unlink(nd.dentry->d_inode, dentry);
exit2:
+ intent_release(dentry);
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
@@ -1543,6 +1646,7 @@
int error = 0;
char * from;
char * to;
+ struct lookup_intent it = { .it_op = IT_SYMLINK };

from = getname(oldname);
if(IS_ERR(from))
@@ -1557,10 +1661,13 @@
error = path_walk(to, &nd);
if (error)
goto out;
- dentry = lookup_create(&nd, 0);
+ it.it_data = from;
+ dentry = lookup_create(&nd, 0, &it);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
+ dentry->d_it = &it;
error = vfs_symlink(nd.dentry->d_inode, dentry, from);
+ intent_release(dentry);
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
@@ -1626,6 +1732,7 @@
int error;
char * from;
char * to;
+ struct lookup_intent it = { .it_op = IT_LINK };

from = getname(oldname);
if(IS_ERR(from))
@@ -1648,10 +1755,12 @@
error = -EXDEV;
if (old_nd.mnt != nd.mnt)
goto out_release;
- new_dentry = lookup_create(&nd, 0);
+ new_dentry = lookup_create(&nd, 0, &it);
error = PTR_ERR(new_dentry);
if (!IS_ERR(new_dentry)) {
+ new_dentry->d_it = &it;
error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+ intent_release(new_dentry);
dput(new_dentry);
}
up(&nd.dentry->d_inode->i_sem);
@@ -1694,7 +1803,8 @@
* locking].
*/
int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry)
+ struct inode *new_dir, struct dentry *new_dentry,
+ struct lookup_intent *it)
{
int error;
struct inode *target;
@@ -1748,12 +1858,14 @@
} else
double_down(&old_dir->i_zombie,
&new_dir->i_zombie);
+ new_dentry->d_it = it;
if (IS_DEADDIR(old_dir)||IS_DEADDIR(new_dir))
error = -ENOENT;
else if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
error = -EBUSY;
else
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
+ intent_release(new_dentry);
if (target) {
if (!error)
target->i_flags |= S_DEAD;
@@ -1775,7 +1887,8 @@
}

int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry)
+ struct inode *new_dir, struct dentry *new_dentry,
+ struct lookup_intent *it)
{
int error;

@@ -1802,10 +1915,12 @@
DQUOT_INIT(old_dir);
DQUOT_INIT(new_dir);
double_down(&old_dir->i_zombie, &new_dir->i_zombie);
+ new_dentry->d_it = it;
if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
error = -EBUSY;
else
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
+ intent_release(new_dentry);
double_up(&old_dir->i_zombie, &new_dir->i_zombie);
if (error)
return error;
@@ -1817,13 +1932,14 @@
}

int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry)
+ struct inode *new_dir, struct dentry *new_dentry,
+ struct lookup_intent *it)
{
int error;
if (S_ISDIR(old_dentry->d_inode->i_mode))
- error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
+ error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry,it);
else
- error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry);
+ error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry,it);
if (!error) {
if (old_dir == new_dir)
inode_dir_notify(old_dir, DN_RENAME);
@@ -1840,6 +1956,7 @@
int error = 0;
struct dentry * old_dir, * new_dir;
struct dentry * old_dentry, *new_dentry;
+ struct lookup_intent it = { .it_op = IT_RENAME };
struct nameidata oldnd, newnd;

if (path_init(oldname, LOOKUP_PARENT, &oldnd))
@@ -1868,7 +1985,7 @@

double_lock(new_dir, old_dir);

- old_dentry = lookup_hash(&oldnd.last, old_dir);
+ old_dentry = lookup_hash_it(&oldnd.last, old_dir, &it);
error = PTR_ERR(old_dentry);
if (IS_ERR(old_dentry))
goto exit3;
@@ -1884,18 +2003,21 @@
if (newnd.last.name[newnd.last.len])
goto exit4;
}
- new_dentry = lookup_hash(&newnd.last, new_dir);
+ it.it_op = IT_RENAME2;
+ new_dentry = lookup_hash_it(&newnd.last, new_dir, &it);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto exit4;

lock_kernel();
error = vfs_rename(old_dir->d_inode, old_dentry,
- new_dir->d_inode, new_dentry);
+ new_dir->d_inode, new_dentry, &it);
unlock_kernel();

+ intent_release(new_dentry);
dput(new_dentry);
exit4:
+ intent_release(old_dentry);
dput(old_dentry);
exit3:
double_up(&new_dir->d_inode->i_sem, &old_dir->d_inode->i_sem);
--- lum-pristine/fs/open.c Fri Oct 12 16:48:42 2001
+++ lum/fs/open.c Sun Aug 11 15:26:29 2002
@@ -19,6 +19,9 @@
#include <asm/uaccess.h>

#define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))
+extern int path_walk_it(const char *name, struct nameidata *nd,
+ struct lookup_intent *it);
+extern void intent_release(struct dentry *de);

int vfs_statfs(struct super_block *sb, struct statfs *buf)
{
@@ -94,14 +97,16 @@
struct nameidata nd;
struct inode * inode;
int error;
+ struct lookup_intent it = { .it_op = IT_SETATTR };

error = -EINVAL;
if (length < 0) /* sorry, but loff_t says... */
goto out;

- error = user_path_walk(path, &nd);
+ error = user_path_walk_it(path, &nd, &it);
if (error)
goto out;
+ nd.dentry->d_it = &it;
inode = nd.dentry->d_inode;

/* For directories it's -EISDIR, for other non-regulars - -EINVAL */
@@ -144,6 +149,7 @@
put_write_access(inode);

dput_and_out:
+ intent_release(nd.dentry);
path_release(&nd);
out:
return error;
@@ -235,10 +241,12 @@
struct nameidata nd;
struct inode * inode;
struct iattr newattrs;
+ struct lookup_intent it = { .it_op = IT_SETATTR };

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_it(filename, &nd, &it);
if (error)
goto out;
+ nd.dentry->d_it = &it;
inode = nd.dentry->d_inode;

error = -EROFS;
@@ -262,6 +270,7 @@
}
error = notify_change(nd.dentry, &newattrs);
dput_and_out:
+ intent_release(nd.dentry);
path_release(&nd);
out:
return error;
@@ -279,11 +288,13 @@
struct nameidata nd;
struct inode * inode;
struct iattr newattrs;
+ struct lookup_intent it = { .it_op = IT_SETATTR };

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_it(filename, &nd, &it);

if (error)
goto out;
+ nd.dentry->d_it = &it;
inode = nd.dentry->d_inode;

error = -EROFS;
@@ -306,6 +317,7 @@
}
error = notify_change(nd.dentry, &newattrs);
dput_and_out:
+ intent_release(nd.dentry);
path_release(&nd);
out:
return error;
@@ -322,6 +334,7 @@
int old_fsuid, old_fsgid;
kernel_cap_t old_cap;
int res;
+ struct lookup_intent it = { .it_op = IT_GETATTR };

if (mode & ~S_IRWXO) /* where's F_OK, X_OK, W_OK, R_OK? */
return -EINVAL;
@@ -339,13 +352,14 @@
else
current->cap_effective = current->cap_permitted;

- res = user_path_walk(filename, &nd);
+ res = user_path_walk_it(filename, &nd, &it);
if (!res) {
res = permission(nd.dentry->d_inode, mode);
/* SuS v2 requires we report a read only fs too */
if(!res && (mode & S_IWOTH) && IS_RDONLY(nd.dentry->d_inode)
&& !special_file(nd.dentry->d_inode->i_mode))
res = -EROFS;
+ intent_release(nd.dentry);
path_release(&nd);
}

@@ -361,6 +375,7 @@
int error;
struct nameidata nd;
char *name;
+ struct lookup_intent it = { .it_op = IT_GETATTR };

name = getname(filename);
error = PTR_ERR(name);
@@ -369,11 +384,12 @@

error = 0;
if (path_init(name,LOOKUP_POSITIVE|LOOKUP_FOLLOW|LOOKUP_DIRECTORY,&nd))
- error = path_walk(name, &nd);
+ error = path_walk_it(name, &nd, &it);
putname(name);
if (error)
goto out;

+ nd.dentry->d_it = &it;
error = permission(nd.dentry->d_inode,MAY_EXEC);
if (error)
goto dput_and_out;
@@ -381,6 +397,7 @@
set_fs_pwd(current->fs, nd.mnt, nd.dentry);

dput_and_out:
+ intent_release(nd.dentry);
path_release(&nd);
out:
return error;
@@ -421,6 +438,7 @@
int error;
struct nameidata nd;
char *name;
+ struct lookup_intent it = { .it_op = IT_GETATTR };

name = getname(filename);
error = PTR_ERR(name);
@@ -429,11 +447,12 @@

path_init(name, LOOKUP_POSITIVE | LOOKUP_FOLLOW |
LOOKUP_DIRECTORY | LOOKUP_NOALT, &nd);
- error = path_walk(name, &nd);
+ error = path_walk_it(name, &nd, &it);
putname(name);
if (error)
goto out;

+ nd.dentry->d_it = &it;
error = permission(nd.dentry->d_inode,MAY_EXEC);
if (error)
goto dput_and_out;
@@ -446,6 +465,7 @@
set_fs_altroot();
error = 0;
dput_and_out:
+ intent_release(nd.dentry);
path_release(&nd);
out:
return error;
@@ -490,12 +510,14 @@
struct inode * inode;
int error;
struct iattr newattrs;
+ struct lookup_intent it = { .it_op = IT_SETATTR };

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_it(filename, &nd, &it);
if (error)
goto out;
inode = nd.dentry->d_inode;

+ nd.dentry->d_it = &it;
error = -EROFS;
if (IS_RDONLY(inode))
goto dput_and_out;
@@ -511,6 +532,7 @@
error = notify_change(nd.dentry, &newattrs);

dput_and_out:
+ intent_release(nd.dentry);
path_release(&nd);
out:
return error;
@@ -580,10 +602,13 @@
{
struct nameidata nd;
int error;
+ struct lookup_intent it = { .it_op = IT_SETATTR };

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_it(filename, &nd, &it);
if (!error) {
+ nd.dentry->d_it = &it;
error = chown_common(nd.dentry, user, group);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -593,10 +618,13 @@
{
struct nameidata nd;
int error;
+ struct lookup_intent it = { .it_op = IT_SETATTR };

- error = user_path_walk_link(filename, &nd);
+ error = user_path_walk_link_it(filename, &nd, &it);
if (!error) {
+ nd.dentry->d_it = &it;
error = chown_common(nd.dentry, user, group);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -630,10 +658,16 @@
* for the internal routines (ie open_namei()/follow_link() etc). 00 is
* used by symlinks.
*/
+extern int open_namei_it(const char *filename, int namei_flags, int mode,
+ struct nameidata *nd, struct lookup_intent *it);
+struct file *dentry_open_it(struct dentry *dentry, struct vfsmount *mnt,
+ int flags, struct lookup_intent *it);
+
struct file *filp_open(const char * filename, int flags, int mode)
{
int namei_flags, error;
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_OPEN };

namei_flags = flags;
if ((namei_flags+1) & O_ACCMODE)
@@ -641,14 +675,15 @@
if (namei_flags & O_TRUNC)
namei_flags |= 2;

- error = open_namei(filename, namei_flags, mode, &nd);
- if (!error)
- return dentry_open(nd.dentry, nd.mnt, flags);
+ error = open_namei_it(filename, namei_flags, mode, &nd, &it);
+ if (error)
+ return ERR_PTR(error);

- return ERR_PTR(error);
+ return dentry_open_it(nd.dentry, nd.mnt, flags, &it);
}

-struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+struct file *dentry_open_it(struct dentry *dentry, struct vfsmount *mnt,
+ int flags, struct lookup_intent *it)
{
struct file * f;
struct inode *inode;
@@ -691,6 +726,7 @@
}
f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);

+ intent_release(dentry);
return f;

cleanup_all:
@@ -705,11 +741,17 @@
cleanup_file:
put_filp(f);
cleanup_dentry:
+ intent_release(dentry);
dput(dentry);
mntput(mnt);
return ERR_PTR(error);
}

+struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+{
+ return dentry_open_it(dentry, mnt, flags, NULL);
+}
+
/*
* Find an empty file descriptor entry, and mark it busy.
*/
--- lum-pristine/fs/stat.c Thu Sep 13 19:04:43 2001
+++ lum/fs/stat.c Mon Aug 12 00:04:39 2002
@@ -13,6 +13,7 @@

#include <asm/uaccess.h>

+extern void intent_release(struct dentry *de);
/*
* Revalidate the inode. This is required for proper NFS attribute caching.
*/
@@ -135,13 +135,15 @@
asmlinkage long sys_stat(char * filename, struct __old_kernel_stat * statbuf)
{
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_GETATTR };
int error;

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_it(filename, &nd, &it);
if (!error) {
error = do_revalidate(nd.dentry);
if (!error)
error = cp_old_stat(nd.dentry->d_inode, statbuf);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -151,13 +153,15 @@
asmlinkage long sys_newstat(char * filename, struct stat * statbuf)
{
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_GETATTR };
int error;

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_it(filename, &nd, &it);
if (!error) {
error = do_revalidate(nd.dentry);
if (!error)
error = cp_new_stat(nd.dentry->d_inode, statbuf);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -172,13 +176,15 @@
asmlinkage long sys_lstat(char * filename, struct __old_kernel_stat * statbuf)
{
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_GETATTR };
int error;

- error = user_path_walk_link(filename, &nd);
+ error = user_path_walk_link_it(filename, &nd, &it);
if (!error) {
error = do_revalidate(nd.dentry);
if (!error)
error = cp_old_stat(nd.dentry->d_inode, statbuf);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -189,13 +195,15 @@
asmlinkage long sys_newlstat(char * filename, struct stat * statbuf)
{
struct nameidata nd;
+ struct lookup_intent it = { .it_op = IT_GETATTR };
int error;

- error = user_path_walk_link(filename, &nd);
+ error = user_path_walk_link_it(filename, &nd, &it);
if (!error) {
error = do_revalidate(nd.dentry);
if (!error)
error = cp_new_stat(nd.dentry->d_inode, statbuf);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -247,20 +255,21 @@
{
struct nameidata nd;
int error;
+ struct lookup_intent it = { .it_op = IT_READLINK };

if (bufsiz <= 0)
return -EINVAL;

- error = user_path_walk_link(path, &nd);
+ error = user_path_walk_link_it(path, &nd, &it);
if (!error) {
struct inode * inode = nd.dentry->d_inode;
-
error = -EINVAL;
if (inode->i_op && inode->i_op->readlink &&
!(error = do_revalidate(nd.dentry))) {
UPDATE_ATIME(inode);
error = inode->i_op->readlink(nd.dentry, buf, bufsiz);
}
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -333,12 +342,14 @@
{
struct nameidata nd;
int error;
+ struct lookup_intent it = { .it_op = IT_GETATTR };

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_it(filename, &nd, &it);
if (!error) {
error = do_revalidate(nd.dentry);
if (!error)
error = cp_new_stat64(nd.dentry->d_inode, statbuf);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -348,12 +359,14 @@
{
struct nameidata nd;
int error;
+ struct lookup_intent it = { .it_op = IT_GETATTR };

- error = user_path_walk_link(filename, &nd);
+ error = user_path_walk_link_it(filename, &nd, &it);
if (!error) {
error = do_revalidate(nd.dentry);
if (!error)
error = cp_new_stat64(nd.dentry->d_inode, statbuf);
+ intent_release(nd.dentry);
path_release(&nd);
}
return error;
@@ -363,6 +376,7 @@
{
struct file * f;
int err = -EBADF;
+ struct lookup_intent it = { .it_op = IT_GETATTR };

f = fget(fd);
if (f) {
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-05 17:03:20

by David Lang

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Thu, 5 Sep 2002, Peter T. Breuer wrote:

> Hi .. I'm currently in an internet cafe in Nice, France, watching the
> rain come down, so forgive me if I don't do you justice. I find this
> reply to be excellent for my purposes. Thank you.

enjoy yourself

> "Helge Hafting wrote:"
> > "Peter T. Breuer" wrote:
> > 2.
> > The fs however is using a smaller blocksize, such as 4k. So your
> > big request is broken down into a bunch of requests for the
> > first 4k block, the second 4k block and so on up to the
> > 2560th 4k block. So far, everything happens fast no matter
> > what kind of fs, or even your no-cache scheme.
>
> Fine, but where is the log/phys translation done? I presume that the
> actual inode contains sufficient info to do the translation, because
> the inode has a physical location on disk, and it is also associated
> with a file, and what we do is generally start from the inode and trace
> down to where the inode says the logical block should be, and then look
> it up. During this time the inode location on disk must be locked
> (with a read lock). I can do that. If you let me have "tag
> requests" in the block layers and let me generate them in the VFS
> layers. Yes, I agree, I have to know where the inode is on disk
> in order to generate the block request, but the FS will know,
> and I just want it to tell VFS .. well, too much detail.
>

Ahh, but the problem is that you have to look up where the inode is, so you
have to start from the layer above that, and so on, eventually getting to the
large number of accesses documented in an earlier post.
Remember that the low levels don't know the difference between data and an
inode; only the filesystem code knows which is which.

> > about where the blocks are is called "metadata".
> > So we need to look at metadata for every little 4k block.
>
> No .. I don't see that. Not every block has some unique metadata
> associated with it, surely? I thought that normally inodes
> were "direct", that is, pointing at contiguous lumps? Sure,
> sometimes some other lookups might be required. but often? No.

But if you cache the inode contents then you have consistency problems
between multiple machines; if you don't cache the inodes then you have to
find them and read their contents each time.

> Especially if you imagine that 99.99% of the ops on the file system will
> be rewriting or rereading files. Just a "metadata updated" flag on the
> FS might be useful to avoid looking everything up again every time,
> but I really would like to see how much overhead there IS first.

But now this is filesystem-specific, not a generic mechanism.

> > That isn't a problem usually, because the metadata is
> > small and is normally cached entirely, even for a large
> > file. So we can look up "wher block 1 is on disk, where block 2
> > is on disk..." by looking at a little table in memory.
>
> Well, or on disk :-)

Take a look at disk seek times: newer disks have increased the transfer
rate and capacity significantly, but seek times are hovering in the mid
single-digit ms range. You have to seek to the place on disk that holds
your metadata (potentially several seeks) and then seek back to the data;
if the data happened to be the next block on disk, you have now bounced the
head all over the disk in between, eliminating the elevator algorithm and
any chance of merging the requests.

> May have, but probably won't. To have messed with it it must
> have messed with the inode, and we can ask the driver if anyone
> but us has written to that spot (the inode), and if not, not
> reread it. That's just an idea, but really, I would prefer to reread
> it. Data reads and writes will generally be in lumps the order of
> 0.25M. An extra 4K on that is latency, not appreciable extra data.

But at the low levels the data is all 4K reads. These reads can be merged
if the metadata tells you that they are adjacent, but not if you have to
look up the metadata between each block (since you can't look it up in
memory...)

> > So we have to read metadata *again* for each little
> > block before we are able to read it, for you don't let
>
> No, not each little block. I disagree totally. Remember that
> we get the lock on the inode area at the start of the operation. Nobody
> can mess with it while we do the sequence of small reads or writes
> (which will be merged into large reads or writes). That was the
> whole point of my request for extra "tag requests" in the block
> layer. I want to be able to request a region lock, by whatever
> mechanism.

This locking is the coordination between multiple machines that the
specialized distributed filesystems implement. If you are going to
implement this on every filesystem then you are going to turn every one of
them into a cooperative DFS.

> Well, you forget that we had to seek there anyway, and that the
> _server_ kernel has cached the result (I am only saying that the
> _client_ kernels need O_*DIRECT) , and that therefore there is
> no ondisk seek on the real device, just a "memory seek" in the
> guardian of the real device.
>
> > And where on disk is the metadata for this file?
> > That information is cached too, but you
> > disallow caching. Rmemeber, the cache isn't just for
>
> You are confusing yourself with syntax instead of semantics, I think ..
> I can't relate this account to anything that really happens right now.
> The words "no caching" only apply to the kernel for which the
> real device is remote. On the kernel for which it is local, of course
> there is block-level caching (though not FS level caching, I hope!).

Ok, here is one place where the disconnect is happening.

You are thinking of lots of disks attached to servers, which the servers
then re-export out to the network. This is NFS with an added layer on top
of it. Sharing this is simple, but putting another filesystem on top of it
is of questionable use.

What the rest of us are thinking of is the Storage Area Network (SAN)
topology, where you have large arrays of disks with fibre channel on each
disk (or each shelf of disks) and fibre channel into each server. Every
server can access every disk directly; there is no 'local' server to do
any caching.

Anyplace that is running multiple TB of disks for multiple servers is almost
certainly going to be doing something like this; otherwise you waste
bandwidth on your 'local' server when a remote machine wants to access the
drives, and that bandwidth is frequently the performance bottleneck for
those local boxes (even 66 MHz 64-bit PCI is easily swamped once you start
talking about multiple Gb NICs plus disk I/O).

David Lang

2002-09-05 22:39:45

by Daniel Phillips

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Wednesday 04 September 2002 16:13, Anton Altaparmakov wrote:
> Did you read my post which you can lookup on the below url?
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103109165717636&w=2
>
> That explains what a single byte write to an uncached ntfs volume entails.
> (I failed to mention there that you would actually need to read the block
> first modify the one byte and then write it back but if you write
> blocksize based chunks at once the RCW falls away.
> And as someone else pointed out I failed to mention that the entire
> operation would have to keep the entire fs locked.)
>
> If it still isn't clear let me know and I will attempt to explain again in
> simpler terms...

Anton, it's clear he understands the concept, and doesn't care because
he does not intend his application to access the data a byte at a time.

Your points are correct, just not relevant to what he wants to do.

--
Daniel

2002-09-05 23:01:58

by Daniel Phillips

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Thursday 05 September 2002 16:24, Peter T. Breuer wrote:
> Fine, but where is the log/phys translation done?

It's done here:

http://lxr.linux.no/source/mm/filemap.c#L438

> I presume that the
> actual inode contains sufficient info to do the translation,

The inode, plus an arbitrary amount of associated, on-disk metadata.

> because
> the inode has a physical location on disk, and it is also associated
> with a file, and what we do is generally start from the inode and trace
> down to where the inode says the logical block should be, and then look
> it up. During this time the inode location on disk must be locked
> (with a read lock). I can do that. If you let me have "tag
> requests" in the block layers and let me generate them in the VFS
> layers. Yes, I agree, I have to know where the inode is on disk
> in order to generate the block request, but the FS will know,
> and I just want it to tell VFS .. well, too much detail.

Actually, while much of this seems simple when thought of in abstract
terms, in practice it's quite involved. You'll need to allocate at
least a few months to code study just to know what the issues are, and
that is with the help of the various excellent resources that are
available.

I'd suggest starting here, and figuring out how the ->get_block
interface works:

http://lxr.linux.no/source/fs/buffer.c#L1692

--
Daniel

2002-09-05 23:02:46

by Anton Altaparmakov

Subject: Re: (fwd) Re: [RFC] mount flag "direct"

At 23:45 05/09/02, Daniel Phillips wrote:
>On Wednesday 04 September 2002 16:13, Anton Altaparmakov wrote:
> > Did you read my post which you can lookup on the below url?
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103109165717636&w=2
> >
> > That explains what a single byte write to an uncached ntfs volume entails.
> > (I failed to mention there that you would actually need to read the block
> > first modify the one byte and then write it back but if you write
> > blocksize based chunks at once the RCW falls away.
> > And as someone else pointed out I failed to mention that the entire
> > operation would have to keep the entire fs locked.)
> >
> > If it still isn't clear let me know and I will attempt to explain again in
> > simpler terms...
>
>Anton, it's clear he understands the concept, and doesn't care because
>he does not intend his application to access the data a byte at a time.
>
>Your points are correct, just not relevant to what he wants to do.

Daniel,

The procedure described is _identical_ if you want to access 1TiB at a
time, because the request is broken down by the VFS into 512 byte size
units and I think I explained that, too. And for _each_ 512 byte sized unit
of those 1TiB you would have to repeat the _whole_ of the described
procedure. So just replace 1 byte with 512 bytes in my post and then repeat
the procedure as many times as it takes to make up the 1TiB. Surely you
should know this... just fundamental Linux kernel VFS operation.

It is not clear to me he understands the concept at all. He thinks that you
need to read the disk inode just once and then you immediately read all the
1TiB of data and somehow all this magic is done by the VFS. This is
complete crap and is _NOT_ how the Linux kernel works. The VFS breaks every
request into super block->s_blocksize sized units and _each_ and _every_
request is _individually_ looked up as to where it is on disk.

Each of those lookups requires a lot of seeks, reads, memory allocations,
etc. Just _read_ my post...

Please, I really wouldn't have expected someone like you to come up with a
statement like that... You really should know better...

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-05 23:11:42

by Daniel Phillips

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Friday 06 September 2002 01:06, Anton Altaparmakov wrote:
> At 23:45 05/09/02, Daniel Phillips wrote:
> >On Wednesday 04 September 2002 16:13, Anton Altaparmakov wrote:
> > > Did you read my post which you can lookup on the below url?
> > >
> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=103109165717636&w=2
> > >
> > > That explains what a single byte write to an uncached ntfs volume entails.
> > > (I failed to mention there that you would actually need to read the block
> > > first modify the one byte and then write it back but if you write
> > > blocksize based chunks at once the RCW falls away.
> > > And as someone else pointed out I failed to mention that the entire
> > > operation would have to keep the entire fs locked.)
> > >
> > > If it still isn't clear let me know and I will attempt to explain again in
> > > simpler terms...
> >
> >Anton, it's clear he understands the concept, and doesn't care because
> >he does not intend his application to access the data a byte at a time.
> >
> >Your points are correct, just not relevant to what he wants to do.
>
> Daniel,
>
> The procedure described is _identical_ if you want to access 1TiB at a
> time, because the request is broken down by the VFS into 512 byte size
> units and I think I explained that, too. And for _each_ 512 byte sized unit
> of those 1TiB you would have to repeat the _whole_ of the described
> procedure.

Well, remember, he was going to use o_direct, so he can get away with pretty
crude locking.

> It is not clear to me he understands the concept at all. He thinks that you
> need to read the disk inode just once and then you immediately read all the
> 1TiB of data and somehow all this magic is done by the VFS. This is
> complete crap and is _NOT_ how the Linux kernel works. The VFS breaks every
> request into super block->s_blocksize sized units and _each_ and _every_
> request is _individually_ looked up as to where it is on disk.

He was going to go hack the vfs, so in his mind, practical issues of how the
vfs works now aren't an obstacle. The only mistake he's making is seriously
underestimating how much effort is required to learn enough to do this kind
of surgery and have a remote chance that the patient will survive.

--
Daniel

2002-09-05 23:19:13

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

At 00:17 06/09/02, Daniel Phillips wrote:
>On Friday 06 September 2002 01:06, Anton Altaparmakov wrote:
> > At 23:45 05/09/02, Daniel Phillips wrote:
> > >On Wednesday 04 September 2002 16:13, Anton Altaparmakov wrote:
> > > > Did you read my post which you can lookup on the below url?
> > > >
> > > > http://marc.theaimsgroup.com/?l=linux-kernel&m=103109165717636&w=2
> > > >
> > > > That explains what a single byte write to an uncached ntfs volume
> entails.
> > > > (I failed to mention there that you would actually need to read the
> block
> > > > first modify the one byte and then write it back but if you write
> > > > blocksize based chunks at once the RCW falls away.
> > > > And as someone else pointed out I failed to mention that the entire
> > > > operation would have to keep the entire fs locked.)
> > > >
> > > > If it still isn't clear let me know and I will attempt to explain
> again in
> > > > simpler terms...
> > >
> > >Anton, it's clear he understands the concept, and doesn't care because
> > >he does not intend his application to access the data a byte at a time.
> > >
> > >Your points are correct, just not relevant to what he wants to do.
> >
> > Daniel,
> >
> > The procedure described is _identical_ if you want to access 1TiB at a
> > time, because the request is broken down by the VFS into 512 byte size
> > units and I think I explained that, too. And for _each_ 512 byte sized
> unit
> > of those 1TiB you would have to repeat the _whole_ of the described
> > procedure.
>
>Well, remember, he was going to use o_direct, so he can get away with pretty
>crude locking.
>
> > It is not clear to me he understands the concept at all. He thinks that
> you
> > need to read the disk inode just once and then you immediately read all
> the
> > 1TiB of data and somehow all this magic is done by the VFS. This is
> > complete crap and is _NOT_ how the Linux kernel works. The VFS breaks
> every
> > request into super block->s_blocksize sized units and _each_ and _every_
> > request is _individually_ looked up as to where it is on disk.
>
>He was going to go hack the vfs, so in his mind, practical issues of how the
>vfs works now aren't an obstacle. The only mistake he's making is seriously
>underestimating how much effort is required to learn enough to do this kind
>of surgery and have a remote chance that the patient will survive.

Ok so he plans to rewrite the whole I/O subsystem starting from block up to
FS drivers and VFS. Great. See him again in 10 years... And his changes
probably will never reach the mainstream kernel because they will optimize
for some unlikely application... And by then there won't be any need
because we will have proper DFS solutions... Well if he wants to waste his
time, that's fine. It's not my time...

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-06 08:12:20

by Ingo Oeser

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

Hi Peter,

after following this discussion I have some ideas which might
help you, but I have to ask some more questions first, to find out
clearly what you are trying to do and to express it in terms known
to this audience (which has not happened so far, it seems).

1. You want to stream lots of data from an external source (which
might be the network) to a bunch of machines which are
connected to the same file system, where that file system is
one designed to be local (like FAT, NTFS, ext2 and so on).


2. This filesystem is on a single device, which is attached to
multiple machines, right?

3. Your problem is that your data arrives at some hundred MB/s,
right?

4. You want to solve it by brute force, using multiple machines to
accept the data but sharing one super-fast persistent device
between them, ok?

5. To achieve 4. you cannot trust the caches and want to disable
them?

6. Do you have a "master" machine, or do you want the machines
to peer with each other?

7. Is fast bidirectional communication between the machines over
some kind of wire/network possible?

I would suggest not disabling the caches, but using timed leases
on them, so as not to penalize read-only metadata operations
(which are ~100% in your case).

If you read in some cache item, you tell all your machines that
you did so and want to use it read-only for a time of X. That is
called a "lease", because other machines can tell you that you
cannot use your cache anymore, because they need to invalidate
it. Leases can be broken by the grantor, as opposed to locks,
which can only be released by the holder.

You can make the cache items very big and chunk them together to
reduce communication overhead, so performance should not be that
bad.

Writes happen rarely, you say, so the write path can be as
simple as:

1) Disallow new leases on the related items (if you cannot
work out the relations, then disallow ALL new leases).

2) Break all related leases, telling the lease clients and
waiting for them to ACK the breakage.

3) Do your update and flush it to disk.

4) Re-allow leases.

Among your remaining problems are file extension and mtime updates.

This could be solved by restricting both to only one machine and
propagating the changes to the other machines. It's like an
allocator thread, if you are familiar with this concept.

These mechanisms are basically some kind of cache coherency
protocol. A special network between these machines just for that
might be worthwhile.

You don't need to change ANY on-disk format.
You only need to change the locks guarding kernel metadata
caches into leases.

Now you have something to work with which is well known,
performs well in your use case and is acceptable in OS terms.

But please answer my questions and doubts first, because I might
be way off, if my assumptions are wrong.

Hope this helps.

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

2002-09-06 08:53:25

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Anton Altaparmakov wrote:"
> [DanP]
> >He was going to go hack the vfs, so in his mind, practical issues of how the
> >vfs works now aren't an obstacle. The only mistake he's making is seriously
> >underestimating how much effort is required to learn enough to do this kind
> >of surgery and have a remote chance that the patient will survive.
>
> Ok so he plans to rewrite the whole I/O subsystem starting from block up to
> FS drivers and VFS. Great. See him again in 10 years... And his changes

I've had a look now. I concur that e2fs at least is full of assumptions
that it has various different sorts of metadata in memory at all times.
I can't rewrite that, or even add a "warning will robinson" callback to
vfs, as I intended. What I can do, I think, is ..

.. simply do a dput of the fs root directory from time to time, from
vfs - whenever I think it is appropriate. As far as I can see from my
quick look that may cause the dcache path off that to be GC'ed. And
that may be enough to do a few experiments for O_DIRDIRECT.

Is that (approx) correct?

Peter

2002-09-06 09:00:50

by Helge Hafting

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Peter T. Breuer" wrote:
>
> HI .. I'm currently in an internet cafe in nice, france, watching the
> rain come down, so forgive me if I don't do you justice. I find this
> reply to be excelent for my purposes. Thank you.

Hope you get some better weather soon. :-)

> "Helge Hafting wrote:"
> > "Peter T. Breuer" wrote:
> > > I'm not following you. It seems to me that you are discussing the
> > > process of getting a buffer to write to prior to letting it age and
> > > float down to the device driver via the block layers. But we have
> ...
>
> > Some very basic definitions that anyone working on a
> > disk-based filesystem should know:
>
> Thanks.
>
[...]
>
> > 2.
> > The fs however is using a smaller blocksize, such as 4k. So your
> > big request is broken down into a bunch of requests for the
> > first 4k block, the second 4k block and so on up to the
> > 2560th 4k block. So far, everything happens fast no matter
> > what kind of fs, or even your no-cache scheme.
>
> Fine, but where is the log/phys translation done?

In the filesystem-specific code, because different
filesystems have different ways of storing the log/phys
translation information. The format of this information
is one of the things that makes filesystems different.

The application asks the VFS to read a large amount from a file,
the VFS repeatedly asks the FS to provide one block from the file,
the FS uses metadata (from cache, asking the block device to read
it into cache if it isn't there already) to look up
where on disk the block is, then the FS asks the block
device to "read that block into process memory."

> I presume that the
> actual inode contains sufficient info to do the translation,

No, it doesn't. Some filesystems manage to squeeze all the translation
information into the inode for rather small files. But you are
going to deal with big files.

Be aware that inodes are small, fixed-size things (for
performance). They do, however, contain pointers to more
information when that is necessary.

Remember, logical->physical translation needs to be done for
each disk block (4k or so). That means such information
is stored somewhere - for each disk block. Traditional
unix block-based filesystems (like ext2) store one disk
address per disk block, meaning the size of this metadata is
proportional to the file size. Translation
information is about 1/1024 of file size, so a 1GB file
has about 1M of translation info and so on. That doesn't
fit in a fixed-size inode that is smaller than a single disk block.

The first few blocks of a file are usually specified in the inode
itself, as a performance optimization for small files. The rest is
stored using indirect blocks (blocks full of disk addresses).

The (unix) inode contains a disk address for an indirect block,
which is full of disk addresses mapping 1024 file blocks in
the case of 4k blocks and 4-byte disk addresses.
That may not be enough either, so the inode also has a disk
address for a double indirect block. That one holds
1024 disk addresses for more indirect blocks, each holding
1024 file block addresses. Now we're able to look up
the first 1024*1024 blocks of the file. With one block
being 4k, we have 4G files unless my math is wrong somewhere.
Of course we may want even bigger files, so there is a
triple indirect block too, able to map another 1024*1024*1024
4k blocks. This is considered "enough" for most uses,
and there is no more capacity in our inode.

> because
> the inode has a physical location on disk, and it is also associated
> with a file, and what we do is generally start from the inode and trace
> down to where the inode says the logical block shoul dbe, and then look
> it up. During this time the inode location on disk must be locked
> (with a read lock).

In the case of ext2, you'll be doing something like
1. lock the inode, verify that it still exists, read it
2. the inode has a disk address for a double indirect block, read
that from disk
3. look up the correct indirect block in the double one, read that
4. look up the physical block in the indirect block, read the
physical block
5. release the inode lock.

This is 4 reads, to locations that are likely spread all over the
disk. 4 seek operations, each taking much more time than all the
actual reading. Even if we don't worry about that, you'll be
reading 4 blocks in order to get one block of your _file_. This
divides effective bandwidth by 4. But it will actually be hundreds
of times worse because of the seeking.

> I can do that. If you let me have "tag
> requests" in the block layers and let me generate them in the VFS
> layers. Yes, I agree, I have to know where the inode is on disk
> in order to generate the block request, but the FS will know,

Sure, the FS will know. But it doesn't know it in any magical
way. It finds out because metadata is cached, or by
looking it up on disk when it isn't. But looking things
up on disk takes time.

> and I just want it to tell VFS .. well, too much detail.
>
> > 3.
> > The reason for this breakup is that the file might be fragmented
> > on disk. No one can know that until it is checked, so
>
> Yes, indeed. I agree.
>
> > each block is checked to see where it is. The kernel
> > tries to be smart and merges requests whenever the
> > logical blocks in the file happens to be consecutive
>
> That's fine. This is the elevator alg. working in the block layers.
> Theer's no problem there. What I want to know is more about the
> lookup procedure for the inode, but already this much info is enough
> to confirm the picture I have. One needs to lock the inode position
> on disk.
>
> > 4.
> > The list of physical positions (or physical blocks)
> > are passed to thbe disk drivers and the read
>
> > Number (3) is where the no-cache operation breaks
> > down completely. We have to look up where on the _disk_
> > _every_ small block of the file exists. The information
>
> Yes. But this is not a "complete breakdown" but a requirement to
> (a) lock bits of disk not associated with data while we look at them,
> and (b) relook again next time.
>
Well, not a complete breakdown, because it _works_. But it'll be
so _slow_ that it is useless for anyone wanting performance. And
you seem to want this to be fast. Everybody wants speed of course,
but you specifically want to optimize for large transfers. That is
doable, but I can't see it happening with _no_ caching of metadata.

> Frankly, a single FS lock works at this point. That is, a single
> location on disk (say the sb) that can be locked by some layer.
> It can be finer, but there's no need to. The problem is avoiding
> caching the metadata lookup that results, so that next time we look it
> up again.

> > about where the blocks are is called "metadata".
> > So we need to look at metadata for every little 4k block.
>
> No .. I don't see that. Not every block has some unique metadata
> associated, with it, surely?

Yes, every block has unique metadata on a unix fs, and that is
the information about where on the disk the actual block is
stored. Such information is stored for every block of the file.
Otherwise, how could we know where on disk the blocks are? They
don't necessarily follow each other in any nice order. Most
fs'es _try_ for a contiguous layout, because that allows much
faster operation. (The disk doesn't have to seek as long as the
blocks are contiguous; it can read one block directly after
another. That gives us lots of performance. Seeking to a different
position takes a very long time compared to reading a single
block.)

Although the fs tries for contiguous positions, it can't always
get that, because the disk may have been used before, so
free space is scattered all over in small chunks. That's
when you need a metadata format that allows the blocks
of a file to be scattered all over. They usually aren't, but
the metadata format must be able to cope with it. So the
position of every little block _is_ stored. A single
4k disk block can hold 1024 disk addresses though, so
this metadata doesn't take up much space. But it
has to be consulted for every block of a file.
You could try a simple modification to ext2, so it
discards such blocks from cache after each use. (The blocks
will have to be read again for each use then.) Then
see how slow the filesystem gets.


> I thought that normally inodes
> were "direct", that is, pointing at contiguous lumps?

No, most filesystems aren't like that. I have described above
how unix fs'es tend to work, and that is what you have
to deal with _if_ you want to modify existing filesystems.

There are some historical filesystems that store each file
as a contiguous lump. All necessary info fits in the inode
for such a fs, but nobody uses them because they can't deal
with fragmentation at all. I.e. there may be 10GB free space
on the disk, but no single contiguous free lump is bigger
than 1M (there are over 10,000 small lumps of free space)
so we can't even fit a 2M file in that fs.

There are also filesystems like hpfs that store one unit
of allocation information for each contiguous lump
instead of each block. That works well as long as there
is little fragmentation, but you get an O(n) worst case
lookup time for the last block in a badly fragmented
file, where a unix fs always has an O(1) lookup time.
(n is the number of blocks in the file)

The hpfs approach may seem vastly superior as long as
we avoid bad fragmentation, but it doesn't buy that much
for systems that use caching. The ext2 metadata
is small compared to the file size even if it is proportional.


> Sure,
> sometimes some other lookups might be required. but often? No.

Yes - _if_ using ext2. No, _if_ using hpfs and there's little
fragmentation.

But you seem to overlook something.
Each time you drop the fs lock, you'll have to re-read
the inode even though it is the same inode as last time. Because
it may have changed. _Probably_ not, but you can only _know_
by reading it. So a fs design with little metadata won't help,
you have to read it anyway if you dropped the lock. If you
didn't drop the lock then you're free to enjoy the benefits of metadata
caching, but then the other nodes can't access that fs in the meantime.

> Especially if you imagine that 99.99% of the ops on the file system will
> be rewriting or rereading files. Just a "metadata updated" flag on the
> FS might be useful to avoid looking everything up again avery time,
> but I realy would like to see how much overhead there IS first.

That might help for your case, sure. You could then cache
metadata as long as the flag doesn't change. There is
a problem though: you might still need to check that flag
for each operation, and going over the network is
_slow_ compared to mostly working in memory.

> > That isn't a problem usually, because the metadata is
> > small and is normally cached entirely, even for a large
> > file. So we can look up "wher block 1 is on disk, where block 2
> > is on disk..." by looking at a little table in memory.
>
> Well, or on disk :-)
>
> > But wait - you don't allow caching, because some other node
>
> Thanks, but I know. That's right - we have to look up the inode again.
> That's precisely what I want to do.
>
> > may have messed with this file!
>
> May have, but probably won't.

Sure. The problem is that you can't know. The probability
is low, but you can't know whether the inode was updated.

> To have messed with it it must
> have messed with the inode, and we can ask the driver if anyone
> but us has written to that spot (the inode), and if not, not
> reread it.

What driver? The disk controller driver? It keeps no record
of who writes where. If it did, it'd quickly run out of memory
storing that kind of info.

> That's just an idea, but really, I would prefer to reread
> it. Data reads and writes will generally be in lumps the order of
> 0.25M. An extra 4K on that is latency, not appreciable extra data.
>
> > So we have to read metadata *again* for each little
> > block before we are able to read it, for you don't let
>
> No, not each little block. I disagree totally. Remember that
> we get the lock on the inode area at the start of the opn.

Remember that the inode isn't enough for some of the fs'es you
want to modify. Nothing wrong in locking down the
rest of the metadata too for the duration of your
0.25M transaction though.

> Nobody
> can mess with it while we do the sequence of small reads or writes
> (which will be merged into large reads or writes). That was the
> whole point of my request for extra "tag requests" in the block
> layer. I want to be able to request a region lock, by whatever
> mechanism.

This starts looking better. You lock the entire file,
or at least the region you want to work on, _with_
associated metadata.

Try to be aware of terminology though. What you have specified
now _is_ a system for metadata cache coherency. A good
implementation might work well too. But you started out by
stating that you didn't want to cache anything at all and wanted
to turn caching off entirely because cache coherency was "too
hard".

Now you want to lock down regions leaving other
regions open for other nodes in the cluster to reserve.


>
> > us cache that in order to look up the next block.
>
> At this point the argument has gone wrong, so I can't follow ...
>
> > Now metadata is small, but the disk have to seek
> > to get it. Seeking is a very expensive operation, having
> > to seek for every block read will reduce performance to
> > perhaps 1% of streaming, or even less.
>
> Well, you forget that we had to seek there anyway,

No. The nice thing about disks is that we
don't seek to every block and then read it. We take
advantage of the fact that after disk block 1 comes
disk block 2 and so on. So, if we just read block 1 then
block 2 is available _without_ any seeking at all.
Consider that seeking is much slower than reading, and you
see that this zero-seek reading gets us a lot of
performance. That's why we have elevator algorithms, and that
is why filesystems try their best to lay out files
contiguously whenever possible. It is in order to
enjoy the performance of zero-seek reading and writing,
which is just what you too need for your big-file streaming.

Anything that throws in a few extra seeks here and there
will slow down streaming considerably. It doesn't help
that the data volume is small, because transfer speed
isn't the issue at all. Moving disk heads around is
much slower; you could read _many_ blocks (tens, hundreds) in
the time spent on moving disk heads to a different location.

> and that the
> _server_ kernel has cached the result (I am only saying that the
> _client_ kernels need O_*DIRECT) , and that therefore there is
> no ondisk seek on the real device, just a "memory seek" in the
> guardian of the real device.
>
You should also take a look at network speed and latency.
Transferring unnecessary blocks across the network
takes time too. Your transfer speed is probably very high,
and the volume of metadata is small. But not so small
if you keep requesting the same block of metadata over
and over because you want to look at the next
disk address stored in some indirect block.
Of course you can avoid that if you have a lock
on the entire 0.25M region of the file (with
metadata included) but you can't avoid
it if you don't want to cache metadata _at all_.

Consider the latency issue too. The transfer
of a metadata block may be fast, but there is
a roundtrip time where you're waiting. The other
node doesn't necessarily respond instantaneously.

Other cluster systems have seen performance limitations
from cache coherency issues even if they only transfer
small packets for this purpose. An avoidable transfer
of an entire disk block could cost you more.

> > And where on disk is the metadata for this file?
> > That information is cached too, but you
> > disallow caching. Rmemeber, the cache isn't just for
>
> You are confusing yourself with syntax instead of semantics, I think ..
> I can't relate this account to anything that really happens right now.
> The words "no caching" only apply to the kernel for which the
> real device is remote. On the kernel for which it is local, of course
> there is block-level caching (though not FS level caching, I hope!).

So, you're no longer limited by _disk_ issues. Still, if you
insist on no caching at all (of metadata) you'll end up
transferring 4 blocks or more for _each_ block in an ext2
filesystem because of the indirect block lookup on _every_ disk
block. It has to be done, and it divides your streaming bandwidth
by 4 at least. Of course you can reserve the metadata (inode &
indirect blocks for that file) so others can't write
them, and cache that on the client until your 0.25M is done.

But don't say you aren't caching anything if you do that!
This _is_ metadata caching on the client side.
Not caching any _file_ data may indeed make sense for
your large transfers, but don't forget the metadata necessary
to find the file data.

When we speak of "cache" in general, we mean both data
and all kinds of metadata. If you mean to say that you want to
not cache any _file_ data but lock down and cache some
metadata for the duration of a transfer then please
say so. You'll avoid a lot of misunderstanding that way.
And it looks like something that might work too, so
you might get people interested instead of irritated
by ideas obscured by ill-formed, easily misunderstood
specifications.

If you really want "no cache at all" on the client then
you still get your streaming bandwidth divided by n, where
n is the number of metadata blocks necessary to look up
per file block.
You won't suffer from ill-ordered disk requests, but
you'll transfer a lot of metadata over and over.

If you lock down the metadata for a while, so the
client can keep looking at the same copy for
a while before discarding it, then you _are_
implementing limited metadata caching _and_ a
simple cache coherency protocol.

> This is all way off beam .. sorry. I'd like to know why people
> imagine in this way .. it doesn't correspond to the real state
> of play.
>
Seems to be a language problem - you started by stating
that you wanted to do no caching, when it turns out
you meant something different. Technical language is
very precise, and definitions can't be bent a little.


> Well, tell me whatyou mean by "logical". telling me you mean "offset
> within file" is enough explanation for me!
Good. There is no way I could _know_ that though.

> I really wish you would stop patronising - you are confusing unfamiliarity with YOUR jargon
> for unfamiliarity with the logic and the science involved. Try again.

Sorry. You seem to be on to something. We do, however,
occasionally get clueless people who insist on the impossible and
won't go away. Those people are sometimes hard to tell from people
who merely misunderstand or use the language/jargon wrongly.

[...]

> > > Is that relevant?
> > Both irrelevant and wrong. Direct io isn't serialized, but
>
> It appears to be. I have measured the sequencialization that takes
> place on reads and writes through direct_io at the block level in
> 2.5.31, and as far as in can tell, a new write request does not go
> out before the "previous" one has come back in. Shrug.

Again, this depends on precise definitions. One process issuing
one direct io after another will of course have them
executed in order - serialized by the process itself.

A general statement about "direct io" covers _all_ direct io though.

What if I have several processes on my SMP machine all issuing
direct io operations on my multiple local disks?

Process 1 on cpu 1 starts direct io against disk 1
Slightly later, while process 1's io operation barely has
begun, process 2 on cpu 2 starts direct io against disk 2.

There's no need for process 2 to wait until process 1's
io completes, it is against another device. There's even
another cpu to handle it, although that isn't really
necessary as one cpu has plenty of time to start
two simultaneous requests against different devices.

Process 2's io might even complete long before
process 1's io, because disk 2 is a faster disk. Or
because disk 1 is busy and a third process has a long queue
of outstanding io's against it.

The io's might even get re-ordered in a two-process
one-cpu one-disk case, because the second io issued
is nearer the disk heads and so the elevator
algorithm prefers to do the second io first.

Direct io is merely a way of saying don't waste
cache capacity on this, I know the usage pattern
and know that it won't pay off.

The one-process case is serialized because a
process waits on its _own_ io operation, so it
is obvious that it can't issue a new one until
the old one completes.

Helge Hafting

2002-09-06 09:04:09

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

At 09:57 06/09/02, Peter T. Breuer wrote:
>"Anton Altaparmakov wrote:"
> > [DanP]
> > >He was going to go hack the vfs, so in his mind, practical issues of
> how the
> > >vfs works now aren't an obstacle. The only mistake he's making is
> seriously
> > >underestimating how much effort is required to learn enough to do this
> kind
> > >of surgery and have a remote chance that the patient will survive.
> >
> > Ok so he plans to rewrite the whole I/O subsystem starting from block
> up to
> > FS drivers and VFS. Great. See him again in 10 years... And his changes
>
>I've had a look now. I concur that e2fs at least is full of assumptions
>that it has various different sorts of metadata in memory at all times.
>I can't rewrite that, or even add a "warning will robinson" callback to
>vfs, as I intended. What I can do, I think, is ..

Oh, you saw the light. (-: I can assure you that most file systems make
such assumptions. Certainly NTFS does... I designed the driver all around
caching metadata to optimize access to actual data. Once I enable
direct_IO, I plan to keep all metadata caching in place, just stop caching
the actual file data. That should give maximum performance I think.

>.. simply do a dput of the fs root directory from time to time, from
>vfs - whenever I think it is appropriate. As far as I can see from my
>quick look that may cause the dcache path off that to be GC'ed. And
>that may be enough to do a few experiments for O_DIRDIRECT.

What does "GC" mean?

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-06 09:12:29

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Anton Altaparmakov wrote:"
> At 09:57 06/09/02, Peter T. Breuer wrote:
> >"Anton Altaparmakov wrote:"
> >I've had a look now. I concur that e2fs at least is full of assumptions
> >that it has various different sorts of metadata in memory at all times.
> >I can't rewrite that, or even add a "warning will robinson" callback to
> >vfs, as I intended. What I can do, I think, is ..
>
> Oh, you saw the light. (-: I can assure you that most file systems make

The question is if they do it in a way I can read. If I can read it, I
can fix it. There was too much noise inside e2fs to see a point or points
of intercept. So the intercept has to be higher, and ..

> direct_IO, I plan to keep all metadata caching in place, just stop caching
> the actual file data. That should give maximum performance I think.

But not correct behaviour wrt metadata in a shared disk fs. And your
calculation of "maximum performance" is off. Look, you seem to forget
this:

suppose that I make the FS twice as slow as before by meddling with
it to make it sharable

then I simply share it among 4 nodes to get a two times _speed up_
overall.

That's the basic idea. Details left to reader.

I.e. I don't care if it gets slower. We are talking thousands of nodes
here. Only the detail of the topology is affected by the real numbers.

> >.. simply do a dput of the fs root directory from time to time, from
> >vfs - whenever I think it is appropriate. As far as I can see from my
> >quick look that may cause the dcache path off that to be GC'ed. And
> >that may be enough to do a few experiments for O_DIRDIRECT.
>
> What does "GC" mean?

Garbage collection. I assume that either there is a point at which the
dcache is swept or as I free a dentry from the cache its dependents
are swept. I don't know which. Feel free to enlighten me. My question
is simply "is doing a dput of the base directory dentry on the FS
enough to clear the dcache for that FS"?

Peter

2002-09-06 09:45:32

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On 2002-09-06T11:17:05,
"Peter T. Breuer" <[email protected]> said:

Peter, you seem to be ignoring the long-ish mail I sent to you and l-k on Wed,
4 Sep 2002 11:26:45 +0200, trying to figure out why you are trying what you
are saying. Could you please enlighten us?


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister

2002-09-06 13:14:50

by Helge Hafting

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Peter T. Breuer" wrote:

> > Oh, you saw the light. (-: I can assure you that most file systems make
>
> The question is if they do it in a way I can read. If I can read it, I
> can fix it. There was too much noise inside e2fs to see a point or points
> of intercept. So the intercept has to be higher, and ..
>
> > direct_IO, I plan to keep all metadata caching in place, just stop caching
> > the actual file data. That should give maximum performance I think.
>
> But not correct behaviour wrt metadata in a shared disk fs. And your
> calculation of "maximum performance" is off. Look, you seem to forget
> this:
>
> suppose that I make the FS twice as slow as before by meddling with
> it to make it sharable
The big question is, of course: Can you do that? Can you make a fs
shareable the way you want it with only a 2-times slowdown? That'd
be interesting to see.

>
> then I simply share it among 4 nodes to get a two times _speed up_
> overall.
Cool if it works, but then the next question is if you can make it
scaleable like that. Will it really be 4x as fast with 4 nodes?
Maybe. But it won't scale like that with more and more nodes,
that you can be sure of. At some point you max out the disk,
or the network connections, or the processing capacity of the
node controlling the shared device. After that it doesn't
get any faster with more nodes.

>
> That's the basic idea. Details left to reader.
No thanks. It is the details that are the hard part here.

> I.e. I don't care if it gets slower. We are talking thousands of nodes
> here. Only the detail of the topology is affected by the real numbers.

Try, but thousands of nodes sharing one or more devices isn't
easy to get right. People struggle with clusters much smaller than
that.

Helge Hafting

2002-09-06 13:18:44

by Rik van Riel

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Fri, 6 Sep 2002, Peter T. Breuer wrote:

> suppose that I make the FS twice as slow as before by meddling with
> it to make it sharable
>
> then I simply share it among 4 nodes to get a two times _speed up_
> overall.
>
> That's the basic idea. Details left to reader.

The "detail" is lock contention. If you lock the filesystem
and invalidate the caches you'll be able to do one operation
every disk seek, across all nodes.

Chances are your 4-node system would have lower aggregate
throughput than a single node system.

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-06 13:49:11

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"A month of sundays ago Rik van Riel wrote:"
> On Fri, 6 Sep 2002, Peter T. Breuer wrote:
>
> > suppose that I make the FS twice as slow as before by meddling with
> > it to make it sharable
> >
> > then I simply share it among 4 nodes to get a two times _speed up_
> > overall.
> >
> > That's the basic idea. Details left to reader.
>
> The "detail" is lock contention. If you lock the filesystem

Assume that I don't make infantile mistakes and your arguments will look
more convincing too </rant ;->

> and invalidate the caches you'll be able to do one operation
> every disk seek, across all nodes.

EVEN if you do that, you will still get a two times speed up from the
naive sketch above. Your argument rests on the suggestion that there is
contention. But there most likely isn't! The problem that having
shared access seeks to solve in the first place is _bandwidth_. One cpu
writing to one disk can only go so fast, because it takes time to
process its inputs, time to do the (massively parallel) analysis, and
time to write its outputs. There is lots of space in its execution
cycle (while the analysis happens) for other cpus to read or write to
the same disk. And even better if we can get them all working on
different parts of the same data at a time (well, we can ..).

Let me tell you that the cpus are essentially doing "ray tracing"
for thousands of jets at a time, and that there are a good number
of originators of these jets every second .. enough per second
that we must analyse and store (or discard) the data in real time.
Yes, there are common data that we want stored once and read many
times, and then there is secondary data that we want stored once and
read once.

And we need the data to be duplicated and triplicated too. It cannot be
lost. Analysis is needed to determine if the full array is turned loose
on the data (i.e. discard or not). Then more analysis follows.

BTW, network bandwidth itself and the computation cycles taken by the
cpus to handle the networking are also considerations. We need lots of
options.

> Chances are your 4-node system would have lower aggregate
> throughput than a single node system.

IF, and I say IF, it turned out to look that way, we would look for
finer locking instead of abandoning the idea, because the raw arithmetic
at the top of this post is so powerful.

That would also remove your argument about contention, but then
we are into the details of the process, and "it depends".

All I am asking is that you make the evaluation of such strategies
easier by allowing me to - or giving me clues that help me to -
do rough implementations from which a further evaluation can proceed.
And to that end.

At the moment I would like a clue for how to vamoosh the dcache for
a given FS. Can't I set a flag on every dentry that means "kill me
now", so that d_alloc's never are more than trivial. Or am
I forced to take the root dentry, and dput it after having first
walked through its children and dputted them? (gawd, I hate that
kind of inplace nonrecursive tree-walking).

Pointer? Clue? I'm happy to take all contributions in the clue
department :-).

Peter

2002-09-06 14:06:06

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"A month of sundays ago Lars Marowsky-Bree wrote:"
> On 2002-09-06T11:17:05,
> "Peter T. Breuer" <[email protected]> said:
>
> Peter, you seem to be ignoring the long-ish mail I sent to you and l-k on Wed,
> 4 Sep 2002 11:26:45 +0200, trying to figure out why you are trying what you
> are saying. Could you please enlighten us?

I'm currently travelling. I won't be back in my office till late
sunday or monday. It's likely that I have missed mails and I apologise.
I'll do my best to track them down on monday. I certainly have been
unable to reply to all mails - for one thing, not all were cc'ed to me,
which meant that I had to trawl lk, and for another, well, I likely
just missed em. I get hundreds of mails a day. Do send it to me again.

Peter

2002-09-06 14:20:45

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"A month of sundays ago Anton Altaparmakov wrote:"
> The procedure described is _identical_ if you want to access 1TiB at a
> time, because the request is broken down by the VFS into 512 byte size

I think you said "1byte". And aren't these 4096B, or whatever the
blocksize is?

> units and I think I explained that, too. And for _each_ 512 byte sized unit
> of those 1TiB you would have to repeat the _whole_ of the described

Why? Doesn't the inode usually point at a contiguous lump of many
blocks? Are you perhaps looking at the worst case, not the usual case?

> procedure. So just replace 1 byte with 512 bytes in my post and then repeat
> the procedure as many times as it takes to make up the 1TiB. Surely you
> should know this... just fundamental Linux kernel VFS operation.

Well, you seem to have improved the situation by a factor of 512 in
just a few lines. Perhaps you can improve it again ...?

> It is not clear to me he understands the concept at all. He thinks that you

Well, "let it be clear to you".

> need to read the disk inode just once and then you immediately read all the

No, I think that's likely. I don't "think it". But yes, I assume that
in the case of a normal file it is quite likely that all the info
involved is in the inode, and we don't need to go hunting elsewhere.

Wrong?

> 1TiB of data and somehow all this magic is done by the VFS. This is

No, the fs reads. But yes, the inode is "looked up once" on average, I
believe, and if it says there's 1TB of data on disk here, here and here,
then I am going to tell you that I think it's locked in the fs while we
go look up the data on disk in the fs.

What I am not clear about now is exactly when the data is looked up - I
get the impression from what I have seen of the code that the VFS passes
down a complete request for 1TB, and that the FS then goes and locks
the inode and chases down the data. What you are saying above gives me
the impression that things are broken down into 512B lumps FIRST in or
above VFS, and then sent to the fs as individual requests with no inode
or fs locking. And I simply don't buy that as a strategy.

> complete crap and is _NOT_ how the Linux kernel works. The VFS breaks every
> request into super block->s_blocksize sized units and _each_ and _every_

Well, that's 4096 (or 1024), not 512.

> request is _individually_ looked up as to where it is on disk.

Then that's crazy, but, shrug, if you want to do things that way, it
means that locking the inode becomes even more important, so that you
can cache it safely for a while. I'm quite happy with that. Just tell
me about it .. after all, I want to issue an appropriate "lock"
instruction from vfs before every vfs op. I would also like to remove
the dcache after every vfs opn, as we unlock. I'm asking for insight
into doing this ...

> Each of those lookups requires a lot of seeks, reads, memory allocations,
> etc. Just _read_ my post...

No they don't. You don't seem to realize that the remote disk server is
doing all that and has the data cached. What we, the client kernels
get, is latency in accessing that info, and not even that if we lock,
cache, and unlock.

> Please, I really wouldn't have expected someone like you to come up with a
> statement like that... You really should know better...

If you know the person involved then try and entertain the idea that
they're not crazy, and suspect that other people are not crazy (or
illogical) either!

Peter

2002-09-06 15:28:20

by Rik van Riel

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Fri, 6 Sep 2002, Peter T. Breuer wrote:

> > Chances are your 4-node system would have lower aggregate
> > throughput than a single node system.
>
> IF, and I say IF, it turned out to look that way, we would look for
> finer locking instead of abandoning the idea, because the raw arithmetic
> at the top of this post is so powerful.

Finer grained locking would mean doing IO in smaller chunks
and caching less data, which would further decrease the
transfer time vs. seek time ratio.

If you want an efficient clustered filesystem, I think you
should use a filesystem that's designed for clustering.

Adding an inefficient kludge on top of ext2 just isn't
going to cut it because of your inability to cache metadata
and the extremely high cost of disk seeks.

The bottleneck isn't bandwidth, it's the latency of a
disk seek. A modern hard disk can transfer maybe 60
MB per second when doing linear streaming, but finding
one particular place on the disk still takes 10 ms.

So in order to get 50% bandwidth efficiency from your
disk, you'll need to do IO in chunks of 600 kB. Not
caching metadata and having to read directory, inode,
indirect block and double indirect block for each piece
of data you read is guaranteed to cut your performance
by a factor of 10, or more.
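Rik's break-even arithmetic, spelled out with his figures (60 MB/s streaming, 10 ms per seek); the five-seeks-per-chunk count in the second case is an assumption for illustration:

```python
# Worked version of the seek-vs-bandwidth arithmetic above.
BANDWIDTH = 60e6   # bytes per second while streaming linearly
SEEK_TIME = 10e-3  # seconds per seek

def efficiency(chunk_bytes, seeks_per_chunk=1):
    """Fraction of time spent transferring data when every chunk of
    `chunk_bytes` costs `seeks_per_chunk` seeks."""
    transfer = chunk_bytes / BANDWIDTH
    return transfer / (transfer + seeks_per_chunk * SEEK_TIME)

# One seek per 600 kB chunk: transfer time equals seek time, so 50%.
print(round(efficiency(600e3), 2))                     # 0.5
# If uncached metadata costs, say, 4 extra seeks per chunk (directory,
# inode, indirect, double indirect -- the count is an assumption),
# efficiency drops to ~17%.
print(round(efficiency(600e3, seeks_per_chunk=5), 2))  # 0.17
```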

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]


2002-09-06 17:15:44

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Fri, 6 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago Anton Altaparmakov wrote:"
> > The procedure described is _identical_ if you want to access 1TiB at a
> > time, because the request is broken down by the VFS into 512 byte size
>
> I think you said "1byte". And aren't these 4096B, or whatever the
> blocksize is?

I said 1 byte but that is the same as if you did 512 bytes. No difference
whatsoever that is relevant to the discussion. And 1 terabyte, or bigger
or smaller, doesn't matter, as it is still the exact same operation
repeated lots and lots of times till you have covered the whole terabyte
or whatever in blocksize byte units (on NTFS always 512 bytes; technically
not any more, as we now always work in units of PAGE_CACHE_SIZE when
talking to the VFS for read/write operations, but more on that later).

The blocksize is dependent on file system driver _and_ on the options used
at format time. For example ext2 supports different block sizes, starting
at 1kiB (I think) and going up to the PAGE_SIZE of the machine, i.e. 4kiB
on ia32. For NTFS the cluster size is between 512 bytes and <place big
power of 2 number here, easily 64kiB but even 512kiB or larger possible>
but because the kernel cannot work on blocksize > PAGE_SIZE, I use a
translation where the kernel blocksize for NTFS is always 512 bytes and
the driver translates between kernel blocks (512 bytes each) and NTFS
clusters (512 bytes to 512 or more kilo bytes each).
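The translation just described reduces to shift/multiply arithmetic. A sketch with illustrative values (this is not the NTFS driver's code, just the mapping it implies):

```python
# Sketch of the NTFS cluster -> 512-byte kernel block translation
# described above. Constants and names are illustrative.
KERNEL_BLOCK_SIZE = 512

def clusters_to_blocks(cluster_no, cluster_size):
    """Map an NTFS cluster number to the first 512-byte kernel block
    it occupies, given the volume's cluster size in bytes."""
    assert cluster_size % KERNEL_BLOCK_SIZE == 0
    blocks_per_cluster = cluster_size // KERNEL_BLOCK_SIZE
    return cluster_no * blocks_per_cluster

# On a 4 kiB-cluster volume, cluster 10 starts at kernel block 80.
print(clusters_to_blocks(10, 4096))   # 80
# On a 64 kiB-cluster volume, cluster 10 starts at kernel block 1280.
print(clusters_to_blocks(10, 65536))  # 1280
```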

> > units and I think I explained that, too. And for _each_ 512 byte sized unit
> > of those 1TiB you would have to repeat the _whole_ of the described
>
> Why? Doesn't the inode usually point at a contiguous lump of many
> blocks? Are you perhaps looking at the worst case, not the usual case?

Because the VFS does the split up and it has no idea whatsoever whether
something is contiguous or not. Only the FS itself knows that information.
So the VFS has to ask for each bit individually assuming the worst case
scenario. And the FS itself has no idea what the VFS is doing and hence
has to process each request completely independently from the others. This
is how the VFS is implemented at the present time.

As of very recently, Andrew Morton introduced an optimization to this with
the get_blocks() interface in 2.5 kernels. Now the file system, when doing
direct_IO at least, returns to the VFS the requested block position _and_
the size of the block. So the VFS now gains in power in that it only needs
to ask for each block once as it is now aware of the size of the block.

But still, even with this optimization, the VFS still asks the FS for each
block, and then the FS has to look up each block.
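The gain from returning an extent size can be seen with a toy model (not the kernel's get_blocks() API; the two-extent "filesystem" here is invented for illustration):

```python
# Toy model of get_block() one-block-at-a-time vs. a get_blocks()-style
# interface that also returns how many blocks remain in the extent.
# The "filesystem" maps a 100-block file as two extents.
extents = [(0, 100, 64), (64, 900, 36)]  # (file_block, disk_block, length)

lookups = 0

def get_blocks(file_block):
    """Return (disk_block, blocks_remaining_in_extent) for file_block."""
    global lookups
    lookups += 1
    for start, disk, length in extents:
        if start <= file_block < start + length:
            off = file_block - start
            return disk + off, length - off
    raise IOError("hole")

def map_file(nblocks, extent_aware):
    """Count lookups needed to map the first `nblocks` of the file."""
    global lookups
    lookups = 0
    b = 0
    while b < nblocks:
        disk, run = get_blocks(b)
        b += run if extent_aware else 1  # old interface: 1 block/call
    return lookups

print(map_file(100, extent_aware=False))  # 100 lookups
print(map_file(100, extent_aware=True))   # 2 lookups
```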

Obviously if only one file was written, and the disk is not fragmented,
the file will be basically contiguous on disk. But if you wrote two
or more files simultaneously they would be very likely interleaved, i.e.
fragmented, i.e. each block would be small. - How an FS allocates space is
entirely FS driver dependent so it is impossible to say how well the FS
driver copes with simultaneous writes...

> > procedure. So just replace 1 byte with 512 bytes in my post and then repeat
> > the procedure as many times as it takes to make up the 1TiB. Surely you
> > should know this... just fundamental Linux kernel VFS operation.
>
> Well, you seem to have improved the situation by a factor of 512 in
> just a few lines. Perhaps you can improve it again ...?
>
> > It is not clear to me he understands the concept at all. He thinks that you
>
> Well, "let it be clear to you".
>
> > need to read the disk inode just once and then you immediately read all the
>
> No, I think that's likely. I don't "think it". But yes, i assume that
> in the case of a normal file it is quite likely that all the info
> involved is in the inode, and we don't need to go hunting elsewhere.
>
> Wrong?

In NTFS wrong. You have the base inode which contains all metadata (i.e.
modification times, security information, the conversion table from file
offsets to disk locations, ...). However, as data is added to the file,
the conversion table for example grows and eventually the inode runs out
of space. (In NTFS inodes are fixed size records of 1024 bytes.) When that
happens, the NTFS driver either moves the conversion table completely to a
2nd inode (we call this inode an extent inode because it is _only_
referenced by the original inode itself) or it splits the conversion table
into two pieces and saves the 2nd part to a 2nd (extent) inode. The 2nd
(extent) inode can be anywhere in the inode table, wherever the allocator
finds a free one. When the 2nd inode fills up, the process is repeated, so
we now have a 3rd (extent) inode, and so on, ad infinitum. So NTFS has
pretty much no limit on how many extent inodes can be attached to a base
inode. (Note: the original base inode contains a table describing which
metadata is stored in which extent inode. This table is not necessarily
stored in the inode itself. It can be stored anywhere on disk, and the
base inode then contains a conversion array between the offset in the
table and the physical disk block. - This just to give you the full
picture.)

So, in summary, if you have a small file _or_ the file is big
but completely or almost completely unfragmented, then you are correct and
one only needs the original inode. But as soon as you get a little
fragmentation, you start needing extent inodes.

> > 1TiB of data and somehow all this magic is done by the VFS. This is
>
> No, the fs reads. But yes, the inode is "looked up once" on average, I
> believe, and if it says there's 1TB of data on disk here, here and here,
> then I am going to tell you that I think it's locked in the fs while we
> go look up the data on disk in the fs.
>
> What I am not clear about now is exactly when the data is looked up - I
> get the impression from what I have seen of the code that the VFS passes
> down a complete request for 1TB, and that the FS then goes and locks
> the inode and chases down the data. What you are saying above gives me
> the impression that things are broken down into 512B lumps FIRST in or
> above VFS, and then sent to the fs as individual requests with no inode
> or fs locking. And I simply don't buy that as a strategy.

But that is how the VFS works! The VFS _does_ break up everything into
512B lumps FIRST. Let me show you where this happens (assume current 2.5
kernel):

VFS: on file read we eventually reach:

mm/filemap.c::generic_file_read() which calls
mm/filemap.c::do_generic_file_read() which breaks the request into units
of PAGE_CACHE_SIZE, i.e. into individual pages and calls the file system's
address space ->readpage() method for each of those pages.

Assuming the file system uses the generic readpage implementation, i.e.
fs/buffer.c::block_read_full_page(), this in turn breaks the page into
blocksize blocks (for NTFS 512 bytes, remember?) and calls the FS
supplied get_block() callback, once for each blocksize block (in the form
of a struct buffer_head).

This finds where this 512 byte block is, and maps the buffer head (i.e.
sets up the buffer_head to point to the correct on disk location) and
returns it to the VFS, i.e. to block_read_full_page() which then goes and
calls get_block() for the next blocksize block, etc...
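The two-level split just described is pure counting, and can be sketched as such (a model of the call chain, not the kernel functions themselves):

```python
# Toy model of the chain above: a read is split into PAGE_CACHE_SIZE
# pages, each page into blocksize blocks, and the FS's get_block() is
# called once per block. No real I/O, just counting the calls.
PAGE_CACHE_SIZE = 4096

def do_generic_file_read(nbytes, blocksize):
    """Return (readpage_calls, get_block_calls) for a read of nbytes."""
    readpage_calls = 0
    get_block_calls = 0
    npages = -(-nbytes // PAGE_CACHE_SIZE)   # ceiling division
    for _ in range(npages):                  # ->readpage() per page
        readpage_calls += 1
        # block_read_full_page() calls get_block() per blocksize block
        get_block_calls += PAGE_CACHE_SIZE // blocksize
    return readpage_calls, get_block_calls

# Reading 1 MiB with NTFS's fixed 512-byte blocksize: 256 pages,
# 8 get_block() calls per page = 2048 lookups.
print(do_generic_file_read(1 << 20, 512))   # (256, 2048)
```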

So this is how file reads work. For NTFS, I figured this is too
inefficient (and because of other complications in NTFS, too) and I don't
use block_read_full_page() any more. Instead, I copied
block_read_full_page() and get_block() and rewrote them for NTFS so that
they became one single function, which takes the page and maps all the
buffers in it in one go, keeping the conversion table from file offsets to
disk blocks locked (rw_semaphore) during that period. That is way more
efficient as it means I don't need to do a lookup in the conversion table
for every 512 byte block any more and instead I perform the lookup for all
blocks in the current page in one go.

Note, Andrew Morton actually optimized this, too, in recent 2.5 kernels,
where there is now a ->readpages() address space method which optimizes it
even further. But that involves using BIOs instead of buffer heads and I
haven't really looked into it much yet. (I want to keep NTFS as close to
2.4 compatibility as possible for the time being.)

> > complete crap and is _NOT_ how the Linux kernel works. The VFS breaks every
> > request into super block->s_blocksize sized units and _each_ and _every_
>
> Well, that's 4096 (or 1024), not 512.

512 if the FS says so. The blocksize on NTFS is 512 bytes, fixed.
Otherwise the code would have become too complicated...

> > request is _individually_ looked up as to where it is on disk.
>
> Then that's crazy, but, shrug, if you want to do things that way, it

I agree. Which is why I don't use block_read_full_page() in NTFS any
more...

But you were talking about modifying the VFS, so I am telling you how the
VFS works. An important issue to remember is that FS drivers don't have to
use all the VFS helpers provided; they can just go off and implement their
own thing instead. But at least for file systems using the VFS helpers, a
modification of the VFS means changing every single FS driver using it,
which is a lot of work...

> means that locking the inode becomes even more important, so that you
> can cache it safely for a while. I'm quite happy with that. Just tell
> me about it .. after all, I want to issue an appropriate "lock"
> instruction from vfs before every vfs op. I would also like to remove
> the dcache after every vfs opn, as we unlock. I'm asking for insight
> into doing this ...
>
> > Each of those lookups requires a lot of seeks, reads, memory allocations,
> > etc. Just _read_ my post...
>
> No they don't. You don't seem to realize that the remote disk server is
> doing all that and has the data cached. What we, the client kernels
> get, is latency in accessing that info, and not even that if we lock,
> cache, and unlock.

I thought that your nodes are the disk servers at the same time, i.e. each
node has several hds and all the hds together make up the "virtual" huge
disk. Are you talking about a server with a single HD then? You
didn't define your idea precisely enough... It now makes no sense to me at
all with your above comment...

Could you describe the setup you envision exactly? From the above it seems
to me that you are talking about this:

- a pool of servers sharing the same harddisk, all the servers attached
to the HD via SCSI or some similar interface. - Or are you saying that
there is a "master node" which has the hd attached and then all the other
servers talk to the master node to get access to the disk?

- all the servers mount the HD with some FS and they somehow manage to get
this right without breaking anything. Presumably all the servers are
interconnected directly via some super fast super low latency interface? -
Or do they all talk to the master node?

- the FS mounted on the servers is exported via NFS (or other networking
FS) to the workstations which then read/write files on the NFS shares.

Is this correct? If not what is it that you _really_ have in mind. It
would help a lot to know what you are _really_ trying to do otherwise it
is difficult to talk about it... Your turn to explain your idea better
now. (-;

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-06 17:28:31

by Daniel Phillips

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

On Friday 06 September 2002 19:20, Anton Altaparmakov wrote:
> As of very recently, Andrew Morton introduced an optimization to this with
> the get_blocks() interface in 2.5 kernels. Now the file system, when doing
> direct_IO at least, returns to the VFS the requested block position _and_
> the size of the block. So the VFS now gains in power in that it only needs
> to ask for each block once as it is now aware of the size of the block.
>
> But still, even with this optimization, the VFS still asks the FS for each
> block, and then the FS has to lookup each block.

Well, it takes no great imagination to see the progression: get_blocks
acts on extents instead of arrays of blocks. Expect to see that around
the 2.7 timeframe.

--
Daniel

2002-09-06 18:25:10

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Anton Altaparmakov wrote:"
> On Fri, 6 Sep 2002, Peter T. Breuer wrote:
> > Why? Doesn't the inode usually point at a contiguous lump of many
> > blocks? Are you perhaps looking at the worst case, not the usual case?
>
> Because the VFS does the split up and it has no idea whatsoever whether
> something is contiguous or not. Only the FS itself knows that information.
> So the VFS has to ask for each bit individually assuming the worst case

Well, it doesn't /have to/, but it does it, if this is a realistic
account (personally, if I were designing it, I'd have the VFS ask for what
I want at a high level, and let the FS look after how it wants to do
it - but it is a valid approach, I mean, at least this way the VFS has
some nontrivial work that it can always contribute).

> scenario. And the FS itself has no idea what the VFS is doing and hence
> has to process each request completely independently from the others. This

That doesn't matter in a large sense, though it matters the way things
are. As I said - get the VFS to look after the locking requests too ..
the problem here is really that what I asked for at the beginning is
precisely lacking - a call back from the FS to the VFS to ask it to lock
the inode against anyone else for a while. Or .. hmm. Well, locking
the whole FS for the duration of the request sequence is the only other
reasonable approach, without knowledge of where the inode is on disk.

And that's not very nice.

> is how the VFS is implemented at present time.
>
> As of very recently, Andrew Morton introduced an optimization to this with
> the get_blocks() interface in 2.5 kernels. Now the file system, when doing
> direct_IO at least, returns to the VFS the requested block position _and_
> the size of the block. So the VFS now gains in power in that it only needs
> to ask for each block once as it is now aware of the size of the block.

I'm not sure I follow. The VFS already split the request up according
to your account above, so it doesn't care. I think you must be saying
that the VFS now DOES ask for more than 512B at a time? Can you
clarify? I imagine that the FS replies with how much of the i/o it did
this time, and thus the VFS knows if it should call again.

> But still, even with this optimization, the VFS still asks the FS for each
> block, and then the FS has to lookup each block.

Now I really don't understand you -- what's the point of the FS's
replying with the size of the block? Are you saying that it can reply
with a _larger_ size than the VFS asked for, sort of a "you can be
braver about this" invitation, so the VFS can ask for a bigger lump
next time?

> Obviously if only one file was written, and the disk is not fragmented,
> the file will be basically contiguous on disk. But if you wrote two
> or more files simultaneously they would be very likely interleaved, i.e.

That's OK. Imagine that most of the files were laid down beforehand,
and now we are simply rewriting over them.

> fragmented, i.e. each block would be small. - How an FS allocates space is
> entirely FS driver dependent so it is impossible to say how well the FS
> driver copes with simultaneous writes...

Doesn't matter. We can arrange that beforehand.

> > > need to read the disk inode just once and then you immediately read all the
> > No, I think that's likely. I don't "think it". But yes, i assume that
> > in the case of a normal file it is quite likely that all the info
> > involved is in the inode, and we don't need to go hunting elsewhere.
> >
> > Wrong?
>
> In NTFS wrong. You have the base inode which contains all metadata (i.e.
> modification times, security information, the conversion table from file
> offsets to disk locations, ...). However, as data is added to the file,
> the conversion table for example grows and eventually the inode runs out
> of space. (In NTFS inodes are fixed size records of 1024 bytes.) When that

Well, that I can imagine. Let's /not/ consider NTFS, then. I'd rather
consider something that hasn't taken about 5 years to reverse engineer
in the first place! And NTFS is NOT going to be a candidate, because
of reliability. Does this happen on unix file systems too ..

> happens, the NTFS driver either moves the conversion table completely to a
> 2nd inode (we call this inode an extent inode because it is _only_
> referenced by the original inode itself) or it splits the conversion table
> into two pieces and saves the 2nd part to a 2nd (extent) inode. The 2nd
> (extent) inode can be anywhere in the inode table, wherever the allocator
> finds a free one. When the 2nd inode fills up, the process is repeated, so
> we now have a 3rd (extent) inode, and so on, ad infinitum. So NTFS has

Well, something like that I distantly recall as happening on unix fs's,
but I'd like to know if we expect a file of fixed size which is being
overwritten without O_TRUNC to have any metadata changes apart from in
its inode (and there trivially) ...?

> So, in summary, if you have a small file _or_ the file is big
> but completely or almost completely unfragmented, then you are correct and
> one only needs the original inode. But as soon as you get a little
> fragmentation and you start needing extent inodes.

I thank you for that information about NTFS.

> > What I am not clear about now is exactly when the data is looked up - I
> > get the impression from what I have seen of the code that the VFS passes
> > down a complete request for 1TB, and that the the FS then goes and locks
> > the inode and chases down the data. What you are saying above gives me
> > the impression that things are broken down into 512B lumps FIRST in or
> > above VFS, and then sent to the fs as individual requests with no inode
> > or fs locking. And I simply don't buy that as a strategy.
>
> But that is how the VFS works! The VFS _does_ break up everything into
> 512B lumps FIRST. Let me show you where this happens (assume current 2.5

OK. Why 512? Why not the blocksize?

> kernel):
>
> VFS: on file read we eventually reach:
>
> mm/filemap.c::generic_file_read() which calls
> mm/filemap.c::do_generic_file_read() which breaks the request into units
> of PAGE_CACHE_SIZE, i.e. into individual pages and calls the file system's

Well, we're at 4096 now (or is it 8192 :-).

> address space ->readpage() method for each of those pages.
>
> Assuming the file system uses the generic readpage implementation, i.e.
> fs/buffer.c::block_read_full_page(), this in turn breaks the page into
> blocksize blocks (for NTFS 512 bytes, remember?) and calls the FS

But not for any other FS. I didn't even know 512 was a possible kernel
blocksize. In fact, I'm fairly sure it wasn't in the recent past. 1024
was the minimum, no? And I'll choose 4096, thanks!

> supplied get_block() callback, once for each blocksize block (in the form
> of a struct buffer_head).

OK. But this will stick at 4096 blocks for other FSs.

I'm very grateful for this explanation BTW. gfr -> do_gfr -> breakup
-> readpage. I should be able to trace it from there. I'll expect that
the pagesize is the blocksize (or close), for optimum performance at
this point.
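
To fix ideas, here is a tiny userspace model of that gfr -> do_gfr ->
readpage -> get_block chain. The function names are the 2.5 kernel's; the
bodies are just counters standing in for the real work, so this only shows
the splitting arithmetic, not the actual I/O:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_CACHE_SIZE 4096

/* Counters standing in for real work. */
static int readpage_calls, get_block_calls;

/* Model of the FS-supplied get_block(): one lookup per blocksize block. */
static void get_block(size_t block_nr) { (void)block_nr; get_block_calls++; }

/* Model of fs/buffer.c::block_read_full_page(): split one page into
 * blocksize blocks and look each one up individually. */
static void block_read_full_page(size_t page_nr, size_t blocksize)
{
    size_t blocks = PAGE_CACHE_SIZE / blocksize;
    for (size_t i = 0; i < blocks; i++)
        get_block(page_nr * blocks + i);
}

/* Model of mm/filemap.c::do_generic_file_read(): split the request into
 * PAGE_CACHE_SIZE units and call ->readpage() for each page. */
static void do_generic_file_read(size_t bytes, size_t blocksize)
{
    size_t pages = (bytes + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE;
    for (size_t p = 0; p < pages; p++) {
        readpage_calls++;
        block_read_full_page(p, blocksize);
    }
}
```

So a 400K read with a 512-byte blocksize costs 100 readpage calls and 800
get_block lookups in this model; with blocksize 4096 it would be 100 lookups.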

> This finds where this 512 byte block is, and maps the buffer head (i.e.

THAT's the inode lookup. OK, that's once per block, for the scenario
I'm imagining. So .. for reads the situation is fine. We can have a
read/write lock on all FS metadata, and multiple readers will be
allowed at once. The problem is that as soon as anyone writes metadata,
then no reads can be allowed anywhere, unless we know where the inode
is.

So that's the perfect intercept. The get_block callback needs to return
an extra argument which is the physical block location of the inode
and I can lock that. That's perfect.

No? Well, maybe we might have to go through several inodes, which all
need to be locked, so we need to return an array.

> sets up the buffer_head to point to the correct on disk location) and
> returns it to the VFS, i.e. to block_read_full_page() which then goes and
> calls get_block() for the next blocksize block, etc...

Yes, this looks good.

> So this is how file reads work. For NTFS, I figured this is too
> inefficient (and because of other complications in NTFS, too) and I don't

Well, at 512B a time, it looks so!

> use block_read_full_page() any more. Instead, I copied
> block_read_full_page() and get_block() and rewrote them for NTFS so that
> they became one single function, which takes the page and maps all the
> buffers in it in one go, keeping the conversion table from file offsets to
> disk blocks locked (rw_semaphore) during that period. That is way more
> efficient as it means I don't need to do a lookup in the conversion table
> for every 512 byte block any more and instead I perform the lookup for all
> blocks in the current page in one go.

yes. Well, I didn't follow that detail. I'm still thinking about your
general statements, which look good to me. Unfortunately, I am betting
that at the moment the FS get_block returns _after_ getting the first
block, which is too late to return the inode position, if I try and do
that. This is awkward. We need to do the i/o after returning the inode
block(s) in order that the VFS can lock it/them.

Suggestion (start-io, end-io)?

> Note, Andrew Morton actually optimized this, too, in recent 2.5 kernels,
> where there is now a ->readpages() address space method which optimizes it
> even further. But that involves using BIOs instead of buffer heads and I
> haven't really looked into it much yet. (I want to keep NTFS as close to
> 2.4 compatibility as possible for the time being.)

Hmm.

>
> 512 if the FS says so. blocksize on NTFS is 512 bytes. Fixed. Otherwise

:-). It won't say so, I somehow think!

> the code would have become too complicated...

OK.

> > > request is _individually_ looked up as to where it is on disk.
> >
> > Then that's crazy, but, shrug, if you want to do things that way, it
>
> I agree. Which is why I don't use block_read_full_page() in NTFS any
> more...
>
> But you were talking modifying the VFS so I am telling you how the VFS
> works. An important issue to remember is that FS drivers don't have to use

Yes, that's very very very helpful, thanks.

> all the VFS helpers provided, they can just go off and implement their own

Well, they have to respond when asked.

> thing instead. But at least for file systems using the VFS helpers a
> modification of the VFS means changing every single FS driver using this

Only modify those that I want to use. The difficulty is in identifying a
strategy and a point at which a change can be made.

> > means that locking the inode becomes even more important, so that you
> > can cache it safely for a while. I'm quite happy with that. Just tell
> > me about it .. after all, I want to issue an appropriate "lock"

Thanks for telling me! I appreciate it.

> > instruction from vfs before every vfs op. I would also like to remove
> > the dcache after every vfs opn, as we unlock. I'm asking for insight
> > into doing this ...
> >
> > > Each of those lookups requires a lot of seeks, reads, memory allocations,
> > > etc. Just _read_ my post...
> >
> > No they don't. You don't seem to realize that the remote disk server is
> > doing all that and has the data cached. What we, the client kernels
> > get, is latency in accessing that info, and not even that if we lock,
> > cache, and unlock.
>
> I thought that your nodes are the disk servers at the same time, i.e. each

They generally are servers for some resources and clients for others.
I don't know at the moment if they will be clients for their "own"
resources, but even if they were, then I think it will be arranged that
they do block-level caching. I am not sure yet.

> node has several hds and all the hds together make up the "virtual" huge
> disk. Are you talking about a server with a single HD then? You

The standard configuration will likely be 1 disk per node, with
each disk being served to 3 other machines. The local node will partake of
three other remote disks.

Too early to say for sure.

> didn't define your idea precisely enough... It now makes no sense to me at
> all with your above comment...
>
> Could you describe the setup you envision exactly? From the above it seems
> to me that you are talking about this:
>
> - a pool of servers sharing the same harddisk, all the servers attached

No, not a pool. Each machine will be the same setup. But it will serve
for some resource(s) and be a client for others. It may be a client for
its own resource, but that is not sure yet.

> to the HD via SCSI or some simillar interface. - Or are you saying that
> there is a "master node" which has the hd attached and then all the other
> servers talk to the master node to get access to the disk?

Each is master for some and slave for others.

> - all the servers mount the HD with some FS and they somehow manage to get
> this right without breaking anything. Presumably all the servers are
> interconnected directly via some super fast super low latency interface? -

The topology is a grid. That is, neither star nor complete.

> Or do they all talk to the master node?
>
> - the FS mounted on the servers is exported via NFS (or other networking
> FS) to the workstations which then read/write files on the NFS shares.

NFS is not the most likely candidate because of its known weaknesses, but
it is a possibility.

> is difficult to talk about it... Your turn to explain your idea better
> now. (-;

Well, it's hard to go into details because what I am asking for is clues
to enable the idea to be prototyped and evaluated. There will be a grid
topology. Each machine will locally be connected to several others. The
whole grid will form a ring (or a torus, if you see that I am talking
about a 2-D planar local connectivity graph, so that I must be talking
about an orientable surface ...). Each machine will have at least one
disk of its own, and share it with three other machines. There is
planned redundancy here (i.e. raid as well).

Peter

2002-09-06 19:25:38

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

At 19:29 06/09/02, Peter T. Breuer wrote:
>"Anton Altaparmakov wrote:"
> > is how the VFS is implemented at present time.
> >
> > As of very recently, Andrew Morton introduced an optimization to this with
> > the get_blocks() interface in 2.5 kernels. Now the file system, when doing
> > direct_IO at least, returns to the VFS the requested block position _and_
> > the size of the block. So the VFS now gains in power in that it only needs
> > to ask for each block once as it is now aware of the size of the block.
>
>I'm not sure I follow. The VFS already split the request up according
>to your account above, so it doesn't care. I think you must be saying
>that the VFS now DOES ask for more than 512B at a time?

No.

>Can you clarify? I imagine that the FS replies with how much of the i/o it
>did this time, and thus the VFS knows if it should call again.

You are mixing two things up: there is the lookup layer level and there is
the i/o level the two are separate! First all blocks in the current page
are looked up, then the i/o is done. Not simultaneously. Does that clarify
it? The FS doesn't reply with how much i/o it did because it didn't do any.
It only tells the VFS where the block is on disk, nothing more, then the
VFS does the i/o the FS doesn't.

Note that this only happens, _if_ the FS uses the generic helpers in
fs/buffer.c, like block_read_full_page(). This calls the FS to do the
lookups and then sets up the i/o for all the blocks in the current page at
once.

If the FS doesn't use those and implements their own readpage() method then
it is free to do whatever it wants. Just like NTFS does now, i.e. NTFS now
does all the lookups and then the i/o entirely without going anywhere near
VFS code. NTFS directly talks to the block layer.

> > But still, even with this optimization, the VFS still asks the FS for each
> > block, and then the FS has to lookup each block.
>
>Now I really don't understand you -- what's the point of the FS's
>replying with the size of the block? Are you saying that it can reply
>with a _larger_ size than the VFS asked for,

The VFS says, please give me 1 block at file offset X. The FS replies: the
block at offset X is located on disk block Y and has size Z. The VFS then
goes and reads up to Z bytes from that block on disk and either asks for
the location of offset X+Z or terminates (no more data to transfer).

>sort of a "you can be braver about this" invitation, so the VFS can ask
>for a bigger lump next time?

No. See above. The VFS doesn't ask for any size, it just asks where is this
block and the FS says it is here and this is its size.
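
A sketch of that ask/answer loop, with a made-up extent map standing in for
the FS's conversion table. Only the contract is taken from the above: the
lookup returns a disk location plus a size and does no i/o itself; the
offsets and disk block numbers here are invented, and everything is
byte-addressed for simplicity:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent map: file offset -> (disk location, run length). */
struct extent { size_t file_ofs, disk_ofs, len; };

static const struct extent map[] = {
    { 0,    1000 * 4096, 4096 },   /* first 4 KiB, contiguous */
    { 4096, 5000 * 4096, 8192 },   /* next 8 KiB, elsewhere on disk */
};

/* Model of a get_blocks()-style lookup: the FS only says where the data
 * at offset X lives and how big the contiguous run is; no i/o is done. */
static int get_blocks(size_t ofs, size_t *disk, size_t *len)
{
    for (size_t i = 0; i < sizeof map / sizeof *map; i++)
        if (ofs >= map[i].file_ofs && ofs < map[i].file_ofs + map[i].len) {
            *disk = map[i].disk_ofs + (ofs - map[i].file_ofs);
            *len  = map[i].file_ofs + map[i].len - ofs;
            return 0;
        }
    return -1;
}

/* The VFS side: ask where offset X is, read up to Z bytes itself, then
 * ask for offset X+Z, until there is no more data to transfer. */
static size_t count_lookups(size_t file_size)
{
    size_t ofs = 0, disk, len, calls = 0;
    while (ofs < file_size && get_blocks(ofs, &disk, &len) == 0) {
        calls++;
        ofs += len;    /* the VFS, not the FS, would do the i/o here */
    }
    return calls;
}
```

With this map, a 12 KiB file needs only two lookups rather than 24, which
is the whole point of returning the size along with the location.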

> > Obviously if only one file was written, and the disk is not fragmented,
> > the file will be basically contiguous on disk. But if you wrote two
> > or more files simultaneously they would be very likely interleaved, i.e.
>
>That's OK. Imagine that most of the files were laid down beforehand,
>and now we are simply rewriting over them.

Oh, ok, in that case they will be contiguous or almost so.

> > > > need to read the disk inode just once and then you immediately read
> all the
> > > No, I think that's likely. I don't "think it". But yes, i assume that
> > > in the case of a normal file it is quite likely that all the info
> > > involved is in the inode, and we don't need to go hunting elsewhere.
> > >
> > > Wrong?
> >
> > In NTFS wrong. You have the base inode which contains all metadata (i.e.
> > modification times, security information, the conversion table from file
> > offsets to disk locations, ...). However, as data is added to the file,
> > the conversion table for example grows and eventually the inode runs out
> > of space. (In NTFS inodes are fixed size records of 1024 bytes.) When that
>
>Well, that I can imagine. Let's /not/ consider NTFS, then. I'd rather
>consider something that hasn't taken about 5 years to reverse engineer
>in the first place! And NTFS is NOT going to be a candidate, because
>of reliability. Does this happen on unix file systems too ..

In ext2/3 you have indirect blocks so as the file grows you have to read
them in addition to the inode. So this is kind of a very standard
feature... The indirect blocks of ext2/3 are very much like NTFS extent
inodes in function...
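
The cost of those indirect blocks is easy to quantify in a sketch. The
numbers below assume the classic scheme (12 direct pointers, then single,
double and triple indirect) with 4 KiB blocks and 4-byte pointers, i.e.
1024 pointers per indirect block; this is an illustration of the idea, not
ext2's actual code:

```c
#include <assert.h>

/* How many extra metadata blocks must be read to locate a given file
 * block in an ext2-style scheme?  0 means the inode alone suffices. */
static int indirect_depth(unsigned long blk)
{
    const unsigned long direct = 12, per = 1024;  /* pointers per block */
    if (blk < direct) return 0;                   /* inode alone */
    blk -= direct;
    if (blk < per) return 1;                      /* single indirect */
    blk -= per;
    if (blk < per * per) return 2;                /* double indirect */
    return 3;                                     /* triple indirect */
}
```

So only the first 48 KiB of a file is reachable from the inode alone; past
roughly 4 MiB every lookup goes through two levels of indirection.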

> > happens, the NTFS driver either moves the conversion table completely to a
> > 2nd inode (we call this inode an extent inode because it is _only_
> > referenced by the original inode itself) or it splits the conversion table
> > into two pieces and saves the 2nd part to a 2nd (extent) inode. The 2nd
> > (extent) inode can be anywhere in the inode table, wherever the allocator
> > finds a free one. When the 2nd inode fills up, the process is repeated, so
> > we now have a 3rd (extent) inode, and so on, ad infinitum. So NTFS has
>
>Well, something like that I distantly recall as happening on unix fs's,
>but I'd like to know if we expect a file of fixed size which is being
>overwritten without O_TRUNC to have any metadata changes apart from in
>its inode (and there trivially) ...?

There _shouldn't_ be changes needed. For NTFS there certainly aren't. Only
the access times would get changed, that's it. (As long as the file size
remains constant, no metadata except access times need changing.) I can
only guess that for other FS it is the same...

> > > What I am not clear about now is exactly when the data is looked up - I
> > > get the impression from what I have seen of the code that the VFS passes
> > > down a complete request for 1TB, and that the the FS then goes and locks
> > > the inode and chases down the data. What you are saying above gives me
> > > the impression that things are broken down into 512B lumps FIRST in or
> > > above VFS, and then sent to the fs as individual requests with no inode
> > > or fs locking. And I simply don't buy that as a strategy.
> >
> > But that is how the VFS works! The VFS _does_ break up everything into
> > 512B lumps FIRST. Let me show you where this happens (assume current 2.5
>
>OK. Why 512? Why not the blocksize?

sb->s_blocksize = 512;

NTFS does that at mount time, so that is the NTFS blocksize. Blocksize is
not a constant, it is a per mount point variable number...

So your second question doesn't make sense...

But to answer the first question: the lowest common denominator between the
kernel blocksize and the NTFS cluster size is 512. Hence I chose 512. This
is also 1 hard disk sector which fits very nicely with the multi sector
transfer protection applied to metadata on NTFS which is done in 512 byte
blocks (I will spare you the details what that is). Further the kernel
doesn't allow blocksize > PAGE_SIZE, on ia32 4kiB. But NTFS clusters can be
much bigger than that. So I can't just set the blocksize = NTFS cluster
size. Also for metadata access that would be inconvenient.

The code gets very ugly and inefficient if blocksize > 512. An example:
blocksize = 4096 (1 page) and NTFS cluster size = 512.

Now we can no longer do i/o in less than 4096 byte quantities but NTFS
clusters are 512 bytes so when we read 4096 bytes from disk, we get the 512
bytes we wanted and another 3.5kiB of random data. Now think what happens
if we wanted the 2nd 512 bytes? We would need to always work with current
block + ofs into block. Nasty. It can be done. But if done efficiently then
imagine that the next mounted volume has cluster size 8kiB. Suddenly this is
bigger than the blocksize of 4kiB and now all divisions and shifts that
wonderfully worked above now all cause values of zero as results because of
integer division. Now we need to swap all divisions and shifts around
dividing cluster_size / blocksize instead of the other way round. For the
code to cope with both cases, it gets fugly... All that nastiness is nicely
avoided by setting blocksize = 512 and thus guaranteeing that cluster_size
is always >= blocksize. Allows much more efficient maths.
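
The arithmetic point can be seen in a couple of lines (a sketch of the
maths, not NTFS code):

```c
#include <assert.h>

/* Blocks per cluster -- only meaningful when cluster_size >= blocksize.
 * Pinning blocksize at 512 guarantees that, so this integer division
 * never truncates to zero. */
static unsigned blocks_per_cluster(unsigned cluster_size, unsigned blocksize)
{
    return cluster_size / blocksize;
}
```

With blocksize 512 the quotient is always >= 1 (512-byte clusters give 1,
8 KiB clusters give 16), whereas the other way round, e.g. a 4096-byte
blocksize against a 512-byte cluster, the integer division yields 0 and
all the shifts and divisions would have to be written in both directions.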

> > kernel):
> >
> > VFS: on file read we eventually reach:
> >
> > mm/filemap.c::generic_file_read() which calls
> > mm/filemap.c::do_generic_file_read() which breaks the request into units
> > of PAGE_CACHE_SIZE, i.e. into individual pages and calls the file system's
>
>Well, we're at 4096 now (or is it 8192 :-).

Depends on your page size and that varies from arch to arch! (-: This is a
complication because if I move a HD from an Alpha to a ia32, i.e. moving
from PAGE_SIZE 8192 to PAGE_SIZE 4096 I want that file system to work! Not
give me an error message "unsupported blocksize". That is also the reason I
chose 512 bytes in NTFS...

> > address space ->readpage() method for each of those pages.
> >
> > Assuming the file system uses the generic readpage implementation, i.e.
> > fs/buffer.c::block_read_full_page(), this in turn breaks the page into
> > blocksize blocks (for NTFS 512 bytes, remember?) and calls the FS
>
>But not for any other FS. I didn't even know 512 was a possible kernel
>blocksize. In fact, I'm fairly sure it wasn't in the recent past. 1024
>was the minimum, no?

It always was possible. It would have been insane not to allow it. You are
confusing this with the blocksize used by the kernel for device i/o which
is 1024 bytes, i.e. the infamous BLOCK_SIZE constant...

>And I'll choose 4096, thanks!

You can't just choose 4096. Maybe the person who formatted the volume chose
less? Or maybe the disk comes from a different computer/arch which was
using something else?

> > supplied get_block() callback, once for each blocksize block (in the form
> > of a struct buffer_head).
>
>OK. But this will stick at 4096 blocks for other FSs.

Not really. Most FS allow different block sizes. Certainly VFAT, ext2,
ext3, NTFS, by extension HPFS I guess do, and I am fairly sure XFS, does
too. And probably all the other "modern" Linux FS like Reiser and JFS...

>I'm very grateful for this explanation BTW. gfr -> do_gfr -> breakup
>-> readpage. I should be able to trace it from there. I'll expect that
>the pagesize is the blocksize (or close), for optimum performance at
>this point.

Yes. Optimum performance is given when you set blocksize = PAGE_SIZE. But
that is not always possible or practical.

> > This finds where this 512 byte block is, and maps the buffer head (i.e.
>
>THAT's the inode lookup. OK, that's once per block, for the scenario
>I'm imagining. So .. for reads the situation is fine. We can have a
>read/write lock on all FS metadata, and multiple readers will be
>allowed at once. The problem is that as soon as anyone writes metadata,
>then no reads can be allowed anywhere, unless we know where the inode
>is.
>
>So that's the perfect intercept. The get_block callback needs to return
>an extra argument which is the physical block location of the inode
>and I can lock that. That's perfect.
>
>No? Well, maybe we might have to go through several inodes, which all
>need to be locked, so we need to return an array.
>
> > sets up the buffer_head to point to the correct on disk location) and
> > returns it to the VFS, i.e. to block_read_full_page() which then goes and
> > calls get_block() for the next blocksize block, etc...
>
>Yes, this looks good.
>
> > So this is how file reads work. For NTFS, I figured this is too
> > inefficient (and because of other complications in NTFS, too) and I don't
>
>Well, at 512B a time, it looks so!
>
> > use block_read_full_page() any more. Instead, I copied
> > block_read_full_page() and get_block() and rewrote them for NTFS so that
> > they became one single function, which takes the page and maps all the
> > buffers in it in one go, keeping the conversion table from file offsets to
> > disk blocks locked (rw_semaphore) during that period. That is way more
> > efficient as it means I don't need to do a lookup in the conversion table
> > for every 512 byte block any more and instead I perform the lookup for all
> > blocks in the current page in one go.
>
>yes. Well, I didn't follow that detail.

It just means that NTFS does everything on its own. No talking to VFS. In
fact NTFS doesn't implement a get_block() function at all. Its
functionality is merged into ntfs_readpage() and its friends.

NTFS now works like this: VFS says, please read PAGE_CACHE_SIZE bytes of
data from file F at offset X into this provided page, thank you. And NTFS
goes away looks up the blocks, reads the data by talking to the block layer
directly, and returns to the VFS saying: "All done.".

So this example of NTFS shows you why you can't do your intercept at
get_block() level. It simply isn't used by all FS. So you are already
restricting which FS you will support... Remember you had said initially
that you want to support all FS in the kernel... Most people who replied
said "don't do that, support only one or a few FS"... So perhaps you are
starting to agree. (-;

> > 512 if the FS says so. blocksize on NTFS is 512 bytes. Fixed. Otherwise
>
>:-). It won't say so, I somehow think!

I like to personify code because it makes it much easier to talk about it.
You yourself complained about using "our" FS terminology which you are
unfamiliar with... So I am trying to use the personification method so I
can talk in "normal" language... (-;

> > thing instead. But at least for file systems using the VFS helpers a
> > modification of the VFS means changing every single FS driver using this
>
>Only modify those that I want to use. The difficulty is in identifying a
>strategy and a point at which a change can be made.

Ok, now you are getting sensible. You now want to support only a few FS not
all. Good! (-:

> > Or do they all talk to the master node?
> >
> > - the FS mounted on the servers is exported via NFS (or other networking
> > FS) to the workstations which then read/write files on the NFS shares.
>
>NFS is not the most likely candidate because of its known weaknesses, but
>it is a possibility.
>
> > is difficult to talk about it... Your turn to explain your idea better
> > now. (-;
>
>Well, it's hard to go into details because what I am asking for is clues
>to enable the idea to be prototyped and evaluated. There will be a grid
>topology. Each machine will locally be connected to several others. The
>whole grid will form a ring (or a torus, if you see that I am talking
>about a 2-D planar local connectivity graph, so that I must be talking
>about an orientable surface ...). Each machine will have at least one
>disk of its own, and share it with three other machines. There is
>planned redundancy here (i.e. raid as well).

Ok. But you are saying that clients will talk to servers at FS level. Is
this correct? I.e. NFS or some other FS which can talk over the wire? Or
are the clients actually going to mount the "virtual" harddisk that they
will write files to? -- These two questions are really the clincher of the
discussion! -- Can you at least say which of those two approaches are
going to be used? Or are you open minded there, too?

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-06 19:28:13

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

At 18:33 06/09/02, Daniel Phillips wrote:
>On Friday 06 September 2002 19:20, Anton Altaparmakov wrote:
> > As of very recently, Andrew Morton introduced an optimization to this with
> > the get_blocks() interface in 2.5 kernels. Now the file system, when doing
> > direct_IO at least, returns to the VFS the requested block position _and_
> > the size of the block. So the VFS now gains in power in that it only needs
> > to ask for each block once as it is now aware of the size of the block.
> >
> > But still, even with this optimization, the VFS still asks the FS for each
> > block, and then the FS has to lookup each block.
>
>Well, it takes no great imagination to see the progression: get_blocks
>acts on extents instead of arrays of blocks. Expect to see that around
>the 2.7 timeframe.

Isn't that just a matter of terminology? Whether you are calling the
things "variable sized blocks" or "extents" they are still the same thing,
no? So I would say, get_blocks() acts on extents right now. (-: [I may be
missing your point in which case a small explanation would be appreciated...]

Cheers,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-07 13:32:05

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"Chris Siebenmann wrote:"
> You write:
> | but I'd like to know if we expect a file of fixed size which is being
> | overwritten without O_TRUNC to have any metadata changes apart from in
> | its inode (and there trivially) ...?
>
> It depends on the filesystem. A 'traditional' Unix filesystem (original
> V7 or Berkeley FFS derived) will not. A journaling filesystem that is
> journaling the data will write to the log. Something like Reiserfs may

OK.

> I think I have an alternative paper design for what you want, though.

Let's go ...

> Hack up your chosen underlying filesystems to understand two additional
> mount flags: REWRITEO and EXTENDO. A filesystem mounted REWRITEO allows
> all read operations and only write operations to already allocated file
> blocks (and it does not update inode mtime when such writes happen). A

OK. RWO means "overwrite allowed".

> filesystem mounted EXTENDO allows REWRITEO operations and files to be
> extended, but no other write operations; writes under EXTENDO update
> inode metadata as they normally would.

Hmm.

> Define two new internal errnos, returned by the filesystem to mean
> 'operation requires EXTENDO mount' or 'operation requires full write
> mount'.
>
> Create an overlay pseudo-filesystem type (you could hack this into
> the VFS, but it's simpler to make it a new filesystem), and a user
> level helper for lock management. This pseudo-filesystem forwards
> VFS operations to an actual underlying filesystem, traps and handles
> operations that require changing the underlying filesystem's mount
> options, returns the results to the user with any editing they need, and
> handles lock state transitions.

Well, not a bad idea anyway to try an overlay first. That seems to
counter most objections I've heard on its own!

> IO to the underlying filesystem is done O_DIRECT, to bypass caching
> both ways. Because the overlay filesystem does the actual opens, it can
> transparently add O_DIRECT to the open flags. The overlay filesystem

Well, that was easy to do anyway. I hacked the VFS mount calls to
support a MNT_DIRECT flag and hacked sys_open to notice that flag on
the mount when it was called, and do an O_DIRECT open.

> needs to trap opens (and closes) in order to keep track of what files
> are open on the underlying filesystem; I suspect it needs to dummy up
> file objects and do some forwarding there in order to keep track of
> everything.

All we will do at the end of the day is close the file!
I don't see what needs tracking until then ...

> The underlying filesystem is normally mounted REWRITEO on all nodes.
> A single instance may be mounted EXTENDO while the others continue to

Oh, OK.

> be in REWRITEO. Full write is only allowed when no one else is using
the filesystem at all. This is all managed by a lock manager server
> for the disk store, which talks to clients on each node using the
> particular store.
>
> When a node requests EXTENDO, the lock manager verifies that everyone
> else is in REWRITEO or tells the node to stall on that until everyone
> else is. When a node requests full write, the lock manager asks

Hmm. This can starve.

> all other nodes to temporarily unmount the filesystem and stall IO


That's because you don't have access to the dcache entries for the
underlying fs? I think one can get them in a finer grained way. One can
certainly vamoosh them all at once - there's a call for that already.
It walks the dcache and kills anything pointing to the right system.

> operations on it; when full write is released, the lock manager tells
> everyone they can go to REWRITEO and start IO again. When a node joins
> the lock manager, it asks for REWRITEO and the lock manager verifies
> that no one is in write mode at the moment before saying 'go ahead'.

Yes.
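
For concreteness, the lock manager's grant policy as described can be
written down in a few lines. The state names are Chris's; the policy check
itself is a guessed formalisation of the rules as stated above, not code
from any existing lock manager:

```c
#include <assert.h>
#include <stddef.h>

/* Mount states from the sketched design. */
enum mode { UNMOUNTED, REWRITEO, EXTENDO, FULLWRITE };

/* May node `who` move to state `want`, given the states of all n nodes?
 * EXTENDO needs every other node in REWRITEO (or not mounted at all);
 * full write needs no one else using the filesystem at all. */
static int may_grant(const enum mode *nodes, size_t n, size_t who,
                     enum mode want)
{
    for (size_t i = 0; i < n; i++) {
        if (i == who)
            continue;
        if (want == FULLWRITE && nodes[i] != UNMOUNTED)
            return 0;
        if (want == EXTENDO && nodes[i] != REWRITEO &&
            nodes[i] != UNMOUNTED)
            return 0;
    }
    return 1;
}
```

A request that cannot be granted would stall (which is where the
starvation Peter mentions can come in) until the other nodes drop back to
REWRITEO or unmount.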

> On transitions between states, the kernel overlay filesystem closes
> down all references held (open files, etc) to the underlying filesystem,
> unmounts it (optimization: some transitions can be done by remount, for

This is because you can't get at the underlying dcache easily.

> example EXTENDO -> REWRITEO), and then when the user level lock manager
> says it's okay remounts the filesystem with the new mount. It must then
> re-obtain all the underlying filesystem inodes and file references it
> was using. There are two ways:

Really? Why? Can't we just lose our own dcache as well?

> - you can steal code from the NFS server, which only works on some
> filesystems because it assumes constant inode numbers over the
> lifetime of the filesystem. (For example, for a while it didn't
> work on Reiserfs.)

Well, I feel bound to comment that the fact that NFS didn't work
universally for a while didn't seem to stop people wanting to use it!

I'd be quite happy to make that assumption, and let RFS worry about it!

> - you pull the filenames of open files from the dentry reference
> you're holding, and reopen them. If it fails, mark the file as

Oh, I see, that's why you wanted them.

> errored-out and return ESTALE for all further IO against it.
> [Somewhat hazardous, since the filename may now point to a
> different file.]

Well, dunno.

> When an EXTENDO client drops down to REWRITEO, the disk store lock
> manager must kick all clients to revalidate the filesystem by executing
a null state transition (REWRITEO to REWRITEO). This ensures that they
> immediately see the full size of the newly extended file.

Hmm. OK. They might want to wait until they need to know, but that's
OK.

> When an operation fails because it needs EXTENDO or full write, the
> pseudo filesystem layer stalls the request and notifies the user level
> process that it needs a lock at the relevant level. The user level
> process goes off to negotiate this with the server, which will call back
> to other clients as necessary and then notify this client that it can
> go ahead. When the mount is upgraded to the needed level, the operation
> goes forward.
>
> Unmounting the underlying filesystem on lock state changes means that
> you flush all metadata automatically. By having REWRITEO, we know that
> we can safely cache metadata -- no one is going to be changing it by the
> rules of the game.

I might try this.

> EXTENDO is a wart, and it may be worth eliminating it; as it is,
> some clients may see the newly allocated space even before EXTENDO is
> dropped, but some may not. This design assumes that it's okay to not
> necessarily let other people at the data until the extending client
> drops the lock.
>
> This design is inefficient if there are many full write operations;

Well, of course. The point is that it allows ordinary caching
normally, and then causes all caches to be dropped whenever anyone
anywhere does anything that might cause some metadata change somewhere.

I think one can be more exact than that, but it's OK as a tryout.

> that would be creating files, creating directories, renaming files,
> removing files, etc. But if it is mostly reading and rewriting (the
> assumptions I've seen) it should go very nicely.
>
> The design does assume that the inode access and modification times

They are unimportant.

> are unimportant. I don't think you can get good performance without
> this assumption.

Correct.

> Hopefully this is clear enough.

It is. Thank you.

Peter

2002-09-09 16:25:44

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

A concrete question wrt the current file i/o data flow ...

"Anton Altaparmakov wrote:"
> VFS: on file read we eventually reach:
>
> mm/filemap.c::generic_file_read() which calls
> mm/filemap.c::do_generic_file_read() which breaks the request into units
> of PAGE_CACHE_SIZE, i.e. into individual pages and calls the file system's
> address space ->readpage() method for each of those pages.
>
> Assuming the file system uses the generic readpage implementation, i.e.
> fs/buffer.c::block_read_full_page(), this in turn breaks the page into
> blocksize blocks (for NTFS 512 bytes, remember?) and calls the FS
> supplied get_block() callback, once for each blocksize block (in the form

There is an inode arg passed to the fs get_block. Is this really the
inode of a file, or is it a single inode associated with the mount?
(I see we only use it to get to the sb and thence the blocksize.)

I am confused because ext2_get_block says it's the file inode:

static int ext2_get_block(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh_result, int create)
...
int depth = ext2_block_to_path(inode, iblock, offsets, &boundary);
...

* ext2_block_to_path - parse the block number into array of offsets
* @inode: inode in question (we are only interested in its superblock)
^^^^^^^^^^^^^^^^^^^^^^^^
* @i_block: block number to be parsed
...

and the vfs layer seems to pass mapping->host, which I believe should
be associated with the mount.


int block_read_full_page(struct page *page, get_block_t *get_block)
{
struct inode *inode = page->mapping->host;

(all kernel 2.5.31).

> of a struct buffer_head).
>
> This finds where this 512 byte block is, and maps the buffer head (i.e.

Now, you say (and I believe you!) that the get_block call finds where
the block is. But I understand you to say that the data passed in by VFS
is in local offsets .. that corresponds to what I see:
block_read_full_page() gets passed a page struct and then calculates the
first and last (local?) blocks from page->index ...

iblock = page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
lblock = (inode->i_size+blocksize-1) >> inode->i_blkbits;

and that makes it seem as though the page already contained an offset
relative to the device, or logical in some other way that will now
be resolved by the fs specific get_block() call. However, the ext2
get_block only consults the superblock, nothing else when resolving
the logical blk number. So no inode gets looked at, as far as I can
see, at least in the simple cases ("direct inode"?). In the general
case there is some sort of chain lookup, that starts with the
superblock and goes somewhere ...

partial = ext2_get_branch(inode, depth, offsets, chain, &err);

/**
* ext2_get_branch - read the chain of indirect blocks leading to data
* @inode: inode in question
* @depth: depth of the chain (1 - direct pointer, etc.)
* @offsets: offsets of pointers in inode/indirect blocks
* @chain: place to store the result
* @err: here we store the error value

Which seems to be trying to say that the inode arg is something that's
meant to be really the inode, not just some dummy that leads to the
superblock for purposes of getting the blksiz.

So I'm confused.


I'm also confused about how come the page already contains what seems
to be a device-relative offset (or other logical) block number.

> sets up the buffer_head to point to the correct on disk location) and

But the setup seems to be trivial in the "direct" case for ext2.
And I don't see how that can be because the VFS ought /not/ to have
enough info to make it trivial?

> returns it to the VFS, i.e. to block_read_full_page() which then goes and
> calls get_block() for the next blocksize block, etc...

Now, if I go back to mm/do_generic_file_read(), I see that it's working
on a struct file, and passes that file struct to the fs specific
readpage.

error = mapping->a_ops->readpage(filp, page);

I can believe that maybe the magic is there. But no, that's not it:
it's the generic ext2_readpage that gets used, and that _drops_
the file pointer!

static int ext2_readpage(struct file *file, struct page *page)
{
return mpage_readpage(page, ext2_get_block);
}

so, um, somehow the page contained all the info necessary to do lookups
on disk in ext2, no FS info required.

Do I read this right? How the heck did that page info get filled in in
the first place? How can it be enough? It can't be!

Peter

2002-09-09 19:17:34

by Peter T. Breuer

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

"A month of sundays ago Jan Harkes wrote:"
> On Mon, Sep 09, 2002 at 06:30:21PM +0200, Peter T. Breuer wrote:
> > A concrete question wrt the current file i/o data flow ...

> > I am confused because ext_get_block says it's the file inode:
...
> > and the vfs layer seems to pass mapping->host, which I believe should
> > be associated with the mount.
>
> mapping->host is the in memory representation of the file's inode (which
> is stored in the inode cache). Basically all filesystem operations go

Ah. Thank you. So inode->i_ino might be sufficient shared info to lock
on at the VFS level (well, excluding dropping the [di]cache data for it
on remote kernels, and locking free-block map updates in case it's
a create or extend operation) ...

> through the pagecache, i.e. it is as if all files are memory mapped, but
> not visible to userspace.

OooooKay. And it looks as though the mapping remembers quite a bit about
how it was calculated, which is good news.

> Each in-memory cached inode has an 'address_space' associated with it.
> This is basically a logical representation of the file (i.e.
> offset/length). This mapping or address_space has a 'host', which is the
> actual inode that is associated with the data.

OK.

> Now lets consider inode->i_mapping->host, in most cases inode and host
> are identical, but in some cases (i.e. Coda) these are two completely
> different objects; Coda inodes don't have their own backing storage, but

Wonderful. And which one is the more abstract?

> rely on an underlying 'container file' hosted by another filesystem to
> hold the actual file data. Stacking filesystems (overlayfs, cryptfs,
> sockfs, Erik Zadok's work) use the same technique.
>
> Each physical memory page (struct page) is associated with one specific
> address space (i.e. page->mapping). So depending on how we get into the

Depending? Groan.

> system, we can use inode->i_mapping->host, or page->mapping->host to
> find the inode/file object we have to write to and page->index is the

Oh, OK, I see. That's harmless news.

> offset >> PAGE_SIZE within this file. New pages are allocated and
> initialized in mm/filemap.c:page_cache_read, this is called from both
> the pagefault handler, and when generic_read/write cannot find a page in
> the page cache.

OK. You are saying that page info is filled out there, from the given
file pointer?

> Now everywhere in the code that references a 'struct inode' or
> 'mapping->host', is basically directly pointing to a cached object in
> the inode cache and in order to 'disable' caching, all of these would
> have to be replaced with i_ino, so that the inode can be fetched with

Oh, I see. Replace the object with the info necessary to retrieve the
info (again). Icache objects are reference counted, of course. I don't
want to think about this yet!

> iget. Similarily the pages and address_space objects are 'basically
> cached copies of on-disk data' that can become stale as soon as the
> underlying media is volatile. Several filesystem, NFS, Coda, Reiserfs,

Well, is there a problem with using the i_ino everywhere instead of the
inode struct itself? That is a straightforward (albeit universal)
change to make, and means providing an interface to the icache objects
using their i_ino and fs as index. But I really don't want to go there
yet ..

> actually need more than just i_ino to unambiguously identify and
> retrieve the struct inode, and use iget5_locked with filesystem specific

Hmm. OK.

> test/set functions and data, to obtain the in-memory inode.
>
> Which is just one reason why everyone went through the roof when you
> suggested to completely disable caching.

Well, :-).

> For pagecache consistency, NFS uses "NFS semantics" and uses
> mm/filemap.c:invalidate_inode_pages to flush unmodified/clean pages from

I see. It takes the inode as argument, but only wants the i_mapping
off it. Also good news. It walks what seems to be a list of associated
pages in the mapping.

> the cache. This is done to keep them consistent with the server. I
> believe NFS semantics only requires them to flush data when a file is
> opened and hasn't been referenced for at least 30 seconds, or was last

I've noticed.

> opened/checked more than 30 seconds ago. If you want better guarantees
> from NFS, you need to use file locking.
>
> Dirty pages can only be thrown out after they have been written
> back. Writing is typically asynchronous, the FS marks a page as dirty,
> then the VM/bdflush/updated/whatever walks all the 'address-spaces' and
> creates buffer heads for any dirty pages it bumps into. These
> bufferheads are then picked up by another daemon that sends them off to
> the underlying block device.

I'll try and make use of this info (later). It's very valuable. Thank you!

> Coda uses AFS or session semantics, which means that we don't care about
> whether cached pages are stale until after the object is closed. AFS
> semantics allows us to avoid the VM mangling to flush cached pages and
> other objects from memory. We only flush the 'dcache entry': when we get
> notified of an updated version on the server, we unhash the cached lookup
> information and subsequent lookups will simply return the new file.

What function flushes a particular dcache entry, if I may ask?

> Regular on-disk filesystems have the very strong Unix Semantics.
> Basically when process A writes "A" and then process B writes "B" the
> file will contain "AB", but will never contain "BA", "A", or "B".

OK. So timing is important.

> - For AFS, the file will contain "B" because process B was the last one
> to close the file.

I see. Writes can/will delay until forced.

> - With Coda we have an update conflict, both "A" and "B" are preserved
> as 2 'conflicting' versions of the same file.
>
> - With NFS it depends when the writeback of dirty pages kicks in, the
> file may contain either "A" or "B".

Interesting. But not both. The implication is that they got different
copies of the file, not a pointer to the same file.

Well, that gives quite a lot of room to play with. But I think I can
manage the strong unix semantics. Thanks again.

> But without external synchronization (i.e. file-locking) none of these
> network filesystems will have the "AB" version which is only possible
> with filesystems that provide Unix semantics. There was some other

Yes. I see the problem.

> semantic model that could actually lead to the file ending up with "BA",
> but I don't think there are any real-life examples.

Well, this would happen with locking if the lock were on a third node
and there were a race to obtain the lock between two other writers.
Then the race could come out A B or B A, but whichever way it went,
umm, the other would have its cache invalidated after the first op
(well, before the first op got under way), and be forced to reread ..
almost anything can happen, depending on the application code.

Peter

2002-09-17 11:15:17

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: (fwd) Re: [RFC] mount flag "direct"

At 17:30 09/09/02, Peter T. Breuer wrote:
>A concrete question wrt the current file i/o data flow ...

Hi, sorry for the delay in replying... I am rather busy, organising a house move...

>"Anton Altaparmakov wrote:"
> > VFS: on file read we eventually reach:
> >
> > mm/filemap.c::generic_file_read() which calls
> > mm/filemap.c::do_generic_file_read() which breaks the request into units
> > of PAGE_CACHE_SIZE, i.e. into individual pages and calls the file system's
> > address space ->readpage() method for each of those pages.
> >
> > Assuming the file system uses the generic readpage implementation, i.e.
> > fs/buffer.c::block_read_full_page(), this in turn breaks the page into
> > blocksize blocks (for NTFS 512 bytes, remember?) and calls the FS
> > supplied get_block() callback, once for each blocksize block (in the form
>
>There is an inode arg passed to the fs get_block. Is this really the
>inode of a file, or is it a single inode associated with the mount
>(I see we only use it to get to the sb and thence the blocksize)

It is the inode of a file. In fact the inode of the file for which
get_block is called. Otherwise how would the filesystem know which file's
block needs to be looked up?

>I am confused because ext_get_block says it's the file inode:
>
> static int ext2_get_block(struct inode *inode, sector_t iblock, struct
> buffer_head *bh_result, int create)
> ...
> int depth = ext2_block_to_path(inode, iblock, offsets, &boundary);
> ...
>
> * ext2_block_to_path - parse the block number into array of offsets
> * @inode: inode in question (we are only interested in its superblock)
> ^^^^^^^^^^^^^^^^^^^^^^^^
> * @i_block: block number to be parsed
> ...
>
>and the vfs layer seems to pass mapping->host, which I believe should
>be associated with the mount.

No. mapping->host _is_ the inode to which this address space mapping
belongs, i.e. mapping->host->i_mapping == mapping (I am thinking about the
usual case here; let's ignore weird and wonderful things done by various
network fs).

> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> struct inode *inode = page->mapping->host;
>
>(all kernel 2.5.31).
>
> > of a struct buffer_head).
> >
> > This finds where this 512 byte block is, and maps the buffer head (i.e.
>
>Now, you say (and I believe you!) that the get_block call finds where
>the block is. But I understand you to say that the data pased in by VFS
>is in local offsets .. that corresponds to what I see:
>block_read_full_page() gets passed a page struct and then calculates the
>first and last (local?) blocks from page->index ...

^^^^ Yes: local, or better, logical, i.e. a file offset.

> iblock = page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> lblock = (inode->i_size+blocksize-1) >> inode->i_blkbits;
>
>and that makes it seem as though the page already contained an offset
>relative to the device, or logical in some other way that will now
>be resolved by the fs specific get_block() call.

Correct. The VFS initializes the page->index to the logical file offset (in
units of PAGE_CACHE_SIZE blocks). That is how we know which file offset the
page belongs to.

This is then used as you show above to convert the file offset into
blocksize sized blocks in the code you quoted above.

get_block is then given each block and looks it up.

> However, the ext2
>get_block only consults the superblock, nothing else when resolving
>the logical blk number. So no inode gets looked at, as far as I can
>see, at least in the simple cases ("direct inode"?). In the general
>case there is some sort of chain lookup, that starts with the
>superblock and goes somewhere ...
>
> partial = ext2_get_branch(inode, depth, offsets, chain, &err);
>
> /**
> * ext2_get_branch - read the chain of indirect blocks leading to
> data
> * @inode: inode in question
> * @depth: depth of the chain (1 - direct pointer, etc.)
> * @offsets: offsets of pointers in inode/indirect blocks
> * @chain: place to store the result
> * @err: here we store the error value
>
>Which seems to be trying to say that the inode arg is something that's
>meant to be really the inode, not just some dummy that leads to the
>superblock for purposes of getting the blksiz.

Yes.

>So I'm confused.

(-: I hope you are less confused by now. The inode must be the actual inode
and not some fake object otherwise get_block doesn't know which file it is
working on which would make looking up the file offsets and converting them
to disk positions impossible.

>I'm also confused about how come the page already contains what seems
>to be a device-relative offset (or other logical) block number.

The VFS sets that up. For example, see mm/filemap.c::

do_generic_file_read() ->
        alloc_page();
        add_to_page_cache_lru() ->
                add_to_page_cache();

alloc_page just allocates a page and the add_ functions actually put it in
its correct place setting up page->index and entering it properly into the
radix tree.

include/linux/pagemap.h::___add_to_page_cache() does the actual page->index
setting which you will find is called in the above call chain.

> > sets up the buffer_head to point to the correct on disk location) and
>
>But the setup seems to be trivial in the "direct" case for ext2.
>And I don't see how that can be because the VFS ought /not/ to have
>enough info to make it trivial?

That is entirely fs dependent! Each fs works in a completely different way.
The more get_block functions you read, the more different methods you will see.

> > returns it to the VFS, i.e. to block_read_full_page() which then goes and
> > calls get_block() for the next blocksize block, etc...
>
>Now, if I go back to mm/do_generic_file_read(), I see that it's working
>on a struct file, and passes that file struct to the fs specific
>readpage.
>
> error = mapping->a_ops->readpage(filp, page);
>
>I can believe that maybe the magic is there. But no, that's not it:
>it's the generic ext2_readpage that gets used, and that _drops_
>the file pointer!
>
> static int ext2_readpage(struct file *file, struct page *page)
> {
> return mpage_readpage(page, ext2_get_block);
> }
>
>so, um, somehow the page contained all the info necessary to do lookups
>on disk in ext2, no FS info required.

Correct. You should _never_ look at the file argument to readpage because
it can be NULL, and in such circumstances you will choke on that...

The page is all you need:

page->index tells you the logical file offset in units of PAGE_CACHE_SIZE
inside the file.

page->mapping->host gives you the inode of the file and thus all the
information you need to resolve the page to its on disk location.

>Do I read this right? How the heck did that page info get filled in in
>the first place? How can it be enough? It can't be!

Of course it can be. ->index and ->mapping are both set at the same time in
the ___add_to_page_cache static inline helper. Both are fully sufficient in
that they specify the inode/file (page->mapping->host), the volume this
file belongs to (page->mapping->host->i_sb), the offset in the file the
page belongs to (page->index).

What more do you need?

Best regards,

Anton


--
"I haven't lost my mind... it's backed up on tape." - Peter da Silva
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/