2004-06-02 23:17:48

by Peter Braam

Subject: RE: [PATCH/RFC] Lustre VFS patch, version 2

Hello!

The feedback on the Lustre patches was of very high quality; thanks a
lot for studying them carefully. Things are simpler now.

Oleg Drokin and I discussed the emails extensively and here is our reply.
We have attached another collection of patches, addressing many of the
concerns.
We felt it was perhaps easier to keep this all in one long email.

People requested to see the code that uses the patch. We have uploaded that
to:

ftp://ftp.clusterfs.com:/pub/lustre/lkml/lustre-client_and_mds.tgz

The client file system is the primary user of the kernel patch, in the
llite directory. The MDS server is a sample user of do_kern_mount. As
requested I have removed many other things from the tar ball to make
review simple (so this won't compile or run).

1. Export symbols concerns by Christoph Hellwig:

Indeed we can do without __iget, kernel_text_address, reparent_to_init
and exit_files.

We actually need do_kern_mount and truncate_complete_page.
do_kern_mount is used because we use a file system namespace in servers
in the kernel without exporting it to user space (mds/handler.c). The
server file systems are ext3 file systems, but we replace VFS locking
with DLM locks, and it would take considerable work to export that as a
file system.

truncate_complete_page is used to remove pages in the middle of a file
mapping when lock revocations happen (llite/file.c
ll_extent_lock_callback, calling ll_pgcache_remove_extent).

2. lustre_version.patch concerns by Christoph Hellwig:

This one can easily be removed, but the kernel version alone does not
necessarily represent anything useful. There are tons of people
patching their kernels, even applying parts of newer kernels while
leaving the kernel version at its old value (distributions immediately
come to mind). So we still need something to identify the version of
the necessary bits, e.g. the version of the intent API.

3. Introduction of lock-less version of d_rehash (__d_rehash) by
Christoph Hellwig:

In some places Lustre needs to do several things to dentries with the
dcache lock already held, e.g. traverse the alias dentries of an inode
to find one with the same name and parent as the one we already have.
Lustre can invalidate busy dentries, which we put on a list. If these
are looked up again, concurrently, we find them on this list and re-use
them, to avoid having several identical aliases in an inode. See
llite/{dcache.c,namei.c} ll_revalidate and the lock callback function
ll_mdc_blocking_ast, which calls ll_unhash_aliases. We use d_move to
manipulate dentries associated with raw inodes and names in ext3.
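As a userspace illustration of the alias re-use described above, the walk
over an inode's alias list might look roughly like this. All names and
fields here are invented stand-ins and only loosely mirror the 2.6-era
dcache structures, not the actual Lustre code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kernel structures involved. */
struct stub_dentry {
    const char *name;
    struct stub_dentry *parent;
    struct stub_dentry *next_alias; /* models the inode's alias list */
    int invalid;                    /* models a "Lustre invalidated" flag */
};

struct stub_inode {
    struct stub_dentry *aliases;    /* head of the alias list */
};

/* Walk the inode's alias list (in the kernel this would run with the
 * dcache lock held, hence the need for a lock-less __d_rehash) looking
 * for an invalidated alias with the same name and parent, so it can be
 * re-used instead of creating a second identical alias. */
struct stub_dentry *find_reusable_alias(struct stub_inode *inode,
                                        const char *name,
                                        struct stub_dentry *parent)
{
    struct stub_dentry *d;

    for (d = inode->aliases; d != NULL; d = d->next_alias)
        if (d->invalid && d->parent == parent && strcmp(d->name, name) == 0)
            return d;   /* caller would re-hash and re-use this dentry */
    return NULL;        /* no match: caller instantiates a new dentry */
}
```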

4. VFS intent API changes to the kernel-exported API, concern by
Christoph Hellwig:

With slight modification it is possible to reduce the changes to just
renaming the intent structure itself and some of its fields.

This renaming was requested by Linus, but we can change the names back
easily if needed; that would avoid any API change. If there are other
users, please let us know what to do.

All the functions can easily be split into intent-expecting ones (with
some suffix in the name, like _it) and old-API ones that just
initialise the intent to something sensible and then call the
corresponding intent-expecting function. No harm should be done to
external filesystems this way. We have modified the VFS intent API
patch to achieve this.
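The split described above can be sketched like this. This is a minimal
userspace sketch; the enum values, field names, and function names are
assumptions for illustration, not the actual patch code:

```c
#include <assert.h>

/* Hypothetical intent shape. */
enum it_op { IT_LOOKUP = 1, IT_OPEN = 2, IT_CREAT = 4 };

struct lookup_intent {
    enum it_op it_op;
    int it_flags;
};

/* New-style, intent-expecting entry point (the "_it" variant). */
int do_lookup_it(const char *name, struct lookup_intent *it)
{
    (void)name;
    return (int)it->it_op;  /* stand-in result so the flow is observable */
}

/* Old-API entry point: initialises a sensible default intent and then
 * calls the corresponding intent-expecting function, so existing
 * external filesystems keep working unchanged. */
int do_lookup(const char *name)
{
    struct lookup_intent it = { IT_LOOKUP, 0 };
    return do_lookup_it(name, &it);
}
```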

5. Some objections from Trond Myklebust about open flags in exec, cwd
revalidation, and revalidate_counter patch:

We have fixed the exec open flags issue (our error). The
revalidate_counter patch was also dropped, since we can do this inside
Lustre as well. CWD revalidation can be converted to FS_REVAL_DOT in
the fs flags instead, but we still need part of that patch, the
LOOKUP_LAST/LOOKUP_NOT_LAST part. Lustre needs to know when we have
reached the last component in the path, so that the intent can be
looked at. (It seems we cannot use LOOKUP_CONTINUE for this reliably.)

6. from Trond Myklebust:

> The vfs-intent_lustre-vanilla-2.6.patch + the "intent_release()"
> code. What if you end up crossing a mountpoint? How do you then know
> to which superblock/filesystem the private field belongs if there are
> more than one user of this mechanism?

Basically the intent only makes sense for the last component. Our code
checks that, and if we are looking up a component before the last one,
a dummy IT_LOOKUP intent is created on the stack and we work with that.
Perhaps the same is true for other filesystems that would like to use
this mechanism.
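The rule above might be sketched as follows: only the last path
component is looked up with the caller's real intent; every earlier
component gets a throwaway IT_LOOKUP intent built on the stack. This is
a toy model with hypothetical names, not the real path-walk code:

```c
#include <assert.h>

enum it_op { IT_LOOKUP = 1, IT_OPEN = 2 };

struct lookup_intent {
    enum it_op it_op;
};

/* Stand-in for the per-component lookup; returns the op it saw so the
 * behaviour can be checked. */
int lookup_component(const char *name, struct lookup_intent *it)
{
    (void)name;
    return (int)it->it_op;
}

/* Intermediate components get a dummy IT_LOOKUP intent on the stack;
 * only the last component sees the caller's real intent. */
int walk_component(const char *name, int is_last,
                   struct lookup_intent *real_it)
{
    struct lookup_intent dummy = { IT_LOOKUP };
    return lookup_component(name, is_last ? real_it : &dummy);
}
```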

7. raw operations concerns by various people:

We have now implemented an alternative approach to this, which takes
effect when the parent lookup is done, using intents. For setattr we
managed to remove the raw operations altogether (praying that we
haven't forgotten some awful problem we solved that led to the
introduction of setattr_raw in the first place).

The correctly filled intent is recognised by the filesystem's lookup or
revalidate method. After the parent is looked up, the correct "raw"
server call is executed, based on the intent, within the file system.
Then a special flag is set in the intent; the caller of the parent
lookup checks for the flag and, if it is set, returns immediately with
the exit code supplied in the intent, without instantiating child
dentries.
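The flow described above can be modelled roughly like this. The
it_disposition/it_status names are illustrative guesses at the intent
fields, and the two functions are stand-ins, not the patch itself:

```c
#include <assert.h>

struct lookup_intent {
    int it_op;
    int it_disposition;  /* set by the fs: operation already executed */
    int it_status;       /* exit code supplied by the fs */
};

/* Stand-in for the filesystem's intent-aware parent lookup: the fs
 * recognises e.g. a setattr intent, performs the "raw" server call
 * itself, then marks the intent as handled and records the result. */
void fs_lookup(struct lookup_intent *it, int server_status)
{
    it->it_disposition = 1;
    it->it_status = server_status;
}

/* Stand-in for the VFS caller of the parent lookup: if the fs flagged
 * the intent as handled, return its exit code immediately instead of
 * instantiating child dentries. */
int vfs_parent_lookup(struct lookup_intent *it, int server_status)
{
    fs_lookup(it, server_status);
    if (it->it_disposition)
        return it->it_status;  /* done: skip child instantiation */
    /* ... otherwise instantiate child dentries as usual ... */
    return 0;
}
```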

This needs some minor changes to VFS, though. There are at
least two approaches.

One is to not introduce any new methods and just rely on the
filesystem's methods to do everything. For this to work the filesystem
needs to know the remaining path to be traversed (we can fill nd->last
with the remaining path before calling into the fs). In the root
directory of the mount, we need to call a revalidate (if supported by
the fs) on the mountpoint to intercept the intent, after we have
crossed the mountpoint. We have this approach implemented in the
attached patch. Does it look better than the raw operations?

Much simpler for us is to add an additional inode operation, a
"process_intent" method, that would be called when a LOOKUP_PARENT
sort of lookup is requested and we are about to leave link_path_walk()
with the nameidata structure filled and everything ready. Then the
same flag is set in the intent, and everything else works as in the
previous approach.

We believe both methods are less intrusive than the raw methods, but
slightly more delicate.

8. Mountpoint-crossing issues during rename (and link) noticed by
Arjan van de Ven:

Well, indeed this can happen if the source or destination is a
mountpoint on the client but not on the server; this needs to be
addressed by the individual filesystems that choose to implement those
raw methods.

9. dev_readonly patch concerns by Jens Axboe:

We already clarified why we need it in this exact way. But there were
some valid suggestions to use other means, like a dm-flakey device
mapper module, so we decided to write a failure-simulating DM module.

10. "Have these patches undergone any significant test?" by Anton Blanchard:

There are two important questions, I think:
- Do the patches cause damage?
Probably not anymore. SUSE has done testing, and it appears the
original patch I attached didn't break things (after one fix was
made).
- Is Lustre stable?
On 2.4 Lustre is quite stable. On 2.6 we have done testing, but,
for example, never on more than 40 nodes. We don't consider it
rock solid on 2.6, though it does pass POSIX and just about every
other benchmark without failures.

Since the patches were modified for this discussion there are of
course some new issues which Oleg Drokin is now ironing out.

Our test results are visible at https://buffalo.lustre.org

Well, how close are we now to this being acceptable?

- Peter J. Braam & Oleg Drokin -


Attachments:
export-vanilla-2.6.patch (3.07 kB)
header_guards-vanilla-2.6.patch (1.40 kB)
lustre_version.patch (469.00 B)
vanilla-2.6.6 (363.00 B)
vfs_intent-flags_rename-vanilla-2.6.patch (7.74 kB)
vfs-dcache_locking-vanilla-2.6.patch (2.54 kB)
vfs-dcache_lustre_invalid-vanilla-2.6.patch (1.19 kB)
vfs-do_truncate.patch (3.12 kB)
vfs-intent_api-vanilla-2.6.patch (15.32 kB)
vfs-intent_lustre-vanilla-2.6.patch (909.00 B)
vfs-lookup_last-vanilla-2.6.patch (1.70 kB)
vfs-raw_ops-vanilla-2.6.patch (6.06 kB)

2004-06-03 14:00:00

by Christoph Hellwig

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On Wed, Jun 02, 2004 at 05:15:27PM -0600, Peter J. Braam wrote:
> People requested to see the code that uses the patch. We have uploaded that
> to:
>
> ftp://ftp.clusterfs.com:/pub/lustre/lkml/lustre-client_and_mds.tgz
>
> The client file system is the primary user of the kernel patch, in the
> llite directory. The MDS server is a sample user of do_kern_mount. As
> requested I have removed many other things from the tar ball to make
> review simple (so this won't compile or run).

> We actually need do_kern_mount and truncate_complete_page.
> do_kern_mount is used because we use a file system namespace in servers in the
> kernel without exporting it to user space (mds/handler.c). The server
> file systems are ext3 file systems but we replace VFS locking with DLM
> locks, and it would take considerable work to export that as a file
> system.

Yikes. I'd rather not see something like this go in; better to work
on properly integrating the MDS code with the filesystem. There's also
lots of duplication or near-duplication of VFS functionality in that
directory, and the fsfilter horrors. I'd suggest you get that cleaned
up and we'll try to merge it into 2.7, okay?


> truncate_complete_page is used to remove pages in the middle of a file
> mapping when lock revocations happen (llite/file.c
> ll_extent_lock_callback, calling ll_pgcache_remove_extent).

Most of ll_pgcache_remove_extent probably wants to be a proper VFS
function. Again, only interesting if the rest of lustre gets merged.
>
> 2. lustre_version.patch concerns by Christoph Hellwig:
>
> This one can easily be removed, but kernel version alone does not
> necessarily represent anything useful. There are tons of people
> patching their kernel with patches, even applying parts of newer
> kernel and still leaving kernel version at its old value
> (distributions immediately come to mind). So we still need something
> to identify version of necessary bits. E.g. version of intent API.

Well, bad luck for you. It's not like there's much interest in merging
any of these patches into the tree without the actual users anyway...

> 3. Introduction of lock-less version of d_rehash (__d_rehash) by
> Christoph Hellwig:
>
> In some places Lustre needs to do several things to dentries with
> dcache lock held already, e.g. traverse alias dentries in inode to
> find one with same name and parent as the one we have already. Lustre
> can invalidate busy dentries, which we put on a list. If these are
> looked up again, concurrently, we find them on this list and re-use
> them, to avoid having several identical aliases in an inode. See
> llite/{dcache.c,namei.c} ll_revalidate and the lock callback function
> ll_mdc_blocking_ast which calls ll_unhash_aliases. We use d_move to
> manipulate dentries associated with raw inodes and names in ext3.

I've only taken a short look at the dcache operations you're doing, and
it looks a little fishy and very sensitive to small changes in internal
dcache semantics. You're also missing e.g. the LSM callbacks, it seems.

Have you talked to Al about that code?

> 4. VFS intent API changes to the kernel-exported API, concern by
> Christoph Hellwig:
>
> With slight modification it is possible to reduce the changes to just
> changes in the name of intent structure itself and some of its
> fields.
>
> This renaming was requested by Linus, but we can change names back
> easily if needed, that would avoid any api change. Are there other
> users, please let us know what to do?

Again, you're changing a filesystem API; we have a bunch of in-tree
users that can be modular, so it's likely there are out-of-tree users,
too. The new semantics might be much nicer, but it's 2.7 material.

> 7. raw operations concerns by various people:
>
> We have now implemented an alternative approach to this, that is
> taking place when parent lookup is done, using intents. For setattr
> we managed to remove the raw operations altogether (praying that we
> haven't forgotten some awful problem we solved that led to the
> introduction of setattr_raw in the first place).
>
> The correctly filled intent is recognised by filesystem's lookup or
> revalidate method. After the parent is looked up, based on the intent
> the correct "raw" server call is executed, within the file
> system. Then a special flag is set in intent, the caller of parent
> lookup checks for the flag and if it is set, the functions returns
> immediately with the exit code supplied in the intent, without instantiating
> child dentries.
>
> This needs some minor changes to VFS, though. There are at
> least two approaches.
>
> One is to not introduce any new methods and just rely on the filesystem's methods
> to do everything, for this to work filesystem needs to know the
> remaining path to be traversed (we can fill nd->last with remaining
> path before calling into fs). In the root directory of the mount, we
> need to call a revalidate (if supported by fs) on mountpoint to
> intercept the intent, after we crossed mountpoint. We have this
> approach implemented in that attached patch. Does it look better than
> the raw operations?

I'm not sure whether overloading ->d_revalidate or a new method for
that is preferable.

> Much simpler for us is to add additional inode operation
> "process_intent" method that would be called when LOOKUP_PARENT sort
> of lookup was requested and we are about to leave link_path_walk()
> with nameidata structure filled and everything ready. Then the same
> flag in intent will be set and everything else as in previous
> approach.

Yupp, that sounds better.

> Well, how close are we now to this being acceptable?

As already mentioned above, they're completely uninteresting without
actually getting the user in tree _and_ maintained there (unlike e.g.
intermezzo or coda that are creeping along). I think based on those
patches we should be able to properly integrate intermezzo once 2.7 opens.

2004-06-03 14:20:25

by Lars Marowsky-Bree

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On 2004-06-03T14:59:52,
Christoph Hellwig <[email protected]> said:

> > Well, how close are we now to this being acceptable?
> As already mentioned above they're completely uninteresting without
> actually getting the user in tree _and_ maintained there (unlike e.g.
> intermezzo or coda that are creeping along). I think based on those
> patch we should be able to properly integrate intermezzo once 2.7 opens.

This is something I've got to disagree with.

First, Inter-mezzo is reasonably dead, from what I can see. As is Coda.
You'll notice that the developers behind them have sort-of moved on to
Lustre ;-)

The hooks (once cleaned up, no disagreement here, the technical feedback
so far has been very valuable and continues to be) are useful and in
effect needed not just for Lustre, but in principle for all cluster
filesystems, such as (Open)GFS and others, even potentially NFS4 et al.

The logic that _all_ modules and functionality need to be "in the tree"
right from the start for hooks to be useful is flawed, I'm afraid. Pure
horror that a proprietary cluster file system might also profit from it
is not, exactly, a sound technical argument. (I can assure you I don't
care at all for the proprietary cluster-fs.)

Lustre alone would be, roughly, ~10MB more sources, just in the kernel.
I don't think you want to merge that right now, as desirable as it is
on the other hand to be able to use it with a mainstream kernel. I
think this is why kbuild allows external modules to be built; by that
logic it would follow that this should be disabled too.

There certainly is an interest in merging these (cleaned up) extensions
and allowing powerful cluster filesystems to exist on Linux.

Another example of this is the cache invalidation hook which we went
through a few weeks ago too. Back then you complained about not having
an Open Source user (because it was requested by IBM GPFS), and so
GFS/OpenGFS chimed in - now it is the lack of an _in-tree_ Open Source
user...


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

2004-06-03 14:31:15

by Christoph Hellwig

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2


Btw, could you please stop cross-posting to closed lists?

2004-06-03 14:38:09

by Christoph Hellwig

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On Thu, Jun 03, 2004 at 04:19:22PM +0200, Lars Marowsky-Bree wrote:
> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid.

And btw, I didn't say from the beginning. I just want a commitment from
the lustre folks that they're merging it so we can work out the rough
edges together. There's not much of a problem doing the merge spread
over a few kernel releases.

> Lustre alone would be, roughly, ~10MB more sources, just in the kernel.

I think for mainline mostly the client, aka the llite directory, would
be interesting, so a Linux box can simply join the lustre cluster. The
metadata server, and even worse the object storage box mods, would
require tons of work to get anywhere near a mergeable shape, and are
less interesting anyway.

2004-06-03 14:38:12

by Christoph Hellwig

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On Thu, Jun 03, 2004 at 04:19:22PM +0200, Lars Marowsky-Bree wrote:
> On 2004-06-03T14:59:52,
> Christoph Hellwig <[email protected]> said:
>
> > > Well, how close are we now to this being acceptable?
> > As already mentioned above they're completely uninteresting without
> > actually getting the user in tree _and_ maintained there (unlike e.g.
> > intermezzo or coda that are creeping along). I think based on those
> > patch we should be able to properly integrate intermezzo once 2.7 opens.
>
> This is something I've got to disagree with.
>
> First, Inter-mezzo is reasonably dead, from what I can see. As is Coda.
> You'll notice that the developers behind them have sort-of moved on to
> Lustre ;-)

Arggg, sorry. Typo there. It should have of course read

"I think based on those patches we should be able to properly integrate
LUSTRE once 2.7 opens"

.oO(/me looks for a brown paperbag to hide)

> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid. Pure
> horror that a proprietary cluster file system might also profit from it
> is not, exactly, a sound technical argument. (I can assure you I don't
> care at all for the proprietary cluster-fs.)

It's more about maintenance overhead. Maintaining features without the
user directly at hand isn't going anywhere, especially when messing
around deeply in the VFS. By your argument we should also throw in all
the mosix and openssi hooks because they could possibly be useful, no? ;-)

> Another example of this is the cache invalidation hook which we went
> through a few weeks ago too. Back then you complained about not having
> an Open Source user (because it was requested by IBM GPFS), and so
> GFS/OpenGFS chimed in - now it is the lack of an _in-tree_ Open Source
> user...

I was always arguing mostly against the lack of an in-tree user. Lack
of something that we could merge even in the future is even worse.

2004-06-03 14:56:58

by Trond Myklebust

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On Thu, 03/06/2004 at 07:19, Lars Marowsky-Bree wrote:
> The hooks (once cleaned up, no disagreement here, the technical feedback
> so far has been very valuable and continues to be) are useful and in
> effect needed not just for Lustre, but in principle for all cluster
> filesystems, such as (Open)GFS and others, even potentially NFS4 et al.
>
> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid. Pure
> horror that a proprietary cluster file system might also profit from it
> is not, exactly, a sound technical argument. (I can assure you I don't
> care at all for the proprietary cluster-fs.)

Whereas I agree that NFSv4 could use some of this (I'm mainly interested
in the intent_release() stuff in order to fix up an existing race), I
also agree with Christoph on the principle that having in-tree users
right from the start should be the norm rather than the exception.

Otherwise, exactly what is the plan for how to determine when an
interface is obsolete? Are we going to rely on all the out-of-tree
vendors to collectively step up and say "by the way - we're not using
this anymore."?

Cheers,
Trond

2004-06-03 18:10:58

by Jan Harkes

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On Thu, Jun 03, 2004 at 04:19:22PM +0200, Lars Marowsky-Bree wrote:
> First, Inter-mezzo is reasonably dead, from what I can see. As is Coda.
> You'll notice that the developers behind them have sort-of moved on to
> Lustre ;-)

Actually, Coda is not dead; there is still quite a bit of activity. It
just seems slow on the kernel side because we actually have kernel
modules for various operating systems: FreeBSD, NetBSD, Windows 9x,
Windows NT/2000/XP, Solaris, and recently MacOS/Darwin. As a result we
are quite conservative about any significant changes to the
kernel-userspace interface.

Jan

2004-06-04 05:02:38

by Daniel Phillips

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On Thursday 03 June 2004 10:19, Lars Marowsky-Bree wrote:
> The hooks (once cleaned up, no disagreement here, the technical feedback
> so far has been very valuable and continues to be) are useful and in
> effect needed not just for Lustre, but in principle for all cluster
> filesystems, such as (Open)GFS and others, even potentially NFS4 et al.

GFS is now down to needing two trivial patches:

1) export sync_inodes_sb
2) provide a filesystem hook for flock

Since GFS functions well without any of the current batch of proposed vfs
hooks, the word "needed" is not appropriate. Maybe there is something in
here that could benefit GFS, most probably in the intents department, but we
certainly do want to try it first before pronouncing on that. The raw_ops
seem to be entirely irrelevant to GFS, which is peer-to-peer, and so does
not delegate anything to a server. I don't think we have a use for
lookup_last. There are quite possibly some helpful ideas in the dcache
tweaks, but the devil is in the details: again, we need to try it.

Such things as:

+#define DCACHE_LUSTRE_INVALID 0x0020 /* invalidated by Lustre */

clearly fail the "needed not just for Lustre" test.

Looking into my crystal ball, I see many further revisions of this patch
set. Unfortunately, in the latest revision we lost the patch-by-patch
discussion, which seems to have been replaced by a list of issues sorted
by complainant. That's interesting, but it's no substitute.

Regards,

Daniel

2004-06-04 16:58:39

by Anton Blanchard

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2


> 10. "Have these patches undergone any significant test?" by Anton Blanchard:
>
> There are two important questions I think:
> - Do the patches cause damage?
> Probably not anymore. SUSE has done testing and it appears the
> original patch I attached didn't break things (after one fix was
> made).

IBM did a lot of the work on that issue and it took the better part of a
week to find, fix and verify.

Anton

2004-06-07 18:06:57

by Dipankar Sarma

Subject: Re: [PATCH/RFC] Lustre VFS patch, version 2

On Sat, Jun 05, 2004 at 02:55:48AM +1000, Anton Blanchard wrote:
>
> > 10. "Have these patches undergone any significant test?" by Anton Blanchard:
> >
> > There are two important questions I think:
> > - Do the patches cause damage?
> > Probably not anymore. SUSE has done testing and it appears the
> > original patch I attached didn't break things (after one fix was
> > made).
>
> IBM did a lot of the work on that issue and it took the better part of a
> week to find, fix and verify.

AFAIK, Maneesh asked about revalidate_special() returning negative
dentries and the lack of checking for that in path lookup, but got no
reply from the Lustre folks. It is clearly broken. Maneesh has found
more breakage from the Lustre VFS patches now. It would be helpful if
they at least commented on fixes for those.

Thanks
Dipankar