by Nick Piggin

Subject: Re: [PATCH 36/42] VFS: export drop_pagecache_sb

On Friday 14 December 2007 02:24, Erez Zadok wrote:
> In message <[email protected]>, Nick Piggin writes:
> > On Monday 10 December 2007 13:42, Erez Zadok wrote:
> > > Needed to maintain cache coherency after branch management.
> >
> > Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages
> > before this function gets exported.
> >
> > As it is, it can cause massive latencies on preemption and the inode_lock
> > so it is pretty much debug-only IMO. I'd rather it didn't escape into the
> > wild as is.
> >
> > Either that or rework your cache coherency somehow.
>
> Nick, thanks for the advice.
>
> We use a generation number after each successful branch configuration
> command, so that ->d_revalidate later on can discover that change, and
> rebuild the union of objects. At ->remount time, I figured it'd be nice to
> "encourage" that revalidation to happen sooner, by invalidating as many
> upper pages as possible, thus causing ->d_revalidate/->readpage to take
> place sooner. So we used to call drop_pagecache_sb from our remount code:
> it was the only caller of drop_pagecache_sb. It wasn't too much of a
> latency issue to call drop_pagecache_sb there: the VFS remount code path is
> already pretty slow (dropping temporarily to readonly mode, and dropping
> other caches), and remount isn't an operation used often, so a little bit
> more latency would probably not have been noticed by users.

Well, a large, infrequent spike is the most damaging kind for
latency-sensitive users. And anyway, I guess the infrequency of remount
means it doesn't have to be really efficient at invalidating pagecache either.
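
The generation-number scheme quoted above can be sketched in userspace C
roughly as follows. This is only a model under stated assumptions: struct
sb_model, struct obj_model, branch_config_changed() and revalidate() are
hypothetical names, not the unionfs structures or the real ->d_revalidate
signature.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-superblock state (not the unionfs struct). */
struct sb_model {
    atomic_uint generation;   /* bumped after each successful branch change */
};

/* Hypothetical stand-in for a cached upper object (dentry/inode). */
struct obj_model {
    unsigned int generation;  /* superblock generation seen at last rebuild */
    unsigned int rebuilds;    /* how many times the union was rebuilt */
};

static void sb_model_init(struct sb_model *sb)
{
    atomic_init(&sb->generation, 1);
}

/* Called after a successful branch configuration command. */
static void branch_config_changed(struct sb_model *sb)
{
    atomic_fetch_add(&sb->generation, 1);
}

/*
 * Models the check a ->d_revalidate-style hook would perform.
 * Returns true if the object was already current, false if it was
 * stale and had to be rebuilt.
 */
static bool revalidate(struct sb_model *sb, struct obj_model *obj)
{
    unsigned int now = atomic_load(&sb->generation);

    if (obj->generation == now)
        return true;

    obj->generation = now;    /* stand-in for rebuilding the union */
    obj->rebuilds++;
    return false;
}
```

Nothing here touches the page cache: an object is rebuilt lazily the next
time it is revalidated, which is why the drop_pagecache_sb call could be
treated as an optional optimization.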

> Nevertheless, it was not strictly necessary to call drop_pagecache_sb in
> unionfs_remount, because the objects in question will have gotten
> revalidated sooner or later anyway; the call to drop_pagecache_sb was just
> an optimization (one which I wasn't 100% sure about anyway, as per my long
> "XXX" comment above that call in unionfs_remount).
>
> So I agree with you: if this symbol can be abused by modules and cause
> problems, then exporting it to modules is too risky. I've reworked my code
> to avoid calling drop_pagecache_sb and I'll drop that patch.

Thanks, I'd be much happier with that.

2007-12-14 21:16:47

by J. R. Okajima

Subject: Re: [UNIONFS] 00/42 Unionfs and related patches review

Hello Professor Zadok,

Erez Zadok:
> I believe that small VFS changes to help stackable file systems are
> perfectly reasonable, and a good thing; and I'm working on such patches.
> Conversely, I am very mindful of the VFS's complexity, so I also believe
> that massive VFS changes are a bad thing; I want to avoid bloating the VFS
> or de-stabilizing it just for the sake of stacking or any single stackable
> f/s. I am also concerned about not changing existing "lower" file systems
> whatsoever, because they are well tested and stable.

I have no objection against your opinion about massive VFS changes or
existing "lower" filesystems.

> from). So in my opinion, the chances are very slim that a large amount of
> data changes will happen on a lower inode all within one second and not be
> detected by our mtime/ctime cache-coherency algorithms.

I agree that time-based checking works in many cases.
But there will be some operations which complete within one
second, and the check may not work when a user changes the clock/time
of his system.
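
The one-second concern can be illustrated with a small userspace model.
It assumes timestamps are compared at one-second granularity, as in the
quoted text. The change_count field is a hypothetical counter added only
for comparison; it is not something either filesystem is stated to have.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of a lower inode. */
struct lower_model {
    time_t mtime;                /* seconds, as compared by the upper layer */
    unsigned long change_count;  /* hypothetical: bumped on every write */
};

/* Hypothetical model of what the upper layer cached at last sync. */
struct upper_model {
    time_t cached_mtime;
    unsigned long cached_count;
};

/* A write to the lower object at wall-clock second 'now'. */
static void lower_write(struct lower_model *lo, time_t now)
{
    lo->mtime = now;
    lo->change_count++;
}

/* The upper layer copies the lower attributes it will compare later. */
static void upper_sync(struct upper_model *up, const struct lower_model *lo)
{
    up->cached_mtime = lo->mtime;
    up->cached_count = lo->change_count;
}

/* The single "if" on mtimes. */
static bool stale_by_mtime(const struct upper_model *up,
                           const struct lower_model *lo)
{
    return up->cached_mtime != lo->mtime;
}

/* The same check on the hypothetical counter. */
static bool stale_by_count(const struct upper_model *up,
                           const struct lower_model *lo)
{
    return up->cached_count != lo->change_count;
}
```

If the upper layer syncs at second 100 and the lower object is written
again within second 100, stale_by_mtime() still reports "in sync" while
stale_by_count() reports the change. Setting the system clock backwards
can produce the same mtime collision.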

> Also, time-based cache coherency is a time-honored technique in NFS.
> Users have gotten used to the fact that if they change something on the
> server (i.e., the "layer" below the client), that those changes may not be
> immediately visible on the client (esp. with heavy caching on the client).
> So if it's been good enough for NFS for over two decades, I don't see a
> compelling reason to complicate unionfs for a slim chance of detecting
> changes that occur within one second.

Since NFS is a remote filesystem, I don't think it is a good idea to
compare its behaviour with that of a stackable filesystem, since all
lower (branch) filesystems can be local filesystems.

> Right now my code to detect when a lower object has changed is very simple
> and short: just one "if" statement to compare the corresponding inode
> mtimes. I'd like to keep things simple if at all possible. Fundamentally,
> all I need is ONE simple bit of information that will tell me that the upper
> and lower inodes are no longer in sync. Just one bit, not a whole complex
> data structure with callbacks and bit-maps and such.

Agreed, so the inotify handler should set a flag or atomic_inc/dec
the internal generation, or enqueue such a job and handle it
later (shortly). Of course, when the dentry/inode of the stackable
filesystem corresponding to the modified file is not cached, the handler
has nothing to do.
Additionally, inotify needs to be set only on directories for monitoring,
not on all files. The inotify handler for a directory receives all
events necessary (for a stackable fs) for its children.
(But there are a few limitations or exceptions.)
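
As a rough userspace sketch of the handler described above: a watch on a
directory receives an event naming a child, and the handler marks the
cached upper object stale only if one exists. All names here (struct
cached_obj, struct watched_dir, on_child_event) are hypothetical; this
does not use the kernel inotify interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CACHED 8

/* Hypothetical cached upper object for one child of a watched directory. */
struct cached_obj {
    const char *name;
    atomic_bool stale;   /* the "one bit": upper and lower out of sync */
};

/* Hypothetical per-directory watch state: only directories are watched. */
struct watched_dir {
    struct cached_obj *children[MAX_CACHED];
    size_t nr_children;
};

/*
 * Handler for an event on a child of a watched directory.
 * Returns true if a cached object was found and marked stale,
 * false if nothing was cached and there was nothing to do.
 */
static bool on_child_event(struct watched_dir *dir, const char *child_name)
{
    for (size_t i = 0; i < dir->nr_children; i++) {
        struct cached_obj *obj = dir->children[i];

        if (strcmp(obj->name, child_name) == 0) {
            atomic_store(&obj->stale, true);
            return true;
        }
    }
    return false;
}
```

The handler does no rebuilding itself; it only records that the cached
object must be revalidated later.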

> What you propose violates the clean layer separation in a way that I'm not
> too comfortable with (even if it works for you :-) I believe stackable file
:::
> system, each struct file/dentry/inode has a corresponding lower object. Our
> FiST templates for Linux include an extra mode---called "fist lite"---which
> saves on inodes and pages by having BOTH the upper and lower dentry point to
> the lower inode. This saves memory (shared pages) and reduces layering
> overhead (but you can't intercept mmap ops, which some stackable f/s like to
> do). The cost of such trick is violating the clean layering separation: a
> dentry of the upper file system now points to an inode (via dentry->d_inode)
> of the lower file system! To me, this is dangerous in the long run because
> objects from one layer can be "leaked" to another layer, with potentially
> disastrous results.

Currently, I don't think sharing pages is any kind of
violation. Additionally, the dentry of the upper file system does NOT
point to the inode of the lower file system. Of course it can implement
the ->mmap operation.

> What you propose above with vm_operations requires a sequence of operations
> in the vm->fault operation which looks like:
>
> saved_file = vma->vm_file;
> vma->vm_file = hidden_file;
> call the lower ->fault op passing it the modified vma
> vma->vm_file = saved_file;

Basically, yes.
But there are several things to do such as locking.
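
The quoted sequence, with the swap done under a lock, might look like the
following userspace model. The types, the per-vma mutex and the function
names are all hypothetical; which lock would be appropriate in the kernel
is not stated in this thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct file and struct vm_area_struct. */
struct file_model {
    const char *name;
};

struct vma_model {
    pthread_mutex_t lock;        /* hypothetical lock guarding vm_file */
    struct file_model *vm_file;
};

typedef int (*fault_fn)(struct vma_model *vma);

/* Records which file the lower handler saw (for illustration only). */
static const struct file_model *seen_by_lower;

/* Hypothetical lower-layer fault handler. */
static int record_fault(struct vma_model *vma)
{
    seen_by_lower = vma->vm_file;
    return 0;
}

/*
 * Upper-layer fault handler: temporarily points the vma at the hidden
 * (lower) file, calls the lower handler, then restores the original.
 * The lock keeps other users of this vma from observing the temporary
 * value while it is in place.
 */
static int upper_fault(struct vma_model *vma, struct file_model *hidden_file,
                       fault_fn lower_fault)
{
    struct file_model *saved_file;
    int ret;

    pthread_mutex_lock(&vma->lock);
    saved_file = vma->vm_file;
    vma->vm_file = hidden_file;
    ret = lower_fault(vma);
    vma->vm_file = saved_file;   /* restored on every path */
    pthread_mutex_unlock(&vma->lock);

    return ret;
}
```

The model only protects readers that take the same lock; any path that
reads vm_file without it could still see the temporary value.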

> Even if both of these techniques work (and they do, at least in limited
> testing I've done), there is something very unpleasant about having to
> temporarily override a field's value, then fix it again, after coming back
> from calling the lower op. Aside from the uncleanliness of this kind of
> technique, it can seriously lead to races and other data corruptions when/if
> the temporarily-fixed fields "leak" outside the current code. (I have a
> strong feeling that several kernel developers might vomit in disgust if I
> dared to submit such hacky patches to unionfs... :-)

I guess you probably forgot some locking.

> To me, any time such a hack has to be employed, it tells me that there's
> something wrong with the API in question. And so I'd much rather see the
> API fixed The Right Way[tm], than promote such potentially unsafe practices.

If you changed some important members of internal structures without
locking, it would be unsafe and violate something.

Finally, I think the approach of sharing pages (you may call it
zero-copy, in contrast to your approach) is safe. At least, this approach
has been working for over a year while several people have been using it.
Of course, I would never say it is bug-free. There may be a problem which
I simply don't know about yet.

Sincerely,
Junjiro Okajima