2011-06-01 22:46:20

by Dan Magenheimer

Subject: bug in cleancache ocfs2 hook, anybody want to try cleancache?

As Steven Whitehouse points out in this lkml thread:
https://lkml.org/lkml/2011/5/27/221
there is a bug in the ocfs2 hook to cleancache.
The fix is fairly trivial, but I wonder if anyone
in the ocfs2 community might be interested in trying
out cleancache to author and test the fix?

Currently, the only implementation that benefits from
the sharing functionality is on Xen.

So if you know how to (or are interested in learning
how to) bring up multiple ocfs2 cluster nodes sharing
an ocfs2 filesystem on top of Xen and you are interested
in giving cleancache a spin, please let me know. Else
I will probably push the fix myself.

Dan

P.S. Links to cleancache info:
http://www.phoronix.com/scan.php?page=news_item&px=OTQ5Mw
http://lwn.net/Articles/386090/
http://blogs.oracle.com/wim/entry/another_feature_hit_mainline_linux

---

Thanks... for the memory!
I really could use more / my throughput's on the floor
The balloon is flat / my swap disk's fat / I've OOM's in store
Overcommitted so much
(with apologies to Bob Hope)


2011-06-02 08:44:12

by Steven Whitehouse

Subject: Re: bug in cleancache ocfs2 hook, anybody want to try cleancache?

Hi,

On Wed, 2011-06-01 at 15:45 -0700, Dan Magenheimer wrote:
> As Steven Whitehouse points out in this lkml thread:
> https://lkml.org/lkml/2011/5/27/221
> there is a bug in the ocfs2 hook to cleancache.
> The fix is fairly trivial, but I wonder if anyone
> in the ocfs2 community might be interested in trying
> out cleancache to author and test the fix?
>
> Currently, the only implementation that benefits from
> the sharing functionality is on Xen.
>
> So if you know how to (or are interested in learning
> how to) bring up multiple ocfs2 cluster nodes sharing
> an ocfs2 filesystem on top of Xen and you are interested
> in giving cleancache a spin, please let me know. Else
> I will probably push the fix myself.
>
> Dan
>

Having started looking at the cleancache code in a bit more detail, I
have another question... what is the intended mechanism for selecting a
cleancache backend? The registration code looks like this:

struct cleancache_ops cleancache_register_ops(struct cleancache_ops *ops)
{
	struct cleancache_ops old = cleancache_ops;

	cleancache_ops = *ops;
	cleancache_enabled = 1;
	return old;
}
EXPORT_SYMBOL(cleancache_register_ops);

but I wonder what the intent was here. It looks racy to me, and what
prevents the backend module from unloading while it is in use? Neither
of the two in-tree callers seems to do anything with the returned
structure beyond printing a warning if another backend has already
registered itself. Also why return the structure and not a pointer to
it? The ops structure pointer passed in should also be const I think.

From the code I assume that it is only valid to load the module for a
single cleancache backend at a time, though nothing appears to enforce
that.
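
For illustration only, here is a minimal user-space sketch of the kind of
registration the points above seem to ask for: a const ops pointer, an
atomic "claim the slot or fail" step, and a pointer rather than a structure
copy handed back. The names cc_ops, cc_backend, cc_register_ops and
cc_current_ops are hypothetical and are not the in-tree API. Pinning the
backend module (something like try_module_get() in the kernel) is not
modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, trimmed-down ops table; the real one has more hooks. */
struct cc_ops {
	int (*init_fs)(size_t pagesize);
};

/* Currently registered backend, or NULL if none (assumed single slot). */
static _Atomic(const struct cc_ops *) cc_backend;

/*
 * Register a backend. Returns 0 on success, -EBUSY if another backend
 * already holds the slot. The compare-and-swap makes "check for an
 * existing backend" and "install the new one" a single atomic step.
 */
int cc_register_ops(const struct cc_ops *ops)
{
	const struct cc_ops *expected = NULL;

	assert(ops != NULL);
	if (!atomic_compare_exchange_strong(&cc_backend, &expected, ops))
		return -EBUSY;
	return 0;
}

/* Return the registered backend (or NULL) without copying the structure. */
const struct cc_ops *cc_current_ops(void)
{
	return atomic_load(&cc_backend);
}
```

With this shape a second backend gets an error back instead of silently
replacing the first, which would at least enforce the "one backend at a
time" assumption.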

Also, as regards your earlier question wrt a kvm backend, I may be
tempted to have a go at writing one, but I'd like to figure out what I'm
letting myself in for before making any commitment to that,

Steve.

2011-06-02 18:26:41

by Dan Magenheimer

Subject: RE: bug in cleancache ocfs2 hook, anybody want to try cleancache?

> Having started looking at the cleancache code in a bit more detail, I
> have another question... what is the intended mechanism for selecting a
> cleancache backend? The registration code looks like this:
>
> struct cleancache_ops cleancache_register_ops(struct cleancache_ops
> *ops)
> {
> struct cleancache_ops old = cleancache_ops;
>
> cleancache_ops = *ops;
> cleancache_enabled = 1;
> return old;
> }
> EXPORT_SYMBOL(cleancache_register_ops);
>
> but I wonder what the intent was here. It looks racy to me, and what
> prevents the backend module from unloading while it is in use? Neither
> of the two in-tree callers seems to do anything with the returned
> structure beyond printing a warning if another backend has already
> registered itself. Also why return the structure and not a pointer to
> it? The ops structure pointer passed in should also be const I think.
>
> From the code I assume that it is only valid to load the module for a
> single cleancache backend at a time, though nothing appears to enforce
> that.

Hi Steven --

The intent was to allow backends to be "chained", but this is
not used yet and not really very well thought through yet either
(e.g. possible coherency issues of chaining).
So, yes, currently only one cleancache backend can be loaded
at a time.

There's another initialization issue... if mounts are done
before a backend registers, those mounts are not enabled
for cleancache. As a result, cleancache backends generally
need to be built-in, not loaded separately as a module.
I've had ideas on how to fix this for some time (basically
recording calls to cleancache_init_fs that occur when no
backend is registered, then calling the backend lazily after
registration occurs).
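
A rough user-space sketch of that lazy-registration idea, under stated
assumptions: the fixed-size pending table, the integer fs ids and the names
cc_init_fs, cc_register_backend and demo_init_fs are all hypothetical
stand-ins, not the in-tree interface. A real version would also need
locking and would have to store the returned pool id with each filesystem.

```c
#include <assert.h>
#include <stddef.h>

#define CC_MAX_PENDING 8	/* arbitrary limit for the sketch */

/* Hypothetical backend hook: returns a pool id (>= 0) for a filesystem. */
typedef int (*cc_init_fs_fn)(int fs_id);

static cc_init_fs_fn cc_backend_init_fs;	/* NULL until a backend registers */
static int cc_pending[CC_MAX_PENDING];		/* filesystems mounted too early */
static int cc_npending;

/*
 * Called at mount time. With a backend present the call goes straight
 * through; otherwise the filesystem is recorded and -1 ("no pool yet")
 * is returned.
 */
int cc_init_fs(int fs_id)
{
	if (cc_backend_init_fs)
		return cc_backend_init_fs(fs_id);
	if (cc_npending < CC_MAX_PENDING)
		cc_pending[cc_npending++] = fs_id;
	return -1;
}

/*
 * Called when a backend registers: install the hook, then replay every
 * recorded mount. Returns the number of filesystems replayed.
 */
int cc_register_backend(cc_init_fs_fn init_fs)
{
	int i, replayed = cc_npending;

	assert(init_fs != NULL);
	cc_backend_init_fs = init_fs;
	for (i = 0; i < cc_npending; i++)
		init_fs(cc_pending[i]);	/* pool id dropped here; see note above */
	cc_npending = 0;
	return replayed;
}

/* Stand-in backend used only to exercise the sketch. */
int demo_init_fs(int fs_id)
{
	return fs_id + 100;
}
```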

> Also, as regards your earlier question wrt a kvm backend, I may be
> tempted to have a go at writing one, but I'd like to figure out what I'm
> letting myself in for before making any commitment to that,

I think the hardest part is updating the tmem.c module in zcache
to support multiple "clients". When I ported it from Xen, I tore
all that out. Fortunately, I've put it back in during RAMster
development but those changes haven't yet seen the light of day
(though I can share them offlist).

The next issue is the guest->host interface. Is there the equivalent
of a hypercall in KVM? If so, a shim like drivers/xen/tmem.c is
needed in the guest, and some shim that interfaces the host side
of the hypercall to tmem.c (and presumably zcache).

That may be enough for a proof-of-concept, though Xen has
a bunch of tools and stuff for which KVM would probably want some
equivalent.

If you are at all interested, let's take the details offlist.
It would be great to have a proof-of-concept by KVM Forum!

Thanks,
Dan

2011-06-03 08:42:12

by Steven Whitehouse

Subject: RE: bug in cleancache ocfs2 hook, anybody want to try cleancache?

Hi,

On Thu, 2011-06-02 at 11:26 -0700, Dan Magenheimer wrote:
> > Having started looking at the cleancache code in a bit more detail, I
> > have another question... what is the intended mechanism for selecting a
> > cleancache backend? The registration code looks like this:
> >
> > struct cleancache_ops cleancache_register_ops(struct cleancache_ops
> > *ops)
> > {
> > struct cleancache_ops old = cleancache_ops;
> >
> > cleancache_ops = *ops;
> > cleancache_enabled = 1;
> > return old;
> > }
> > EXPORT_SYMBOL(cleancache_register_ops);
> >
> > but I wonder what the intent was here. It looks racy to me, and what
> > prevents the backend module from unloading while it is in use? Neither
> > of the two in-tree callers seems to do anything with the returned
> > structure beyond printing a warning if another backend has already
> > registered itself. Also why return the structure and not a pointer to
> > it? The ops structure pointer passed in should also be const I think.
> >
> > From the code I assume that it is only valid to load the module for a
> > single cleancache backend at a time, though nothing appears to enforce
> > that.
>
> Hi Steven --
>
> The intent was to allow backends to be "chained", but this is
> not used yet and not really very well thought through yet either
> (e.g. possible coherency issues of chaining).
> So, yes, currently only one cleancache backend can be loaded
> at a time.
>
> There's another initialization issue... if mounts are done
> before a backend registers, those mounts are not enabled
> for cleancache. As a result, cleancache backends generally
> need to be built-in, not loaded separately as a module.
> I've had ideas on how to fix this for some time (basically
> recording calls to cleancache_init_fs that occur when no
> backend is registered, then calling the backend lazily after
> registration occurs).
>
Ok... but if cleancache_init_fs were to take (for example) an argument
specifying the back end to use (I'm thinking here of say a
cleancache=tmem mount argument for filesystems or something similar)
then the backend module could be automatically loaded if required. It
would also allow, by design, multiple backends to be used without
interfering with each other.
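
To make the suggestion concrete, a small user-space sketch of per-mount
backend selection by name follows. Everything in it is an assumption for
illustration: the struct cc_backend table, the two toy init_fs functions and
the returned pool ids are made up, and parsing of the "cleancache=<name>"
mount option itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical named backend; names and pool ids are illustrative only. */
struct cc_backend {
	const char *name;
	int (*init_fs)(int fs_id);
};

static int tmem_init_fs(int fs_id)   { return 1000 + fs_id; }
static int zcache_init_fs(int fs_id) { return 2000 + fs_id; }

static const struct cc_backend cc_backends[] = {
	{ "tmem",   tmem_init_fs },
	{ "zcache", zcache_init_fs },
};

/* Find a backend by the name given in a "cleancache=<name>" mount option. */
const struct cc_backend *cc_find_backend(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(cc_backends) / sizeof(cc_backends[0]); i++)
		if (strcmp(cc_backends[i].name, name) == 0)
			return &cc_backends[i];
	return NULL;	/* a kernel version might try request_module() here */
}

/* Per-mount init: returns a pool id, or -1 if the backend is unknown. */
int cc_init_fs_named(int fs_id, const char *backend_name)
{
	const struct cc_backend *be = cc_find_backend(backend_name);

	return be ? be->init_fs(fs_id) : -1;
}
```

Because each mount names its backend explicitly, two filesystems can use
different backends without either one replacing a global ops structure.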

I don't understand the intent behind chaining of the backends. Did you
mean that pages would migrate from one backend to another down the stack
as each one discards pages and that pages would migrate back up the
stack again when pulled back in from the filesystem? I'm not sure I can
see any application for such a scheme, unless I'm missing something.

I'd like to try and understand the design of the existing code before I
consider anything more advanced such as writing a kvm backend,

Steve.

2011-06-03 15:03:47

by Dan Magenheimer

Subject: RE: bug in cleancache ocfs2 hook, anybody want to try cleancache?

> > There's another initialization issue... if mounts are done
> > before a backend registers, those mounts are not enabled
> > for cleancache. As a result, cleancache backends generally
> > need to be built-in, not loaded separately as a module.
> > I've had ideas on how to fix this for some time (basically
> > recording calls to cleancache_init_fs that occur when no
> > backend is registered, then calling the backend lazily after
> > registration occurs).
> >
> Ok... but if cleancache_init_fs were to take (for example) an argument
> specifying the back end to use (I'm thinking here of say a
> cleancache=tmem mount argument for filesystems or something similar)
> then the backend module could be automatically loaded if required. It
> would also allow, by design, multiple backends to be used without
> interfering with each other.

That's an interesting approach. What use model do you
have in mind for this? I can see a disadvantage of
having one fs use one cleancache backend while another
fs uses another independent cleancache backend: It
might be much more difficult to do accounting and things
like deduplication across multiple backends. Also,
statistically, managing multiple LRU queues (e.g. to
ensure ephemeral pages are evicted in LRU order) is less
efficient than managing a single one. But I may not
understand what you have in mind.

> > The intent was to allow backends to be "chained", but this is
> > not used yet and not really very well thought through yet either
> > (e.g. possible coherency issues of chaining).
> > So, yes, currently only one cleancache backend can be loaded
> > at a time.
> >
> I don't understand the intent behind chaining of the backends. Did you
> mean that pages would migrate from one backend to another down the stack
> as each one discards pages and that pages would migrate back up the
> stack again when pulled back in from the filesystem? I'm not sure I can
> see any application for such a scheme, unless I'm missing something.

Each put can be rejected by a cleancache backend. So I was
thinking that chaining could be used, for example, as follows:

1) zcache registers and discovers that another backend (Xen tmem)
had previously registered, so saves the ops
2) kernel puts lots of pages to cleancache
3) eventually zcache "fills up" and would normally have to reject
the put but...
4) instead zcache attempts to put the page to Xen tmem using
the saved ops
5) if Xen tmem accepts the page, success is returned; if not, zcache
returns failure
6) caveat: once zcache has put a page to Xen tmem,
zcache needs to always "get" to the chained backend if
a local get fails, and must always also flush both places
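
The steps above can be sketched in user-space C roughly as follows. This is
a toy model only: the three-hook struct cc_ops, the integer keys and values,
the tiny fixed-size stores and all function names are assumptions made for
the example, with "outer" standing in for Xen tmem and "local" for zcache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal ops table: put/get return 0 on success, -1 on failure. */
struct cc_ops {
	int (*put)(int key, int val);
	int (*get)(int key, int *val);
	void (*flush)(int key);
};

struct slot { bool used; int key, val; };

/* Tiny fixed-size store shared by both toy backends. */
static int store_put(struct slot *s, int n, int key, int val)
{
	int i, free_idx = -1;

	for (i = 0; i < n; i++) {
		if (s[i].used && s[i].key == key) {
			s[i].val = val;		/* overwrite existing entry */
			return 0;
		}
		if (!s[i].used && free_idx < 0)
			free_idx = i;
	}
	if (free_idx < 0)
		return -1;			/* full: reject the put */
	s[free_idx] = (struct slot){ true, key, val };
	return 0;
}

static int store_get(const struct slot *s, int n, int key, int *val)
{
	int i;

	for (i = 0; i < n; i++)
		if (s[i].used && s[i].key == key) {
			*val = s[i].val;
			return 0;
		}
	return -1;
}

static void store_flush(struct slot *s, int n, int key)
{
	int i;

	for (i = 0; i < n; i++)
		if (s[i].used && s[i].key == key)
			s[i].used = false;
}

/* "Outer" backend, standing in for Xen tmem (registered first). */
static struct slot outer[4];
static int outer_put(int k, int v)  { return store_put(outer, 4, k, v); }
static int outer_get(int k, int *v) { return store_get(outer, 4, k, v); }
static void outer_flush(int k)      { store_flush(outer, 4, k); }
const struct cc_ops outer_ops = { outer_put, outer_get, outer_flush };

/* "Local" backend, standing in for zcache; capacity 2 so it fills quickly. */
static struct slot local[2];
static const struct cc_ops *saved_ops;

/* Step 1: remember the previously registered backend's ops. */
void local_register(const struct cc_ops *prev) { saved_ops = prev; }

int local_put(int k, int v)
{
	if (store_put(local, 2, k, v) == 0)
		return 0;
	/* Steps 3-5: local store is full, so try the chained backend. */
	return saved_ops ? saved_ops->put(k, v) : -1;
}

int local_get(int k, int *v)
{
	assert(v != NULL);
	if (store_get(local, 2, k, v) == 0)
		return 0;
	/* Step 6: a local miss must fall through to the chained backend. */
	return saved_ops ? saved_ops->get(k, v) : -1;
}

void local_flush(int k)
{
	store_flush(local, 2, k);
	if (saved_ops)
		saved_ops->flush(k);		/* step 6: flush both places */
}
```

The sketch does not try to resolve the coherency questions mentioned
earlier, for example a key that ends up stored in both backends.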

I thought I might use this for RAMster (to put/get to a different
physical machine), but instead have hard-coded a modified zcache
version.

> I'd like to try and understand the design of the existing code before I
> consider anything more advanced such as writing a kvm backend,

OK. I'd be happy to answer any questions on the design
at any time.

Thanks,
Dan

2011-06-03 15:17:32

by Dan Magenheimer

Subject: RE: bug in cleancache ocfs2 hook, anybody want to try cleancache?


> > I'd like to try and understand the design of the existing code before I
> > consider anything more advanced such as writing a kvm backend,
>
> OK. I'd be happy to answer any questions on the design
> at any time.

Oops, forgot to add: If you do any experimentation (e.g.
benchmarking), I strongly recommend that you also try
frontswap** at the same time. As of today, frontswap is
now in linux-next.

Dan

** Frontswap is complementary to cleancache. When memory is
plentiful, cleancache essentially stores more clean page
cache pages to avoid disk reads. When memory is scarce and
swapping is imminent or happening, frontswap stores swap
pages in memory to avoid disk writes and reads. Both
zcache and Xen tmem support both, and in both zcache and
Xen tmem, cleancache pages are prioritized lower than
frontswap pages, i.e. the number of cleancache pages is
reduced as frontswap pages increase. Both cleancache
and frontswap share the underlying Transcendent Memory
"ABI".