2011-05-27 13:50:09

by Steven Whitehouse

Subject: Cleancache and shared filesystems

Hi,

I'm trying to figure out what I would need to do in order to get GFS2 to
work with cleancache. Looking at the OCFS2 implementation leaves me with
some questions about how this is supposed to work. The docs say that the
cleancache_init_shared_fs() function is supposed to take a 128 bit UUID
plus the sb.

In OCFS2 it is passed a pointer to a 32 bit little endian quantity as
the UUID:

__le32 uuid_net_key;

...

memcpy(&uuid_net_key, di->id2.i_super.s_uuid, sizeof(uuid_net_key));

...

cleancache_init_shared_fs((char *)&uuid_net_key, sb);

and in the Xen backend driver this then appears to be dereferenced as if
it were two 64 bit values, which doesn't look right to me.
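
To illustrate what I mean (this is just my reading of the backend, not
its actual code), the pointer passed in seems to end up being treated
roughly like this:

	/* Simplified illustration only: the backend appears to read two
	 * 64 bit words from a pointer that, above, covers only a 4 byte
	 * __le32 (uuid_net_key). */
	u64 uuid_lo = ((u64 *)&uuid_net_key)[0]; /* reads 8 bytes from a 4 byte object */
	u64 uuid_hi = ((u64 *)&uuid_net_key)[1]; /* reads a further 8 bytes beyond it */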

Also, since the sb has a UUID field in it anyway, is there some reason
why that cannot be used directly rather than passing the uuid as a
separate variable?

Steve.


2011-05-27 15:32:05

by Dan Magenheimer

Subject: RE: Cleancache and shared filesystems

> From: Steven Whitehouse [mailto:[email protected]]
> Sent: Friday, May 27, 2011 7:52 AM
> Cc: [email protected]
> Subject: Cleancache and shared filesystems
>
> Hi,
>
> I'm trying to figure out what I would need to do in order to get GFS2
> to
> work with cleancache. Looking at the OCFS2 implementation leaves me
> with
> some questions about how this is supposed to work. The docs say that
> the
> cleancache_init_shared_fs() function is supposed to take a 128 bit UUID
> plus the sb.
>
> In OCFS2 it is passed a pointer to a 32 bit little endian quantity as
> the UUID:
>
> __le32 uuid_net_key;
>
> ...
>
> memcpy(&uuid_net_key, di->id2.i_super.s_uuid, sizeof(uuid_net_key));
>
> ...
>
> cleancache_init_shared_fs((char *)&uuid_net_key, sb);
>
> and in the Xen backend driver this then appears to be dereferenced as
> if it were two 64 bit values, which doesn't look right to me.

Hi Steve --

Thanks for looking at cleancache!

Hmmm... yes... it looks like you are right and the cleancache
interface code for ocfs2 has bit-rotted over time and a bad value
is being used for the uuid. This would result in pages not being
shared between clustered VMs on the same physical machine, but
would only result in a lower performance advantage, not any
other visible effects, which would explain why it has gone
undiscovered for so long.

Thanks for pointing this out as it is definitely a bug!
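
For illustration only (untested), the kind of fix implied would be to
hand cleancache the full 16-byte on-disk UUID rather than the truncated
copy, something like:

	/* Sketch of a possible fix, assuming di->id2.i_super.s_uuid is
	 * the complete 128-bit volume UUID: */
	cleancache_init_shared_fs((char *)di->id2.i_super.s_uuid, sb);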

> Also, since the sb has a UUID field in it anyway, is there some reason
> why that cannot be used directly rather than passing the uuid as a
> separate variable?

Forgive me but I am not a clustered FS expert (even for ocfs2):
If the UUID field in the sb is always 128-bits and is always
identical for all cluster nodes, and this fact is consistent
across all clustered FS's at mount time, I agree that there is
no need to pass the uuid as a parameter in
cleancache_init_shared_fs as it can be derived in the body of
cleancache_init_shared_fs and then passed to
__cleancache_init_shared_fs (which only cares that it gets
128-bits and probably has no business digging through a
superblock). OTOH, this call is made only once per mount
so there's no significant advantage in changing this... or am
I missing something?
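
Something like this hypothetical wrapper is what I mean (sketch only,
assuming sb->s_uuid always holds the 128-bit cluster-wide UUID at mount
time):

	/* Hypothetical alternative signature: derive the UUID inside the
	 * wrapper so callers need not pass it. */
	void cleancache_init_shared_fs(struct super_block *sb)
	{
		__cleancache_init_shared_fs((char *)sb->s_uuid, sb);
	}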

Thanks,
Dan

====

Thanks... for the memory!
I really could use more / my throughput's on the floor
The balloon is flat / my swap disk's fat / I've OOM's in store
Overcommitted so much
(with apologies to Bob Hope)

2011-05-27 16:18:08

by Steven Whitehouse

Subject: RE: Cleancache and shared filesystems

Hi,

On Fri, 2011-05-27 at 08:31 -0700, Dan Magenheimer wrote:
> > From: Steven Whitehouse [mailto:[email protected]]
> > Sent: Friday, May 27, 2011 7:52 AM
> > Cc: [email protected]
> > Subject: Cleancache and shared filesystems
> >
> > Hi,
> >
> > I'm trying to figure out what I would need to do in order to get GFS2
> > to
> > work with cleancache. Looking at the OCFS2 implementation leaves me
> > with
> > some questions about how this is supposed to work. The docs say that
> > the
> > cleancache_init_shared_fs() function is supposed to take a 128 bit UUID
> > plus the sb.
> >
> > In OCFS2 it is passed a pointer to a 32 bit little endian quantity as
> > the UUID:
> >
> > __le32 uuid_net_key;
> >
> > ...
> >
> > memcpy(&uuid_net_key, di->id2.i_super.s_uuid, sizeof(uuid_net_key));
> >
> > ...
> >
> > cleancache_init_shared_fs((char *)&uuid_net_key, sb);
> >
> > and in the Xen backend driver this then appears to be dereferenced as
> > if it were two 64 bit values, which doesn't look right to me.
>
> Hi Steve --
>
> Thanks for looking at cleancache!
>
> Hmmm... yes... it looks like you are right and the cleancache
> interface code for ocfs2 has bit-rotted over time and a bad value
> is being used for the uuid. This would result in pages not being
> shared between clustered VMs on the same physical machine, but
> would only result in a lower performance advantage, not any
> other visible effects, which would explain why it has gone
> undiscovered for so long.
>
> Thanks for pointing this out as it is definitely a bug!
>
Ok, thanks for confirming.

> > Also, since the sb has a UUID field in it anyway, is there some reason
> > why that cannot be used directly rather than passing the uuid as a
> > separate variable?
>
> Forgive me but I am not a clustered FS expert (even for ocfs2):
> If the UUID field in the sb is always 128-bits and is always
> identical for all cluster nodes, and this fact is consistent
> across all clustered FS's at mount time, I agree that there is
> no need to pass the uuid as a parameter in
> cleancache_init_shared_fs as it can be derived in the body of
> cleancache_init_shared_fs and then passed to
> __cleancache_init_shared_fs (which only cares that it gets
> 128-bits and probably has no business digging through a
> superblock). OTOH, this call is made only once per mount
> so there's no significant advantage in changing this... or am
> I missing something?
>
> Thanks,
> Dan
>
The point was really just a question to see if I'd understood what was
intended at this point of the code. It might be cleaner though to
introduce a sb flag to say whether a particular fs is shared or not as a
generic feature. Then the same init function could be used for both
shared and non-shared filesystems presumably?
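
Purely to illustrate that idea (the flag name is invented), something
along these lines:

	/* Hypothetical sketch: a generic "this sb is cluster-shared" flag
	 * (MS_SHARED_FS is an invented name) would let one init routine
	 * cover both cases. */
	void cleancache_init_fs(struct super_block *sb)
	{
		if (sb->s_flags & MS_SHARED_FS)
			__cleancache_init_shared_fs((char *)sb->s_uuid, sb);
		else
			__cleancache_init_fs(sb);
	}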

The way that GFS2 has worked in terms of unique filesystem IDs is to
have a filesystem "name", which is a combination of a cluster name and a
filesystem-specific part separated by a colon. This has been used as the
unique ID for any particular filesystem, and it is also the volume label
for the filesystem.

In the early GFS2 code we also introduced a UUID, so that should likewise
be a unique ID across the cluster. That does mean it is possible (though
very unlikely) that there are GFS2 filesystems with a zero UUID in
existence. That is easily fixable with tunegfs2, though.

So I think that the UUID is ok for this particular purpose, but if it
were possible to use the filesystem "name" instead that would be more
consistent with the rest of GFS2. I don't think it's a big issue though.

I suspect that for GFS2 we'd need a patch looking something like this
(untested) based on what I think is the case so far:

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 8ac9ae1..e807850 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -18,6 +18,7 @@
#include <linux/mount.h>
#include <linux/gfs2_ondisk.h>
#include <linux/quotaops.h>
+#include <linux/cleancache.h>

#include "gfs2.h"
#include "incore.h"
@@ -1187,6 +1188,12 @@ static int fill_super(struct super_block *sb, struct gfs2_args *args, int silent

gfs2_glock_dq_uninit(&mount_gh);
gfs2_online_uevent(sdp);
+	if (ls->ls_ops == &gfs2_dlm_ops) {
+		if (gfs2_uuid_valid(sb->s_uuid))
+			cleancache_init_shared_fs(sb->s_uuid, sb);
+	} else {
+		cleancache_init_fs(sb);
+	}
return 0;

fail_threads:


I would also be interested to know if there are any plans for a KVM
backend for cleancache,

Steve.

2011-05-27 16:31:44

by Dan Magenheimer

Subject: RE: Cleancache and shared filesystems

> > Forgive me but I am not a clustered FS expert (even for ocfs2):
> > If the UUID field in the sb is always 128-bits and is always
> > identical for all cluster nodes, and this fact is consistent
> > across all clustered FS's at mount time, I agree that there is
> > no need to pass the uuid as a parameter in
> > cleancache_init_shared_fs as it can be derived in the body of
> > cleancache_init_shared_fs and then passed to
> > __cleancache_init_shared_fs (which only cares that it gets
> > 128-bits and probably has no business digging through a
> > superblock). OTOH, this call is made only once per mount
> > so there's no significant advantage in changing this... or am
> > I missing something?
> >
> The point was really just a question to see if I'd understood what was
> intended at this point of the code. It might be cleaner though to
> introduce a sb flag to say whether a particular fs is shared or not as
> a
> generic feature. Then the same init function could be used for both
> shared and non-shared filesystems presumably?

True. I believe I had just one function long ago but an early reviewer
insisted that:

func_does_this()
func_does_this_and_also_X()

was more proper in the kernel than

func_does_this(parameter_selecting_X_or_notX)
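
In code terms the two entry points could of course still just funnel
into one internal helper (hypothetical helper name, sketch only):

	void cleancache_init_fs(struct super_block *sb)
	{
		__cleancache_init_common(NULL, sb);	/* NULL: not shared */
	}

	void cleancache_init_shared_fs(char *uuid, struct super_block *sb)
	{
		__cleancache_init_common(uuid, sb);	/* shared across cluster nodes */
	}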

> The way that GFS2 has worked in terms of unique filesystem IDs is to
> have a filesystem "name", which is a combination of a cluster name and
> a filesystem-specific part separated by a colon. This has been used as
> the unique ID for any particular filesystem, and it is also the volume
> label for the filesystem.
>
> In the early GFS2 code we also introduced a UUID, so that should
> likewise be a unique ID across the cluster. That does mean it is
> possible (though very unlikely) that there are GFS2 filesystems with a
> zero UUID in existence. That is easily fixable with tunegfs2, though.
>
> So I think that the UUID is ok for this particular purpose, but if it
> were possible to use the filesystem "name" instead that would be more
> consistent with the rest of GFS2. I don't think it's a big issue though.
>
> I suspect that for GFS2 we'd need a patch looking something like this
> (untested) based on what I think is the case so far:
>
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 8ac9ae1..e807850 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -18,6 +18,7 @@
> #include <linux/mount.h>
> #include <linux/gfs2_ondisk.h>
> #include <linux/quotaops.h>
> +#include <linux/cleancache.h>
>
> #include "gfs2.h"
> #include "incore.h"
> @@ -1187,6 +1188,12 @@ static int fill_super(struct super_block *sb,
> struct gfs2_args *args, int silent
>
> gfs2_glock_dq_uninit(&mount_gh);
> gfs2_online_uevent(sdp);
> + if (ls->ls_ops == &gfs2_dlm_ops) {
> + if (gfs2_uuid_valid(sb->s_uuid))
> + cleancache_init_shared_fs(sb->s_uuid, sb);

should this be &sb->s_uuid[0]? (or I guess it is the same
thing since it is an array).

> + } else {
> + cleancache_init_fs(sb);
> + }
> return 0;
>
> fail_threads:
>
>
> I would also be interested to know if there are any plans for a KVM
> backend for cleancache,

I've had a couple of inquiries and have actually done some
work (not released yet) on a multiple-physical-machine user
of cleancache (called RAMster) that extends tmem.c in
zcache to support multiple clients, which would make a
KVM implementation fairly easy. But AFAIK, nobody has
started developing a KVM backend... are you interested?

Dan

2011-05-27 23:33:39

by Joel Becker

Subject: Re: Cleancache and shared filesystems

On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> + if (ls->ls_ops == &gfs2_dlm_ops) {
> + if (gfs2_uuid_valid(sb->s_uuid))
> + cleancache_init_shared_fs(sb->s_uuid, sb);
> + } else {
> + cleancache_init_fs(sb);
> + }

Hey Dan,
Steven makes a good point here. ocfs2 could also take advantage
of local filesystem behavior when running in local mode.

Joel

--

Life's Little Instruction Book #69

"Whistle"

http://www.jlbec.org/
[email protected]

2011-05-31 08:56:53

by Steven Whitehouse

Subject: Re: Cleancache and shared filesystems

Hi,

On Fri, 2011-05-27 at 16:33 -0700, Joel Becker wrote:
> On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> > + if (ls->ls_ops == &gfs2_dlm_ops) {
> > + if (gfs2_uuid_valid(sb->s_uuid))
> > + cleancache_init_shared_fs(sb->s_uuid, sb);
> > + } else {
> > + cleancache_init_fs(sb);
> > + }
>
> Hey Dan,
> Steven makes a good point here. ocfs2 could also take advantage
> of local filesystem behavior when running in local mode.
>
> Joel
>

There is a further issue as well - cleancache will only work when all
nodes can see the same shared cache, so we will need a mount option to
disable cleancache when we have (for example) a cluster of virtual
machines split over multiple physical hosts.
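
As a rough illustration of what I have in mind (the option wiring and
field name are invented), fill_super() could gate the call on an
explicit mount option:

	/* Hypothetical sketch only: args->ar_cleancache is an invented
	 * field set by a new mount option. */
	if (args->ar_cleancache && gfs2_uuid_valid(sb->s_uuid))
		cleancache_init_shared_fs(sb->s_uuid, sb);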

In fact, I think from the principle of least surprise this had better
default to off and be enabled explicitly. Otherwise people will very
easily shoot themselves in the foot, since there is no automatic way
that I can see to verify that all nodes are looking at the same cache,

Steve.

2011-05-31 14:52:12

by Dan Magenheimer

Subject: RE: Cleancache and shared filesystems

> From: Joel Becker [mailto:[email protected]]
> Sent: Friday, May 27, 2011 5:34 PM
> To: Steven Whitehouse
> Cc: Dan Magenheimer; [email protected]; Sunil Mushran
> Subject: Re: Cleancache and shared filesystems
>
> On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> > + if (ls->ls_ops == &gfs2_dlm_ops) {
> > + if (gfs2_uuid_valid(sb->s_uuid))
> > + cleancache_init_shared_fs(sb->s_uuid, sb);
> > + } else {
> > + cleancache_init_fs(sb);
> > + }
>
> Hey Dan,
> Steven makes a good point here. ocfs2 could also take advantage
> of local filesystem behavior when running in local mode.

Hi Joel --

I guess the semantics need to be more clearly defined
(or perhaps changed if the shared-fs community wants),
but if cleancache_init_shared_fs is called only by
a single node, cleancache still enables all the same
functionality** as cleancache_init_fs.

I'm not sure I fully understand the semantics of
local mode though, so please clarify if you think
I am wrong or misunderstanding your point.

Thanks,
Dan

---
Thanks... for the memory!
I really could use more / my throughput's on the floor
The balloon is flat / my swap disk's fat / I've OOM's in store
Overcommitted so much
(with apologies to Bob Hope)

2011-05-31 15:13:48

by Dan Magenheimer

Subject: RE: Cleancache and shared filesystems

> From: Steven Whitehouse [mailto:[email protected]]
> Sent: Tuesday, May 31, 2011 2:58 AM
> To: Joel Becker
> Cc: Dan Magenheimer; [email protected]; Sunil Mushran
> Subject: Re: Cleancache and shared filesystems
>
> Hi,
>
> On Fri, 2011-05-27 at 16:33 -0700, Joel Becker wrote:
> > On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> > > + if (ls->ls_ops == &gfs2_dlm_ops) {
> > > + if (gfs2_uuid_valid(sb->s_uuid))
> > > + cleancache_init_shared_fs(sb->s_uuid, sb);
> > > + } else {
> > > + cleancache_init_fs(sb);
> > > + }
> >
> > Hey Dan,
> > Steven makes a good point here. ocfs2 could also take advantage
> > of local filesystem behavior when running in local mode.
> >
> > Joel
> >
>
> There is a further issue as well - cleancache will only work when all
> nodes can see the same shared cache, so we will need a mount option to
> disable cleancache in the case we have (for example) a cluster of
> virtual machines split over multiple physical hosts.
>
> In fact, I think from the principle of least surprise this had better
> default to off and be enabled explicitly. Otherwise I can see that
> people will shoot themselves in the foot which will be very easy since
> there is no automatic way that I can see to verify that all nodes are
> looking at the same cache,

Though it's been nearly two years now since I thought
through this, I remember being concerned about that issue
too. But, for ocfs2 at least, the cleancache hooks are embedded
in all the right places in the VFS, so the ocfs2 code that
cross-invalidates stale page cache pages on different
nodes also keeps cleancache coherent, and it
all just worked, whether VMs are split across hosts or not.

This may or may not be true for GFS2... for example, btrfs
required one cleancache hook outside of VFS to function
correctly.

Again, I am pretty ignorant about shared filesystems
so please correct me if I am missing anything important.

Also, I checked and Xen tmem (the only current user of
cleancache for which cluster-sharing makes sense) uses
an all-ones 128-bit value (-1) as its internal "don't share"
indicator. So you are correct that multiple non-shared VMs
using uuid==0 could potentially cause data corruption
if they share a physical machine, so your code snippet
above is needed (assuming gfs2_uuid_valid returns
false for uuid==0?).
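
(For reference, my assumption is that gfs2_uuid_valid() treats an
all-zero UUID as invalid, roughly along these lines -- a sketch, not
necessarily the exact GFS2 code:)

	static int gfs2_uuid_valid(const u8 *uuid)
	{
		int x;

		/* Any non-zero byte makes the UUID valid; all zeroes is
		 * treated as "no UUID". */
		for (x = 0; x < 16; x++)
			if (uuid[x])
				return 1;
		return 0;
	}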


Dan
---
Thanks... for the memory!
I really could use more / my throughput's on the floor
The balloon is flat / my swap disk's fat / I've OOM's in store
Overcommitted so much
(with apologies to Bob Hope)

2011-05-31 21:54:51

by Joel Becker

Subject: Re: Cleancache and shared filesystems

On Tue, May 31, 2011 at 07:51:36AM -0700, Dan Magenheimer wrote:
> > Hey Dan,
> > Steven makes a good point here. ocfs2 could also take advantage
> > of local filesystem behavior when running in local mode.
>
> I guess the semantics need to be more clearly defined
> (or perhaps changed if the shared-fs community wants),
> but if cleancache_init_shared_fs is called only by
> a single node, cleancache still enables all the same
> functionality** as cleancache_init_fs.

I don't see the ** reference in the footnotes ;-) You're saying
that, for a single caller, you will properly keep the pagecache bits
around in cleancache as you shrink the balloon and give them back when
requested? So an ocfs2 calling cleancache_init_shared_fs() in only one
VM will have the same page lifetimes (including life inside cleancache
but not in guest pagecache) as a similar ext3? If so, there are no
changes needed at all.

> I'm not sure I fully understand the semantics of
> local mode though, so please clarify if you think
> I am wrong or misunderstanding your point.

ocfs2 local mode means that it is not a cluster filesystem. The
cluster services are not enabled, and ocfs2 behaves like xfs/extN/btrfs.

Joel

--

"Lately I've been talking in my sleep.
Can't imagine what I'd have to say.
Except my world will be right
When love comes back my way."

http://www.jlbec.org/
[email protected]