2005-04-25 15:08:13

by David Teigland

Subject: [PATCH 0/7] dlm: overview


Hi,

This is a distributed lock manager (dlm) that we'd like to see added to
the kernel. The dlm programming api is very similar to that found on
other operating systems, but this is modeled most closely after that in
VMS.

Many distributed/cluster applications use a dlm for inter-process
synchronization where processes may live on different machines. (GFS and
CLVM are two examples that our group also develops; GFS uses the dlm from
the kernel and CLVM uses it from userspace.)

We've done a lot of work in this second version to meet the kernel's
conventions. Comments and suggestions are welcome; we're happy to answer
questions and make changes so this can be a widely useful feature for
people running Linux clusters.

The dlm requires configuration from userspace. In particular, it needs to
be told which other nodes it should work with, and provided with their IP
addresses. In a typical setup, a cluster-membership system in userspace
would do this configuration (the dlm is agnostic about what system that
is). A command line tool can also be used to configure it manually.
(It's helpful to compare this with device-mapper where dmsetup/LVM2/EVMS
are all valid userland alternatives for driving device-mapper.)

Features _not_ in this patchset that can be added incrementally:
. hierarchical parent/child locks
. lock querying interface
. deadlock handling, lock timeouts
. configurable method for master selection (e.g. static master
assignment with no resource directory)

Background: we began developing this dlm from the ground up about 3 years
ago at Sistina. It became GPL after the Red Hat acquisition.

The following patches are against 2.6.12-rc3 and do not touch any existing
files in the kernel.


2005-04-25 20:42:48

by Wim Coekaerts

Subject: Re: [PATCH 0/7] dlm: overview

> This is a distributed lock manager (dlm) that we'd like to see added to
> the kernel. The dlm programming api is very similar to that found on
> other operating systems, but this is modeled most closely after that in
> VMS.

Do you have any performance data at all on this? I'd like to see a dlm,
but I'd like to see something that will also perform well. My main concern
is that I have not seen anything relying on this code do "reasonably
well". E.g. can you show gfs numbers w/ number of nodes and scalability?

I think it's time we submit ocfs2 w/ its cluster stack so that folks
can compare (including actual data/numbers). We have been waiting to
stabilize everything, but I guess there is this preemptive strike going
on, so we might just as well. At least we have had hch and folks comment
before submitting the code.

Andrew - we will submit ocfs2 so you can have a look, compare and move
on. We will work with any stack that eventually gets accepted; we just want
to see the choice out there and an educated decision.

hopefully tomorrow, including data comparing single node and multinode
performance.

Wim

2005-04-25 20:52:49

by Daniel Phillips

Subject: Re: [PATCH 0/7] dlm: overview

Hi Dave,

On Monday 25 April 2005 11:11, David Teigland wrote:
> We've done a lot of work in this second version to meet the kernel's
> conventions. Comments and suggestions are welcome; we're happy to answer
> questions and make changes so this can be a widely useful feature for
> people running Linux clusters.

Good luck with this. A meta-comment: you used to call it gdlm, right? I
think it would be a very good idea to return to that name, unless you think
that this will be the only dlm in linux, ever.

Regards,

Daniel

2005-04-25 21:10:12

by Lars Marowsky-Bree

Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-25T13:39:52, Wim Coekaerts <[email protected]> wrote:

> I think it's time we submit ocfs2 w/ its cluster stack so that folks
> can compare (including actual data/numbers), we have been waiting to
> stabilize everything but I guess there is this preemptive strike going
> on so we might just as well. at least we have had hch and folks comment,
> before sending to submit code.
>
> Andrew - we will submit ocfs2 so you can have a look, compare and move
> on. we will work with any stack that eventually gets accepted, just want
> to see the choice out there and an educated decision.

I think "preemptive strike" is a bit over the top, Mr Seklos ;-)

Eventually, I am convinced, this will end up much like everything else:
One DLM will be better for some things than for some others, and what we
need is reasonably clean modularization and APIs so that people can swap
DLMs et al too; the VFS layer is there for a reason, as is the driver
model.

It's great to see that now two viable solutions are finally being
submitted, and I assume that Bruce will jump in within a couple of hours
too. ;-)

Now that we have two (or three) options with actual users, now is the
right time to finally come up with sane and useful abstractions. This is
great.

(In the past, some, including myself, have been guilty of trying this
the other way around, which didn't work. But it was a worthwhile
experience.)

With APIs, I think we do need a DLM-switch in the kernel, but also the
DLMs should really seem much the same to user-space apps. From what I've
seen, dlmfs in OCFS2 wasn't doing too badly there. The icing would of
course be if even the configuration was roughly similar, and if OCFS2's
configfs might prove valuable to other users too.

The cluster summit in June will certainly be a very ... exciting place.
Let's hope this also stirs up KS a bit ;-)

Oh, and just to anticipate that discussion, anyone who suggests adopting
the SAF AIS locking API into the kernel should be preemptively struck;
that naming etc is just beyond words.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-25 21:22:53

by Andrew Morton

Subject: Re: [PATCH 0/7] dlm: overview

Wim Coekaerts <[email protected]> wrote:
>
> > This is a distributed lock manager (dlm) that we'd like to see added to
> > the kernel. The dlm programming api is very similar to that found on
> > other operating systems, but this is modeled most closely after that in
> > VMS.
>
> do you have any performance data at all on this ? I like to see a dlm
> but I like to see something that will also perform well. My main concern
> is that I have not seen anything relying on this code do "reasonably
> well". eg can you show gfs numbers w/ number of nodes and scalability ?
>
> I think it's time we submit ocfs2 w/ its cluster stack so that folks
> can compare (including actual data/numbers), we have been waiting to
> stabilize everything but I guess there is this preemptive strike going
> on so we might just as well. at least we have had hch and folks comment,
> before sending to submit code.

Preemptive strikes won't work, coz this little target will just redirect
the incoming munitions at his other aggressors ;)

It is good that RH has got this process underway. David, I assume that
other interested parties (ocfs, lustre, etc) know that this is happening?
If not, could you please let them know and invite them to comment?

> Andrew - we will submit ocfs2 so you can have a look, compare and move
> on. we will work with any stack that eventually gets accepted, just want
> to see the choice out there and an educated decision.

In an ideal world, the various clustering groups would haggle this thing
into shape, come to a consensus patch series which they all can use and I
would never need to look at the code from a decision-making POV.

The world isn't ideal, but merging something over the strenuous objections
of one or more major groups would be quite regrettable - let's hope that
doesn't happen. Although it might.

> hopefully tomorrow, including data comparing single node and multinode
> performance.

OK. I'm unlikely to merge any first-round patches, as I expect there will
be plenty of review comments and I'm already in a bit of a mess here wrt
patch backlog, email backlog, bug backlog and apparently there have been
some changes in the SCM area...

2005-04-26 05:31:47

by Daniel Phillips

Subject: Re: [PATCH 0/7] dlm: overview

On Monday 25 April 2005 17:09, Lars Marowsky-Bree wrote:
> Now that we have two (or three) options with actual users, now is the
> right time to finally come up with sane and useful abstractions. This is
> great.

Great thought, but it won't work unless you actually read them all, which I
hope is what you're proposing.

> With APIs, I think we do need a DLM-switch in the kernel, but also the
> DLMs should really seem much the same to user-space apps. From what I've
> seen, dlmfs in OCFS2 wasn't doing too badly there. The icing would of
> course be if even the configuration was roughly similar, and if OCFS2's
> configfs might prove valuable to other users too.

I'm a little skeptical about the chance of fitting an 11-parameter function
call into a generic kernel plug-in framework. Are those the exact same 11
parameters that God intended?

While it would be great to share a single dlm between gfs and ocfs2 - maybe
Lustre too - my crystal ball says that that laudable goal is unlikely to be
achieved in the near future, whereas there isn't much choice but to sort out
a common membership framework right now.

As far as I can see, only cluster membership wants or needs a common
framework. And I'm not sure that any of that even needs to be in-kernel.

Regards,

Daniel



> The cluster summit in June will certainly be a very ... exciting place.
> Let's hope this also stirs up KS a bit ;-)
>
> Oh, and just to anticipate that discussion, anyone who suggests to adopt
> the SAF AIS locking API into the kernel should be preemptively struck;
> that naming etc is just beyond words.
>
>
> Sincerely,
> Lars Marowsky-Brée <[email protected]>

2005-04-26 05:36:49

by David Teigland

Subject: Re: [PATCH 0/7] dlm: overview

On Mon, Apr 25, 2005 at 01:39:52PM -0700, Wim Coekaerts wrote:
> > This is a distributed lock manager (dlm) that we'd like to see added to
> > the kernel. The dlm programming api is very similar to that found on
> > other operating systems, but this is modeled most closely after that in
> > VMS.
>
> do you have any performance data at all on this ? I like to see a dlm
> but I like to see something that will also perform well.

No. What kind of performance measurements do you have in mind? Most dlm
lock requests involve sending a message to a remote machine and waiting
for a reply. I expect this network round-trip is the bulk of the time for
a request, which is why I'm a bit confused by your question.

Now, sometimes there are two remote messages (when a resource directory
lookup is needed). You can eliminate that by not using a resource
directory, which will soon be a configurable option.
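
(Illustration with assumed numbers, not measurements: if one LAN round trip
costs on the order of 200 microseconds, a single-message request takes
roughly 200 us and a directory-lookup request roughly 400 us, so a caller
issuing requests back to back would top out somewhere around 2,500-5,000
requests per second, with the wire time dominating over any dlm processing.)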


> My main concern is that I have not seen anything relying on this code do
> "reasonably well". eg can you show gfs numbers w/ number of nodes and
> scalability ?

I'd suggest that if some cluster application is using the dlm and has poor
performance or scalability, the reason and solution lie mostly in the
app, not in the dlm. That's assuming we're not doing anything blatantly
dumb in the dlm, but I think you may be placing too much emphasis on the
role of the dlm here.


> I think it's time we submit ocfs2 w/ its cluster stack so that folks
> can compare (including actual data/numbers), we have been waiting to
> stabilize everything but I guess there is this preemptive strike going
> on so we might just as well. at least we have had hch and folks comment,
> before sending to submit code.

Strike? Preemption? That sounds frightfully conspiratorial and
contentious; who have you been talking to? It's obvious to me that ocfs2
and gfs each have their own happy niche; they're hardly equivalent (more
so considering all the flavors of local file systems.) This is surely a
case of "different", not "conflict"!


> Andrew - we will submit ocfs2 so you can have a look, compare and move
> on. we will work with any stack that eventually gets accepted, just want
> to see the choice out there and an educated decision.
>
> hopefully tomorrow, including data comparing single node and multinode
> performance.

I'd really like to see ocfs succeed, but good heavens, why do we need to
study an entire cluster fs when looking at a dlm!? A cluster fs may use a
dlm, but a dlm is surely a stand-alone entity with _many_ applications
beyond a cluster fs (which is frankly a rather obscure app.)

We've made great effort to make the dlm broadly useful beyond the realm of
gfs or cluster file systems. In the long run I expect other cluster apps
will out-use the dlm by far.

Dave

2005-04-26 05:43:14

by David Teigland

Subject: Re: [PATCH 0/7] dlm: overview

On Mon, Apr 25, 2005 at 02:19:53PM -0700, Andrew Morton wrote:
> In an ideal world, the various clustering groups would haggle this thing
> into shape, come to a consensus patch series which they all can use and I
> would never need to look at the code from a decision-making POV.

It appears that the different clustering groups are all involved. As I
said, we're committed to making this useful for everyone. Our approach
from the start has been to copy the "standard" dlm api available on other
operating systems (in particular VMS); there's never been a tie to one
application. We're also trying to make the behaviour configurable at
points where there are useful differences in the dlms people are
accustomed to.

Other groups will obviously not be able to adopt this dlm immediately, but
the goal is for them to be _able_ to do so eventually -- if they want and
when time permits. If anyone is _unable_ to use this dlm for some
technical reason, we'd definitely like to know about that now so it can be
fixed.

Thanks,
Dave

2005-04-26 18:49:21

by Mark Fasheh

Subject: Re: [PATCH 0/7] dlm: overview

On Tue, Apr 26, 2005 at 01:39:30PM +0800, David Teigland wrote:
> On Mon, Apr 25, 2005 at 01:39:52PM -0700, Wim Coekaerts wrote:
> > > This is a distributed lock manager (dlm) that we'd like to see added to
> > > the kernel. The dlm programming api is very similar to that found on
> > > other operating systems, but this is modeled most closely after that in
> > > VMS.
> >
> > do you have any performance data at all on this ? I like to see a dlm
> > but I like to see something that will also perform well.
>
> No. What kind of performance measurements do you have in mind? Most dlm
> lock requests involve sending a message to a remote machine and waiting
> for a reply. I expect this network round-trip is the bulk of the time for
> a request, which is why I'm a bit confused by your question.
Resource lookup times, times to deliver events to clients (asts, basts,
etc) for starters. How long does recovery take after a node crash? How does
all of this scale as you increase the number of nodes in your cluster?
Sure, network speed is a part of the equation, but it's not the *whole*
equation and I've seen dlms that can get downright nasty when it comes to
recovery speeds, etc.

> Now, sometimes there are two remote messages (when a resource directory
> lookup is needed). You can eliminate that by not using a resource
> directory, which will soon be a configurable option.
>
>
> > My main concern is that I have not seen anything relying on this code do
> > "reasonably well". eg can you show gfs numbers w/ number of nodes and
> > scalability ?
>
> I'd suggest that if some cluster application is using the dlm and has poor
> performance or scalability, the reason and solution lies mostly in the
> app, not in the dlm. That's assuming we're not doing anything blatantly
> dumb in the dlm, but I think you may be placing too much emphasis on the
> role of the dlm here.
Well, obviously the dlm is only one component of an entire system, but for a
cluster application it can certainly be an important component, one whose
performance is worth looking into. I don't think asking for this
information is out of the question.
--Mark

> > I think it's time we submit ocfs2 w/ its cluster stack so that folks
> > can compare (including actual data/numbers), we have been waiting to
> > stabilize everything but I guess there is this preemptive strike going
> > on so we might just as well. at least we have had hch and folks comment,
> > before sending to submit code.
>
> Strike? Preemption? That sounds frightfully conspiratorial and
> contentious; who have you been talking to? It's obvious to me that ocfs2
> and gfs each have their own happy niche; they're hardly equivalent (more
> so considering all the flavors of local file systems.) This is surely a
> case of "different", not "conflict"!
>
>
> > Andrew - we will submit ocfs2 so you can have a look, compare and move
> > on. we will work with any stack that eventually gets accepted, just want
> > to see the choice out there and an educated decision.
> >
> > hopefully tomorrow, including data comparing single node and multinode
> > performance.
>
> I'd really like to see ocfs succeed, but good heavens, why do we need to
> study an entire cluster fs when looking at a dlm!? A cluster fs may use a
> dlm, but a dlm is surely a stand-alone entity with _many_ applications
> beyond a cluster fs (which is frankly a rather obscure app.)
>
> We've made great effort to make the dlm broadly useful beyond the realm of
> gfs or cluster file systems. In the long run I expect other cluster apps
> will out-use the dlm by far.
>
> Dave
>
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2005-04-26 22:35:01

by Steven Dake

Subject: Re: [PATCH 0/7] dlm: overview

On Tue, 2005-04-26 at 11:48, Mark Fasheh wrote:
> On Tue, Apr 26, 2005 at 01:39:30PM +0800, David Teigland wrote:
> > On Mon, Apr 25, 2005 at 01:39:52PM -0700, Wim Coekaerts wrote:
> > > > This is a distributed lock manager (dlm) that we'd like to see added to
> > > > the kernel. The dlm programming api is very similar to that found on
> > > > other operating systems, but this is modeled most closely after that in
> > > > VMS.
> > >
> > > do you have any performance data at all on this ? I like to see a dlm
> > > but I like to see something that will also perform well.
> >
> > No. What kind of performance measurements do you have in mind? Most dlm
> > lock requests involve sending a message to a remote machine and waiting
> > for a reply. I expect this network round-trip is the bulk of the time for
> > a request, which is why I'm a bit confused by your question.
> Resource lookup times, times to deliver events to clients (asts, basts,
> etc) for starters. How long does recovery take after a node crash? How does
> all of this scale as you increase the number of nodes in your cluster?
> Sure, network speed is a part of the equation, but it's not the *whole*
> equation and I've seen dlms that can get downright nasty when it comes to
> recovery speeds, etc.
>

Reading these requests and information about how locking works and
performs, I have a suggestion to improve performance dramatically.

Use implicit acknowledgement, self delivery, and message packing. With
these approaches I think it is possible to support acquisition of
15,000-30,000 locks per second on modern 3GHz CPUs with 100Mbit
networking.

Most important of these is implicit acknowledgement, whereby instead of
sending a request and waiting for a response from one node, the request
is sent to all processors. Then self delivery is used to deliver the
lock request to the lock service which processes it and accepts it if
the lock can be taken or rejects it if the lock cannot be granted (or
puts it in a list to be granted later, or whatever). This does require
that all processors agree upon the order of the messages sent through
implicit acknowledgement. But it removes the *reply* step which reduces
latency and allows a processor to obtain a lock as soon as the message
is self-delivered to the requesting processor. It also creates a
redundant copy of the lock state that all processors maintain, likely
improving availability in the face of stop or restart faults.

Message packing is an important improvement, since I assume the size of
the lock request structures is small. The grant time on Ethernet is
about the same for small or large packets, so reducing the number of
times access to the Ethernet must be granted can make a huge
improvement. This allows packing a group of lock requests from one
processor into a full MTU-sized packet for 8x-10x performance gains for
150-byte messages on 1500-MTU networks...
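
As a rough illustration of the packing idea (a hypothetical sketch, not code
from any of the posted patches; the 1500-byte MTU and ~150-byte request size
are the figures above, everything else is made up for the example):

#include <string.h>

#define MTU_PAYLOAD 1500   /* usable bytes per packet, per the 1500 MTU above */

struct packed_batch {
    unsigned int  used;                /* bytes filled so far */
    unsigned char buf[MTU_PAYLOAD];
};

/* stand-in for whatever transport the cluster actually uses */
static void send_datagram(const void *data, unsigned int len)
{
    (void)data; (void)len;
}

static void flush_batch(struct packed_batch *b)
{
    if (b->used) {
        send_datagram(b->buf, b->used);
        b->used = 0;
    }
}

/* queue one ~150-byte request; roughly ten requests now share each trip
 * onto the wire instead of paying for one packet per request */
void queue_request(struct packed_batch *b, const void *req, unsigned int len)
{
    if (b->used + len > MTU_PAYLOAD)
        flush_batch(b);
    memcpy(b->buf + b->used, req, len);
    b->used += len;
}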

Without these approaches, I'd expect performance somewhere in the
1,000-2,000 grants per second... How does the implementation in this
patch perform? I'd also be interested to know how long it takes from
the time the request is made on the processor to the time it is
completed (perhaps as measured by gettimeofday, or something similar).
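
Something like this would do for that last measurement (only the
gettimeofday() arithmetic is meant literally; take_lock stands in for
whatever blocking lock call is being measured):

#include <stdio.h>
#include <sys/time.h>

/* take_lock is a hypothetical blocking wrapper around the lock request */
void time_one_request(int (*take_lock)(const char *), const char *resource)
{
    struct timeval start, end;
    long usec;

    gettimeofday(&start, NULL);
    take_lock(resource);
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_usec - start.tv_usec);
    printf("lock request took %ld microseconds\n", usec);
}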

Mark, how does oracle's dlm perform for your questions above?

> > Now, sometimes there are two remote messages (when a resource directory
> > lookup is needed). You can eliminate that by not using a resource
> > directory, which will soon be a configurable option.
> >
> >
> > > My main concern is that I have not seen anything relying on this code do
> > > "reasonably well". eg can you show gfs numbers w/ number of nodes and
> > > scalability ?
> >
> > I'd suggest that if some cluster application is using the dlm and has poor
> > performance or scalability, the reason and solution lies mostly in the
> > app, not in the dlm. That's assuming we're not doing anything blatantly
> > dumb in the dlm, but I think you may be placing too much emphasis on the
> > role of the dlm here.
> Well, obviously the dlm is only one component of an entire system, but for a
> cluster application it can certainly be an important component, one whose
> performance is worth looking into. I don't think asking for this
> information is out of the question.
> --Mark
>
> > > I think it's time we submit ocfs2 w/ its cluster stack so that folks
> > > can compare (including actual data/numbers), we have been waiting to
> > > stabilize everything but I guess there is this preemptive strike going
> > > on so we might just as well. at least we have had hch and folks comment,
> > > before sending to submit code.
> >
> > Strike? Preemption? That sounds frightfully conspiratorial and
> > contentious; who have you been talking to? It's obvious to me that ocfs2
> > and gfs each have their own happy niche; they're hardly equivalent (more
> > so considering all the flavors of local file systems.) This is surely a
> > case of "different", not "conflict"!
> >
> >
> > > Andrew - we will submit ocfs2 so you can have a look, compare and move
> > > on. we will work with any stack that eventually gets accepted, just want
> > > to see the choice out there and an educated decision.
> > >
> > > hopefully tomorrow, including data comparing single node and multinode
> > > performance.
> >
> > I'd really like to see ocfs succeed, but good heavens, why do we need to
> > study an entire cluster fs when looking at a dlm!? A cluster fs may use a
> > dlm, but a dlm is surely a stand-alone entity with _many_ applications
> > beyond a cluster fs (which is frankly a rather obscure app.)
> >
> > We've made great effort to make the dlm broadly useful beyond the realm of
> > gfs or cluster file systems. In the long run I expect other cluster apps
> > will out-use the dlm by far.
> >
> > Dave
> >
> --
> Mark Fasheh
> Senior Software Developer, Oracle
> [email protected]
>

2005-04-27 03:28:40

by David Teigland

Subject: Re: [PATCH 0/7] dlm: overview

On Tue, Apr 26, 2005 at 11:48:45AM -0700, Mark Fasheh wrote:
> On Tue, Apr 26, 2005 at 01:39:30PM +0800, David Teigland wrote:

> > No. What kind of performance measurements do you have in mind? Most
> > dlm lock requests involve sending a message to a remote machine and
> > waiting for a reply. I expect this network round-trip is the bulk of
> > the time for a request, which is why I'm a bit confused by your
> > question.

> Resource lookup times, times to deliver events to clients (asts, basts,
> etc) for starters. How long does recovery take after a node crash? How
> does all of this scale as you increase the number of nodes in your
> cluster? Sure, network speed is a part of the equation, but it's not
> the *whole* equation and I've seen dlms that can get downright nasty
> when it comes to recovery speeds, etc.

Ok, we'll look into how to measure some of that in a way that's
meaningful.


> > > My main concern is that I have not seen anything relying on this
> > > code do "reasonably well". eg can you show gfs numbers w/ number of
> > > nodes and scalability ?
> >
> > I'd suggest that if some cluster application is using the dlm and has
> > poor performance or scalability, the reason and solution lies mostly
> > in the app, not in the dlm. That's assuming we're not doing anything
> > blatantly dumb in the dlm, but I think you may be placing too much
> > emphasis on the role of the dlm here.

> Well, obviously the dlm is only one component of an entire system, but
> for a cluster application it can certainly be an important component,
> one whose performance is worth looking into. I don't think asking for
> this information is out of the question.

GFS measurements will wait until gfs comes along, but we can do some dlm
measuring now.

Dave

2005-04-27 13:42:00

by Lars Marowsky-Bree

Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-26T11:48:45, Mark Fasheh <[email protected]> wrote:

> Resource lookup times, times to deliver events to clients (asts, basts,
> etc) for starters. How long does recovery take after a node crash? How does
> all of this scale as you increase the number of nodes in your cluster?

Well, frankly, recovery time of the DLM mostly will depend on the speed
of the membership algorithm and timings used. My gut feeling is that DLM
recovery time is small compared to membership event detection and the
necessary fencing operation.

But yes, scalability, at least roughly O(foo) guesstimates, for numbers
of locks and/or number of nodes would be helpful, both for a) speed, but
also b) number of network messages involved, for recovery and lock
acquisition.

Mark, do you have the data you ask for for OCFS2's DLM?

(BTW, trimming quotes is considered polite on LKML.)


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-27 13:57:51

by Lars Marowsky-Bree

Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-26T01:30:16, Daniel Phillips <[email protected]> wrote:

> > Now that we have two (or three) options with actual users, now is
> > the right time to finally come up with sane and useful abstractions.
> > This is great.
> Great thought, but it won't work unless you actually read them all,
> which I hope is what you're proposing.

Sure. As time permits ;-)

> I'm a little skeptical about the chance of fitting an 11-parameter function
> call into a generic kernel plug-in framework. Are those the exact same 11
> parameters that God intended?

An 11-parameter function, frankly, more often than not indicates that
the interface is wrong. I know it's inherited from VMS, which is a
perfectly legitimate reason, but I assume it might get cleaned / broken
up in the future.

> While it would be great to share a single dlm between gfs and ocfs2 - maybe
> Lustre too - my crystal ball says that that laudable goal is unlikely to be
> achieved in the near future, whereas there isn't much choice but to sort out
> a common membership framework right now.

Oh, sure. I just like to keep a long term vision in mind, an idea I'd
think you approve of. ;-)

Also I didn't say that they should necessarily _share_ a DLM; I assume
there'll be more than one, just like we have more than one filesystem.
But, can this be mapped to a common subset of features which an
application/user/filesystem can assume to be present in every DLM,
accessed via a common API? This does not preclude the option that one
DLM will perform substantially better for some user than another one, or
that one DLM takes advantage of some specific hardware feature which
makes it run a magnitude faster on the z-Series for example.

As I said, this is not something to do right now, but something to keep
in mind at every step going forward.

> As far as I can see, only cluster membership wants or needs a common
> framework. And I'm not sure that any of that even needs to be in-kernel.

Well, right now nobody proposes the membership to be in-kernel. What I'd
like to see though is a common way of _feeding_ the membership to a
given kernel component, and being able to query the kind of syntax &
semantics it expects.

Note that I said "given kernel component", because I assume from the
start that a node might be part of several overlapping clusters. The
membership I feed to the GFS DLM might not be the same I feed to OCFS2
for another mount.

Questions which need to be settled, or which the API at least needs to
export so we know what is expected from us:

- What do the node ids look like? Are they sparse integers, continuous
ints, uuids, IPv4 or IPv6 address of the 'primary' IP of a node,
hostnames...?

- How are the communication links configured? How to tell it which
interfaces to use for IP, for example?

- How do we actually deliver the membership events -
echo "current node list" >/sys/cluster/gfs/membership
or...?

- What kind of semantics are expected: Can we deliver the membership
events as they come, do we need to introduce suspend/resume barriers
etc?

- How do security credentials play into this, and where are they
enforced - so that a user-space app doesn't mess with kernel locks?

Maybe initially we'll end up with those being "exported" in
Documentation/{OCFS2,GFS}-DLM/ files, but ultimately it'd be nice if
user-space could auto-discover them and do the right thing w/a minimum
amount of configuration.

Or maybe these will be abstracted by user-space wrapper libraries, and
everybody does in the kernel what they deem best.

It's just something which needs to be answered and decided ;-)


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-27 18:14:15

by Mark Fasheh

Subject: Re: [PATCH 0/7] dlm: overview

On Wed, Apr 27, 2005 at 03:23:43PM +0200, Lars Marowsky-Bree wrote:
> But yes, scalability, at least roughly O(foo) guesstimates, for numbers
> of locks and/or number of nodes would be helpful, both for a) speed, but
> also b) number of network messages involved, for recovery and lock
> acquisition.
>
> Mark, do you have the data you ask for for OCFS2's DLM?
The short answer is no, but we're collecting them. Right now, I can say
that if you take our whole stack into consideration, OCFS2 is typically
almost as fast as ext3 (single node, obviously) for things like kernel
untars and builds (a common test over here), even when we have a second or
third node mounted.

As far as specific DLM timings go, we're in the process of collecting them.
As you know, lately we have been deep in a process of stabilizing things :)
While we have collected timings independent of the FS in the past, we
haven't done that recently enough that I'd feel comfortable posting it.

> (BTW, trimming quotes is considered polite on LKML.)
Heh, sorry about that - I'll try to do better in the future :)
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2005-04-27 20:01:08

by Daniel Phillips

Subject: Re: [PATCH 0/7] dlm: overview

On Wednesday 27 April 2005 09:56, Lars Marowsky-Bree wrote:
> An 11-parameter function, frankly, more often than not indicates that
> the interface is wrong. I know it's inherited from VMS, which is a
> perfectly legitimate reason, but I assume it might get cleaned / broken
> up in the future.

To put things in concrete terms, here it is:

int dlm_ls_lock(
    /*  1 */ dlm_lshandle_t lockspace,
    /*  2 */ uint32_t mode,
    /*  3 */ struct dlm_lksb *lksb,
    /*  4 */ uint32_t flags,
    /*  5 */ void *name,
    /*  6 */ unsigned int namelen,
    /*  7 */ uint32_t parent,
    /*  8 */ void (*ast) (void *astarg),
    /*  9 */ void *astarg,
    /* 10 */ void (*bast) (void *astarg),
    /* 11 */ struct dlm_range *range);
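
For illustration, a minimal sketch of a caller using that prototype (the
<libdlm.h> header name and the LKM_EXMODE mode constant are assumptions
based on the usual VMS-style naming, and how the completion callback gets
dispatched back to the caller is library-specific and not shown here):

#include <string.h>
#include <libdlm.h>        /* assumed header for dlm_ls_lock() et al. */

struct lock_ctx {
    struct dlm_lksb lksb;  /* status block filled in on completion */
    int done;
};

static void granted_ast(void *astarg)
{
    struct lock_ctx *ctx = astarg;
    ctx->done = 1;         /* the result is now in ctx->lksb */
}

static void blocking_bast(void *astarg)
{
    /* another node wants a conflicting lock; release or downconvert
     * ours when convenient */
    (void)astarg;
}

int request_exclusive(dlm_lshandle_t ls, const char *resource)
{
    static struct lock_ctx ctx;   /* static only to keep the sketch short */

    memset(&ctx, 0, sizeof(ctx));
    return dlm_ls_lock(ls, LKM_EXMODE, &ctx.lksb, 0,
                       (void *)resource, (unsigned int)strlen(resource),
                       0,           /* no parent lock */
                       granted_ast, &ctx, blocking_bast,
                       NULL);       /* whole resource, no range */
}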

> Questions which need to be settled, or which the API at least needs to
> export so we know what is expected from us:
>
> - How do the node ids look like? Are they sparse integers, continuous
> ints, uuids, IPv4 or IPv6 address of the 'primary' IP of a node,
> hostnames...?

32 bit integers at the moment. I hope it stays that way.

> - How are the communication links configured? How to tell it which
> interfaces to use for IP, for example?

CMAN provides a PF_CLUSTER. This facility seems cool, but I haven't got much
experience with it, and certainly not enough to know if PF_CLUSTER is really
necessary, or should be put forth as a required component of the common
infrastructure. It is not clear to me that SCTP can't be used directly,
perhaps with some library support.

> - How do we actually deliver the membership events -
> echo "current node list" >/sys/cluster/gfs/membership
> or...?

This is rather nice: event messages are delivered over a socket. The specific
form of the messages sucks somewhat, as do the wrappers provided. These need
some public pondering.

> - What kind of semantics are expected: Can we deliver the membership
> events as they come, do we need to introduce suspend/resume barriers
> etc?

Suspend/resume barriers take the form of a simple message protocol,
administered by CMAN.

> - How do security credentials play into this, and where are they
> enforced - so that a user-space app doesn't mess with kernel locks?

Security? What is that? (Too late for me to win that dinner now...)
Security is currently provided by restricting socket access to root.

> Maybe initially we'll end up with those being "exported" in
> Documentation/{OCFS2,GFS}-DLM/ files, but ultimately it'd be nice if
> user-space could auto-discover them and do the right thing w/a minimum
> amount of configuration.

Yes. For the next month or two it should be ambitious enough just to ensure
that the interfaces are simple, sane, and known to satisfy the base
requirements of everybody with existing cluster code to contribute. And
automagic aspects are also worth discussing, just to be sure we don't set up
roadblocks to that kind of improvement in the future. I don't think we need
too much automagic just now, though.

> Or maybe these will be abstracted by user-space wrapper libraries, and
> everybody does in the kernel what they deem best.

I _hope_ that we can arrive at a base membership infrastructure that is
convenient to use either from kernel or user space. User space libraries
already exist, but with warts of various sizes.

Regards,

Daniel

2005-04-27 20:20:33

by Lars Marowsky-Bree

Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-27T16:00:57, Daniel Phillips <[email protected]> wrote:

> > Questions which need to be settled, or which the API at least needs to
> > export so we know what is expected from us:
> >
> > - How do the node ids look like? Are they sparse integers, continuous
> > ints, uuids, IPv4 or IPv6 address of the 'primary' IP of a node,
> > hostnames...?
> 32 bit integers at the moment. I hope it stays that way.

You have just excluded a certain number of clustering stacks from
working. Or at least required them to maintain translation tables. A
UUID has many nice properties; one of the most important ones being that
it is inherently unique (and thus doesn't require an administrator to
assign a node id), and that it also happens to be big enough to hold
anything else you might want to, like the primary IPv6 address of a
node.

We've had that discussion on the OCF list, and I think that was one of
the few really good ones.

> > - How are the communication links configured? How to tell it which
> > interfaces to use for IP, for example?
> CMAN provides a PF_CLUSTER. This facility seems cool, but I haven't got much
> experience with it, and certainly not enough to know if PF_CLUSTER is really
> necessary, or should be put forth as a required component of the common
> infrastructure. It is not clear to me that SCTP can't be used directly,
> perhaps with some library support.

You've missed the point of my question. I did not mean "How does an
application use the cluster comm links", but "How is the kernel
component told which paths/IPs it should use".

> > - How do we actually deliver the membership events - echo "current
> > node list" >/sys/cluster/gfs/membership or...?
> This is rather nice: event messages are delivered over a socket. The
> specific form of the messages sucks somewhat, as do the wrappers
> provided. These need some public pondering.

Again, you've told me how user-space learns about the events. This
wasn't the question; I was asking how user-space tells the kernel about
the membership.

> > - What kind of semantics are expected: Can we deliver the membership
> > events as they come, do we need to introduce suspend/resume barriers
> > etc?
> Suspend/resume barriers take the form of a simple message protocol,
> administered by CMAN.

Not what I asked; see the discussion with David.

> > - How do security credentials play into this, and where are they
> > enforced - so that a user-space app doesn't mess with kernel locks?
> Security? What is that? (Too late for me to win that dinner now...)
> Security is currently provided by restricting socket access to root.

So you'd expect a user-level suid daemon of sorts to wrap around this.
Fair enough.

> Yes. For the next month or two it should be ambitious enough just to ensure
> that the interfaces are simple, sane, and known to satisfy the base
> requirements of everybody with existing cluster code to contribute.

Which is what the above questions were about ;-) heartbeat uses UUIDs
for node identification; we've got a pretty strict security model, and
we do not necessarily use IP as the transport mechanism, and our
membership runs in user-space.

The automagic aspects are the icing on the cake ;-)

> > Or maybe these will be abstracted by user-space wrapper libraries, and
> > everybody does in the kernel what they deem best.
> I _hope_ that we can arrive at a base membership infrastructure that is
> convenient to use either from kernel or user space. User space libraries
> already exist, but with warts of various sizes.

... which is why I asked the above questions: User-space needs to
interface with the kernel to tell it the membership (if the membership
is user-space driven), or retrieve it (if it is kernel driven).

This implies we need to understand the expected semantics of the kernel,
and either standardize them, or have a way for user-space to figure out
which are wanted when interfacing with a particular kernel.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-27 22:39:52

by Daniel Phillips

Subject: Re: [PATCH 0/7] dlm: overview

On Wednesday 27 April 2005 16:20, Lars Marowsky-Bree wrote:
> > > - How do the node ids look like? Are they sparse integers, continuous
> > > ints, uuids, IPv4 or IPv6 address of the 'primary' IP of a node,
> > > hostnames...?
> >
> > 32 bit integers at the moment. I hope it stays that way.
>
> You have just excluded a certain number of clustering stacks from
> working. Or at least required them to maintain translation tables. A
> UUID has many nice properties; one of the most important ones being that
> it is inherently unique (and thus doesn't require an administrator to
> assign a node id), and that it also happens to be big enough to hold
> anything else you might want to, like the primary IPv6 address of a
> node.

UUIDs at this level are inherently bogus, unless of course you have more
than 2**32 cluster nodes. I don't know about you, but I do not have even
half that many nodes over here.

Translation tables are just the thing for people who can't get by without
uuids. (Heck, who needs uuids, just use root's email address.)

> > > - How are the communication links configured? How to tell it which
> > > interfaces to use for IP, for example?
> >
> > CMAN provides a PF_CLUSTER. This facility seems cool, but I haven't got
> > much experience with it, and certainly not enough to know if PF_CLUSTER
> > is really necessary, or should be put forth as a required component of
> > the common infrastructure. It is not clear to me that SCTP can't be used
> > directly, perhaps with some library support.
>
> You've missed the point of my question. I did not mean "How does an
> application use the cluster comm links", but "How is the kernel
> component told which paths/IPs it should use".

I believe cman gives you an address in AF_CLUSTER at the same time it hands
you your event socket. Last time I did this, the actual mechanism was buried
under a wrapper (magma), so I could have gotten that slightly wrong.
Anybody want to clarify?

> > > - How do we actually deliver the membership events - echo "current
> > > node list" >/sys/cluster/gfs/membership or...?
> >
> > This is rather nice: event messages are delivered over a socket. The
> > specific form of the messages sucks somewhat, as do the wrappers
> > provided. These need some public pondering.
>
> Again, you've told me how user-space learns about the events. This
> wasn't the question; I was asking how user-space tells the kernel about
> the membership.

Since cman has now moved to user space, userspace does not tell the kernel
about membership; it just gets a socket+address from cman, which tells cman
that the node just joined. Kernel code can also join the cluster if it wants
to, likewise by poking cman. I'm not sure exactly how that works now that
cman has been moved into userspace. (Hopefully, docs will appear here soon.
One could also read the posted patches...)

> > Yes. For the next month or two it should be ambitious enough just to
> > ensure that the interfaces are simple, sane, and known to satisfy the
> > base requirements of everybody with existing cluster code to contribute.
>
> Which is what the above questions were about ;-) heartbeat uses UUIDs
> for node identification; we've got a pretty strict security model, and
> we do not necessarily use IP as the transport mechanism, and our
> membership runs in user-space.

Can we have a list of all the reasons that you cannot wrap your heartbeat
interface around cman, please? You will need translation for the UUIDs, you
will keep your security model as-is (possibly showing everybody how it should
be done) and you are perfectly free to use whatever transport you wish when
you are not talking directly to cman.

Factoid: I do not use PF_CLUSTER for synchronization in my block devices,
simply because regular tcp streams are faster in this context. As far as I
know (g)dlm is the only user of PF_CLUSTER for any purpose other than talking
to cman.

> > I _hope_ that we can arrive at a base membership infrastructure that is
> > convenient to use either from kernel or user space. User space libraries
> > already exist, but with warts of various sizes.
>
> ... which is why I asked the above questions: User-space needs to
> interface with the kernel to tell it the membership (if the membership
> is user-space driven), or retrieve it (if it is kernel driven).

Passing things around via sockets is a powerful model. PF_UNIX can even pass
a socket to kernel, which is how I go about setting up communication for my
block devices. I think (g)dlm calls open() from within kernel, something
like that. The exact method used to get hold of the appropriate socket is a
just matter of taste. Of course, I like to suppose that _my_ method shows
the most taste of all.

> This implies we need to understand the expected semantics of the kernel,
> and either standardize them, or have a way for user-space to figure out
> which are wanted when interfacing with a particular kernel.

Of course, we could always read the patches...

Regards,

Daniel

2005-04-28 12:51:20

by Stephen C. Tweedie

Subject: Re: [PATCH 0/7] dlm: overview

Hi,

On Wed, 2005-04-27 at 14:23, Lars Marowsky-Bree wrote:

> Well, frankly, recovery time of the DLM mostly will depend on the speed
> of the membership algorithm and timings used.

Sometimes. But I've encountered systems with over ten million active
locks in use at once. Those took a _long_ time to recover the DLM;
sorting out membership was a trivial part of the time!

--Stephen


2005-04-28 14:37:03

by Lars Marowsky-Bree

Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-27T11:12:45, Mark Fasheh <[email protected]> wrote:

> The short answer is no but that we're collecting them. Right now, I can say
> that if you take our whole stack into consideration OCFS2 for things like
> kernel untars and builds (a common test over here), is typically almost as
> fast as ext3 (single node obviously) even when we have a second or third
> node mounted.

Well, agreed that's great, but that seems to imply just generic sane
design: Why should the presence of another node (which does nothing, or
not with overlapping objects on disk) cause any substantial slowdown?

Admittedly we seem to be really short of meaningful benchmarks for DLMs
and/or clustering filesystems...

Hey. Wait. Benchmarks. Scalability issues. No real coding involved.
Anyone from OSDL listening? ;-)

> As far as specific DLM timings go, we're in the process of collecting them.

Perfect.

> As you know, lately we have been deep in a process of stabilizing things :)

Yes, but this also would be a great time to identify real performance
bugs before shipping - so consider it as part of stress testing ;-)

> While we have collected timings independent of the FS in the past, we
> haven't done that recently enough that I'd feel comfortable posting it.

Understood.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-28 14:57:36

by Lars Marowsky-Bree

Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-27T18:38:18, Daniel Phillips <[email protected]> wrote:

> UUIDs at this level are inherently bogus, unless of course you have more
> than 2**32 cluster nodes. I don't know about you, but I do not have even
> half that many nodes over here.

This is not quite the argument. With that argument, 16 bit would be
fine. And even then, I'd call you guilty of causing my lights to flicker
;-)

The argument about UUIDs goes a bit beyond that: No admin needed to
assign them; they can stay the same even if clusters/clusters merge (in
theory); they can be used for inter-cluster addressing too, because they
aren't just unique within a single cluster (think clusters of clusters,
grids etc, whatever the topology), and finally, UUID is a big enough
blob to put all other identifiers in, be it a two bit node id, a
nodename, 32bit IPv4 address or a 128bit IPv6.

This piece is important. It defines one of the fundamental objects in
the API.

I recommend you read up on the discussions on the OCF list on this; this
has probably been one of the hottest arguments.

> > > > - How are the communication links configured? How to tell it which
> > > > interfaces to use for IP, for example?
> > >
> > > CMAN provides a PF_CLUSTER. This facility seems cool, but I haven't got
> > > much experience with it, and certainly not enough to know if PF_CLUSTER
> > > is really necessary, or should be put forth as a required component of
> > > the common infrastructure. It is not clear to me that SCTP can't be used
> > > directly, perhaps with some library support.
> >
> > You've missed the point of my question. I did not mean "How does an
> > application use the cluster comm links", but "How is the kernel
> > component told which paths/IPs it should use".
>
> I believe cman gives you an address in AF_CLUSTER at the same time it hands
> you your event socket. Last time I did this, the actual mechanism was buried
> > under a wrapper (magma), so I could have gotten that slightly wrong.
> Anybody want to clarify?

This still doesn't answer the question. You're telling me how I get my
address in AF_CLUSTER. I was asking, again: "How is the kernel component
configured which paths/IP to use" - ie, the equivalent of ifconfig/route
for the cluster stack, if you so please.

Doing this in a wrapper is one answer - in which case we'd have a
consistent user-space API provided by shared libraries wrapping a
specific kernel component. This places the boundary in user-space.

This seems to be a main point of contention, also applicable to the
first question about node identifiers: What does the kernel/user-space
boundary look like, and is this the one we are aiming to clarify?

Or do we place the boundary in user-space with a specific wrapper around
a given kernel solution.

I can see both or even a mix, but it's an important question.

> Since cman has now moved to user space, userspace does not tell the kernel
> about membership,

That partial sentence already makes no sense. So how does the kernel
(DLM in this case) learn about whether a node is assumed to be up or
down if the membership is in user-space? Right! User-space must tell
it.

Again, this is sort of the question of where the API boundary between
published/internal is.

For example, with OCFS2 (w/user-space membership, which it doesn't yet
have, but which they keep telling me is trivial to add, but we're
busying them with other work right now ;-) it is supposed to work like
this: When a membership event occurs, user-space transfers this event
to the kernel by writing to a configfs mount.

Likewise, the node ids and comm links the kernel DLM uses with OCFS2
are configured via that interface.

If we could standardize at the kernel/user-space boundary for clustering,
like we do for syscalls, this would IMHO be cleaner than having
user-space wrappers.

> Can we have a list of all the reasons that you cannot wrap your heartbeat
> interface around cman, please?

Any API can be mapped to any other API. That wasn't the question. I was
aiming at the kernel/user-space boundary again.

> > ... which is why I asked the above questions: User-space needs to
> > interface with the kernel to tell it the membership (if the membership
> > is user-space driven), or retrieve it (if it is kernel driven).
> Passing things around via sockets is a powerful model.

Passing a socket in to use for communication makes sense. "I want you to
use this transport when talking to the cluster". However, that begs the
question whether you're passing in a unicast peer-to-peer socket or a
multicast one which reaches all of the nodes, and what kind of
security, ordering, and reliability guarantees that transport needs to
provide.

Again, this is (from my side) about understanding the user-space/kernel
boundary, and the APIs used within the kernel.

> > This implies we need to understand the expected semantics of the kernel,
> > and either standardize them, or have a way for user-space to figure out
> > which are wanted when interfacing with a particular kernel.
> Of course, we could always read the patches...

Reading patches is fine for understanding syntax, and spotting some
nits. I find actual discussion with the developers to be invaluable to
figure out the semantics and the _intention_ of the code, which takes
much longer to deduce from the code alone; and you know, just sometimes
the code doesn't actually reflect the intentions of the programmers who
wrote it ;-)


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-28 16:22:16

by David Teigland

Subject: Re: [PATCH 0/7] dlm: overview

On Wed, Apr 27, 2005 at 03:56:35PM +0200, Lars Marowsky-Bree wrote:

> Questions which need to be settled, or which the API at least needs to
> export so we know what is expected from us:

Here's what the dlm takes from userspace:

- Each lockspace takes a list of nodeids that are the current members
of that lockspace. Nodeids are ints. For lockspace "alpha", it looks
like this:
echo "1 2 3 4" > /sys/kernel/dlm/alpha/members

- The dlm comms code needs to map these nodeids to real IP addresses.
A simple ioctl on a dlm char device passes in nodeid/sockaddr pairs.
e.g. dlm_tool set_node 1 10.0.0.1
to tell the dlm that nodeid 1 has IP address 10.0.0.1

- To suspend the lockspace you'd do (and similar for resuming):
echo 1 > /sys/kernel/dlm/alpha/stop

GFS won't be anything like that. To control gfs file system "alpha":

- To tell it that its mount is completed:
echo 1 > /sys/kernel/gfs/alpha/mounted

- To tell it to suspend operation while recovery is taking place:
echo 1 > /sys/kernel/gfs/alpha/block

- To tell it to recover journal 3:
echo 3 > /sys/kernel/gfs/alpha/recover

There's a dlm daemon in user space that works with the specific sysfs
files above and interfaces with whatever cluster infrastructure exists.
The same goes for gfs, but the gfs user space daemon does quite a lot more
(gfs-specific stuff).
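
As a rough sketch of what such a daemon does with the dlm interfaces listed
above when the membership of lockspace "alpha" changes (hypothetical code,
not taken from the posted daemon; error handling and the nodeid/sockaddr
ioctl are omitted, and the value written to resume is a guess based on
"and similar for resuming"):

#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int update_lockspace_members(const char *members)   /* e.g. "1 2 3 4" */
{
    /* echo 1 > /sys/kernel/dlm/alpha/stop */
    if (write_sysfs("/sys/kernel/dlm/alpha/stop", "1"))
        return -1;

    /* echo "1 2 3 4" > /sys/kernel/dlm/alpha/members */
    if (write_sysfs("/sys/kernel/dlm/alpha/members", members))
        return -1;

    /* resume the lockspace */
    return write_sysfs("/sys/kernel/dlm/alpha/stop", "0");
}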

In other words, these aren't external APIs; they're internal interfaces
within systems that happen to be split between the kernel and user-space.

Dave

2005-04-28 16:42:50

by Lars Marowsky-Bree

Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-29T00:25:52, David Teigland <[email protected]> wrote:

> On Wed, Apr 27, 2005 at 03:56:35PM +0200, Lars Marowsky-Bree wrote:
>
> > Questions which need to be settled, or which the API at least needs to
> > export so we know what is expected from us:
>
> Here's what the dlm takes from userspace:
>
> - Each lockspace takes a list of nodeid's that are the current members
> of that lockspace. Nodeid's are int's. For lockspace "alpha", it looks
> like this:
> echo "1 2 3 4" > /sys/kernel/dlm/alpha/members
>
> - The dlm comms code needs to map these nodeid's to real IP addresses.
> A simple ioctl on a dlm char device passes in nodeid/sockaddr pairs.
> e.g. dlm_tool set_node 1 10.0.0.1
> to tell the dlm that nodeid 1 has IP address 10.0.0.1
>
> - To suspend the lockspace you'd do (and similar for resuming):
> echo 1 > /sys/kernel/dlm/alpha/stop

Ohhh. _NEAT!_ Simple. Me like simple. This will work just perfectly well
with our current approach (well, with some minor adjustments on our side
for the mapping table).

I assume that we're allowed to update the nodeid/sockaddr mapping while
suspended too? ie, if we were to reassign the nodeid to some other
node...?

We can drive this almost directly and completely with a simple plugin.

> In other words, these aren't external API's; they're internal interfaces
> within systems that happen to be split between the kernel and user-space.

Okay, understood. So the boundary is within user-space.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-28 17:36:01

by Mark Fasheh

Subject: Re: [PATCH 0/7] dlm: overview

On Thu, Apr 28, 2005 at 04:36:19PM +0200, Lars Marowsky-Bree wrote:
> On 2005-04-27T11:12:45, Mark Fasheh <[email protected]> wrote:
>
> The short answer is no, but we're collecting them. Right now, I can say
> that if you take our whole stack into consideration, OCFS2 is typically
> almost as fast as ext3 (single node, obviously) for things like kernel
> untars and builds (a common test over here), even when we have a second or
> third node mounted.
>
> Well, agreed that's great, but that seems to imply just generic sane
> design: Why should the presence of another node (which does nothing, or
> not with overlapping objects on disk) cause any substantial slowdown?
See my discussion with David regarding LKM_LOCAL for how the dlm still
comes into play, even when you have few disk structures to ping on. A
cluster file system is more than just sane disk design :)

Other things come into play there too, like how resource masters are
determined, whether they can be migrated, etc. These can have an effect even
when you're doing work mostly within your own area of disk.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2005-04-28 20:53:39

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On Thursday 28 April 2005 10:57, Lars Marowsky-Bree wrote:
> On 2005-04-27T18:38:18, Daniel Phillips <[email protected]> wrote:
> > UUIDs at this level are inherently bogus, unless of course you have
> > more than 2**32 cluster nodes. I don't know about you, but I do not have
> > even half that many nodes over here.
>
> This is not quite the argument. With that argument, 16 bit would be
> fine. And even then, I'd call you guilty of causing my lights to flicker
> ;-)

BlueGene is already pushing the 16-bit node number boundary, so 32 bits seems
prudent. More is silly. Think of the node number as more like a PID than a
UUID.

> The argument about UUIDs goes a bit beyond that: No admin needed to
> assign them; they can stay the same even if clusters/clusters merge (in
> theory); they can be used for inter-cluster addressing too, because they
> aren't just unique within a single cluster (think clusters of clusters,
> grids etc, whatever the topology), and finally, UUID is a big enough
> blob to put all other identifiers in, be it a two bit node id, a
> nodename, 32bit IPv4 address or a 128bit IPv6.
>
> This piece is important. It defines one of the fundamental objects in
> the API.
>
> I recommend you read up on the discussions on the OCF list on this; this
> has probably been one of the hottest arguments.

Add a translation layer if you like, and submit it in the form of a user space
library service. Or have it be part of your own layer or application. There
is no compelling argument for embedding such a bloatlet in cman proper (which
is already fatter than it should be).

> "How is the kernel component
> configured which paths/IP to use" - ie, the equivalent of ifconfig/route
> for the cluster stack,

There is a config file in /etc and a (userspace) scheme for distributing the
file around the cluster (ccs - cluster configuration system).

How the configuration gets from the config file to the kernel is a mystery to me
at the moment, which I will hopefully solve by reading some code later
today ;-)

> Doing this in a wrapper is one answer - in which case we'd have a
> consistent user-space API provided by shared libraries wrapping a
> specific kernel component. This places the boundary in user-space.

I believe that it is almost entirely in user space now, with the recent move
of cman to user space. I have not yet seen the new code, so I don't know the
details (this egg was hatched by Dave and Patrick).

> This seems to be a main point of contention, also applicable to the
> first question about node identifiers: What does the kernel/user-space
> boundary look like, and is this the one we are aiming to clarify?

Very much so.

> Or do we place the boundary in user-space with a specific wrapper around
> a given kernel solution.

Yes. But let's try and have a good, durable kernel solution right from the
start.

> > Since cman has now moved to user space, userspace does not tell the
> > kernel about membership,
>
> That partial sentence already makes no sense.

Partial?

> So how does the kernel
> (DLM in this case) learn about whether a node is assumed to be up or
> down if the membership is in user-space? Right! User-space must tell
> it.

By a message over a socket, as I said. This is a really nice property of
sockets: when cman moved from kernel to user space, (g)dlm was hardly
affected at all.
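
To make the shape of that concrete (this is not cman's actual message
format, just an invented illustration of what "user space tells the kernel
about membership over a socket" means):

/* Purely illustrative: a made-up membership event pushed down a socket.
 * Neither the struct layout nor the helper exists in cman or the dlm;
 * this only shows the general shape of the interface. */
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

struct member_event {
	uint32_t nodeid;	/* small integer node id */
	uint32_t state;		/* e.g. 1 = node up, 0 = node down */
};

int send_member_event(int sock, uint32_t nodeid, uint32_t state)
{
	struct member_event ev = { .nodeid = nodeid, .state = state };

	/* the kernel side would read this and adjust lockspace membership */
	return send(sock, &ev, sizeof(ev), 0) == (ssize_t)sizeof(ev) ? 0 : -1;
}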

> For example, with OCFS2 (w/user-space membership, which it doesn't yet
> have, but which they keep telling me is trivial to add, but we're
> busying them with other work right now ;-) it is supposed to work like
> this: When a membership event occurs, user-space transfers this event
> to the kernel by writing to a configfs mount.

Let me go get my airsick bag right now :-)

Let's have no magical filesystems in the core interface please. We can always
add some later on top of a sane base interface, assuming somebody has too
much time on their hands, and that Andrew was busy doing something else, and
Linus left his taste at home that day.

> Likewise, the node ids and comm links the kernel DLM uses with OCFS2
> are configured via that interface.

I am looking forward to flaming that interface should it dare to rear its ugly
head here :-)

> If we could standardize at the kernel/user-space boundary for clustering,
> like we do for syscalls, this would IMHO be cleaner than having
> user-space wrappers.

I don't see anything wrong with wrapping a sane kernel interface with more
stuff to make it more convenient. Right now, the interface is a socket and a
set of messages for the socket. Pretty elegant, if you ask me.

There are bones to pick at the message syntax level of course.

> > Can we have a list of all the reasons that you cannot wrap your heartbeat
> > interface around cman, please?
>
> Any API can be mapped to any other API.

I meant sanely. Let me try again: can we have a list of all the reasons that
you cannot wrap your heartbeat interface around cman _sanely_, please?

> That wasn't the question. I was aiming at the kernel/user-space boundary
> again.

Me too.

> > > ... which is why I asked the above questions: User-space needs to
> > > interface with the kernel to tell it the membership (if the membership
> > > is user-space driven), or retrieve it (if it is kernel driven).
> >
> > Passing things around via sockets is a powerful model.
>
> Passing a socket in to use for communication makes sense. "I want you to
> use this transport when talking to the cluster". However, that begs the
> question whether you're passing in a unicast peer-to-peer socket or a
> multicast one which reaches all of the nodes,

It was multicast last time I looked. I heard mumblings about changing from a
UDP-derived protocol to an SCTP-derived one, and I do not know if multicast
got lost in the translation. It would be a shame if it did. Patrick?

> and what kind of
> security, ordering, and reliability guarantees that transport needs to
> provide.

Security is practically nonexistent at the moment; we just keep normal users
away from the socket. Ordering is provided by a barrier facility at a higher
level. Delivery is guaranteed and knows about membership changes.

> > Of course, we could always read the patches...
>
> Reading patches is fine for understanding syntax, and spotting some
> nits. I find actual discussion with the developers to be invaluable to
> figure out the semantics and the _intention_ of the code, which takes
> much longer to deduce from the code alone; and you know, just sometimes
> the code doesn't actually reflect the intentions of the programmers who
> wrote it ;-)

Strongly agreed, and this thread is doing very well in that regard. But we
really, really, need people to read the patches as well, especially people
with a strong background in clustering.

Regards,

Daniel

2005-04-29 00:34:11

by David Lang

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On Thu, 28 Apr 2005, Lars Marowsky-Bree wrote:

> On 2005-04-27T18:38:18, Daniel Phillips <[email protected]> wrote:
>
>> UUIDs at this level are inherently bogus, unless of course you have more
>> than 2**32 cluster nodes. I don't know about you, but I do not have even
>> half that many nodes over here.
>
> This is not quite the argument. With that argument, 16 bit would be
> fine. And even then, I'd call you guilty of causing my lights to flicker
> ;-)
>
> The argument about UUIDs goes a bit beyond that: No admin needed to
> assign them; they can stay the same even if clusters/clusters merge (in
> theory); they can be used for inter-cluster addressing too, because they
> aren't just unique within a single cluster (think clusters of clusters,
> grids etc, whatever the topology), and finally, UUID is a big enough
> blob to put all other identifiers in, be it a two bit node id, a
> nodename, 32bit IPv4 address or a 128bit IPv6.
>
> This piece is important. It defines one of the fundamental objects in
> the API.
>
> I recommend you read up on the discussions on the OCF list on this; this
> has probably been one of the hottest arguments.

how is this UUID that doesn't need to be touched by an admin, and will
always work in all possible networks (including insane things like backup
servers configured with the same name and IP address as the primary with
NAT between them to allow them to communicate) generated?

there are a lot of software packages out there that could make use of
this.

David Lang

2005-04-29 01:49:50

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

In article <[email protected]> you wrote:
> how is this UUID that doesn't need to be touched by an admin, and will
> always work in all possible networks (including insane things like backup
> servers configured with the same name and IP address as the primary with
> NAT between them to allow them to communicate) generated?
>
> there are a lot of software packages out there that could make use of
> this.

It is hard to make something that works in all cases :)

If you use v4 UUIDs they are essentially 128-bit random bit strings; other
versions depend on the MAC address (plus randomness).

Greetings
Bernd

2005-04-29 01:52:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On Thursday 28 April 2005 20:33, David Lang wrote:
> how is this UUID that doesn't need to be touched by an admin, and will
> always work in all possible networks (including insane things like backup
> servers configured with the same name and IP address as the primary with
> NAT between them to allow them to communicate) generated?
>
> there are a lot of software packages out there that could make use of
> this.

Please do not argue that the 32 bit node ID ints should be changed to uuids,
please find another way to accommodate your uuids.

Regards,

Daniel

2005-04-29 04:23:47

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On Thursday 28 April 2005 12:25, David Teigland wrote:
> There's a dlm daemon in user space that works with the specific sysfs
> files above and interfaces with whatever cluster infrastructure exists.
> The same goes for gfs, but the gfs user space daemon does quite a lot more
> (gfs-specific stuff).
>
> In other words, these aren't external API's; they're internal interfaces
> within systems that happen to be split between the kernel and user-space.

Traditionally, Linux kernel interfaces have been well-defined. I do not think
we want to break with that tradition now.

So please provide a pointer to the kernel interface you have in mind.

Regards,

Daniel

2005-04-29 17:15:09

by David Lang

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On Thu, 28 Apr 2005, Daniel Phillips wrote:

> On Thursday 28 April 2005 20:33, David Lang wrote:
>> how is this UUID that doesn't need to be touched by an admin, and will
>> always work in all possible networks (including insane things like backup
>> servers configured with the same name and IP address as the primary with
>> NAT between them to allow them to communicate) generated?
>>
>> there are a lot of software packages out there that could make use of
>> this.
>
> Please do not argue that the 32 bit node ID ints should be changed to uuids,
> please find another way to accommodate your uuids.

you misunderstand my question.

the claim was that UUID's are unique and don't have to be assigned by the
admins.

I'm saying that in my experience there isn't any standard or reliable way
to generate such a UUID, and I'm asking the people making the
claim to educate me on what I'm missing, because a reliable UUID for Linux
on all hardware would be extremely useful for many things.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-04-29 20:52:02

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On Friday 29 April 2005 13:13, David Lang wrote:
> On Thu, 28 Apr 2005, Daniel Phillips wrote:
> > On Thursday 28 April 2005 20:33, David Lang wrote:
> >> how is this UUID that doesn't need to be touched by an admin, and will
> >> always work in all possible networks (including insane things like
> >> backup servers configured with the same name and IP address as the
> >> primary with NAT between them to allow them to communicate) generated?
> >>
> >> there are a lot of software packages out there that could make use of
> >> this.
> >
> > Please do not argue that the 32 bit node ID ints should be changed to
> > uuids, please find another way to accommodate your uuids.
>
> you misunderstand my question.
>
> the claim was that UUID's are unique and don't have to be assigned by the
> admins.
>
> I'm saying that in my experience there isn't any standard or reliable way
> to generate such a UUID, and I'm asking the people making the
> claim to educate me on what I'm missing, because a reliable UUID for Linux
> on all hardware would be extremely useful for many things.

OK, that sounds plausible. However, just to be 100% clear, do you agree that
a) simple integer node numbers are better (because simpler) in cman proper,
and b) UUIDs can be layered on top of a simple integer scheme, using a pair
of mappings?
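
For illustration only (none of these names exist anywhere), such a layer
could be as small as a pair of lookup tables in a user space library:

/* Hypothetical user space mapping layer: 128-bit UUIDs <-> 32-bit nodeids.
 * Sketch only; the names and sizes here are made up for illustration. */
#include <stdint.h>
#include <string.h>

#define MAX_NODES 256

struct node_map_entry {
	uint32_t nodeid;		/* small integer the kernel dlm uses */
	unsigned char uuid[16];		/* 128-bit UUID used by higher layers */
};

static struct node_map_entry node_map[MAX_NODES];
static int node_map_count;		/* filled in from cluster configuration */

/* UUID -> nodeid; returns 0 if the UUID is unknown */
uint32_t nodeid_from_uuid(const unsigned char *uuid)
{
	for (int i = 0; i < node_map_count; i++)
		if (!memcmp(node_map[i].uuid, uuid, 16))
			return node_map[i].nodeid;
	return 0;
}

/* nodeid -> UUID; returns NULL if the nodeid is unknown */
const unsigned char *uuid_from_nodeid(uint32_t nodeid)
{
	for (int i = 0; i < node_map_count; i++)
		if (node_map[i].nodeid == nodeid)
			return node_map[i].uuid;
	return NULL;
}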

Regards,

Daniel

2005-05-01 03:57:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

> the claim was that UUID's are unique and don't have to be assigned by the
> admins.
>
> I'm saying that in my experience there isn't any standard or reliable way
> to generate such a UUID, and I'm asking the people making the
> claim to educate me on what I'm missing, because a reliable UUID for Linux
> on all hardware would be extremely useful for many things.

How to reliably generate universally unique ID's has been well
understood for over twenty years, and has been implemented on nearly every
Linux system for over ten. For more information I refer you to
doc/draft-leach-uuid-guids-01.txt in the e2fsprogs sources, and for an
implementation, the uuid library in e2fsprogs, which is used by both
GNOME and KDE. UUID's are also used by Apple's Mac OS X (using
libuuid from e2fsprogs), Microsoft Windows, more historically by the
OSF DCE, and even more historically by the Apollo Domain OS (1980 --
1989, RIP). Much of this usage is due to the efforts of Paul Leach, a
key architect at Apollo, and OSF/DCE, before he left and joined the
Dark Side at Microsoft.

Also, FYI the OSF/DCE, including the specification for generating
UUID's, was submitted by OSF to X/Open, where it was standardized;
X/Open in turn submitted it to the ISO, where it was approved as a
Publicly Available Specification (PAS). So technically, there *is*
an internationally standardized way of generating UUID's, and it is
already implemented and deployed on nearly all Linux systems.
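
For example, generating and printing one with libuuid takes only a few
lines of C (link with -luuid):

/* Minimal example using libuuid from e2fsprogs. */
#include <stdio.h>
#include <uuid/uuid.h>

int main(void)
{
	uuid_t uu;		/* 128-bit binary UUID */
	char str[37];		/* 36 text characters plus NUL */

	uuid_generate(uu);	/* random- or time/MAC-based, per local policy */
	uuid_unparse(uu, str);
	printf("%s\n", str);
	return 0;
}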

- Ted

2005-05-01 04:16:33

by David Lang

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On Sat, 30 Apr 2005, Theodore Ts'o wrote:

>> the claim was that UUID's are unique and don't have to be assigned by the
>> admins.
>>
>> I'm saying that in my experience there isn't any standard or reliable way
>> to generate such a UUID, and I'm asking the people making the
>> claim to educate me on what I'm missing, because a reliable UUID for Linux
>> on all hardware would be extremely useful for many things.
>
> How to reliably generate universally unique ID's has been well
> understood for over twenty years, and has been implemented on nearly every
> Linux system for over ten. For more information I refer you to
> doc/draft-leach-uuid-guids-01.txt in the e2fsprogs sources, and for an
> implementation, the uuid library in e2fsprogs, which is used by both
> GNOME and KDE. UUID's are also used by Apple's Mac OS X (using
> libuuid from e2fsprogs), Microsoft Windows, more historically by the
> OSF DCE, and even more historically by the Apollo Domain OS (1980 --
> 1989, RIP). Much of this usage is due to the efforts of Paul Leach, a
> key architect at Apollo, and OSF/DCE, before he left and joined the
> Dark Side at Microsoft.
>
> Also, FYI the OSF/DCE, including the specification for generating
> UUID's, was submitted by OSF to X/Open, where it was standardized;
> X/Open in turn submitted it to the ISO, where it was approved as a
> Publicly Available Specification (PAS). So technically, there *is*
> an internationally standardized way of generating UUID's, and it is
> already implemented and deployed on nearly all Linux systems.

thanks for the pointer. I wasn't aware of this draft (although from a
reasonably short search it appears that it was allowed to expire, with no
direct replacement that I could find).

I will say that this wasn't what I thought was being talked about for
cluster membership, because I assumed that the generation of an ID would
be repeatable so that a cluster node could be rebuilt and rejoin the
cluster with its old ID.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-05-02 12:38:21

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 0/7] dlm: overview

On 2005-04-30T21:14:45, David Lang <[email protected]> wrote:

> I will say that this wasn't what I thought was being talked about for
> cluster membership, because I assumed that the generation of an ID would
> be repeatable so that a cluster node could be rebuilt and rejoin the
> cluster with its old ID.

Hm? Every node generates its UUID _once_ and stores it on persistent
local storage. It doesn't get regenerated.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
-- Charles Darwin