2015-06-09 18:22:33

by David Teigland

Subject: clustered MD

I've just noticed the existence of clustered MD for the first time.
It is a major new user of the dlm, and I have some doubts about it.
When did this appear on the mailing list for review?
Dave


2015-06-09 19:26:40

by Goldwyn Rodrigues

Subject: Re: clustered MD



On 06/09/2015 01:22 PM, David Teigland wrote:
> I've just noticed the existence of clustered MD for the first time.
> It is a major new user of the dlm, and I have some doubts about it.
> When did this appear on the mailing list for review?


It first appeared in December, 2014 on the RAID mailing list.
http://marc.info/?l=linux-raid&m=141891941330336&w=2


--
Goldwyn

2015-06-09 19:45:11

by David Teigland

Subject: Re: clustered MD

On Tue, Jun 09, 2015 at 02:26:25PM -0500, Goldwyn Rodrigues wrote:
> On 06/09/2015 01:22 PM, David Teigland wrote:
> >I've just noticed the existence of clustered MD for the first time.
> >It is a major new user of the dlm, and I have some doubts about it.
> >When did this appear on the mailing list for review?
>
> It first appeared in December, 2014 on the RAID mailing list.
> http://marc.info/?l=linux-raid&m=141891941330336&w=2

I don't read that mailing list. Searching my archives of linux-kernel, it
has never been mentioned. I can't even find an email for the md pull
request that included it.

The merge commit states:

- "experimental" code for managing md/raid1 across a cluster using
DLM. Code is not ready for general use and triggers a WARNING if
used. However it is looking good and mostly done and having in
mainline will help co-ordinate development.

That falls far short of the bar for adding it to the kernel. It not only
needs to work, it needs to be reviewed and justified, usually by showing
some real world utility to warrant the potential maintenance effort.

2015-06-09 20:08:32

by Goldwyn Rodrigues

Subject: Re: clustered MD

Hi David,

On 06/09/2015 02:45 PM, David Teigland wrote:
> On Tue, Jun 09, 2015 at 02:26:25PM -0500, Goldwyn Rodrigues wrote:
>> On 06/09/2015 01:22 PM, David Teigland wrote:
>>> I've just noticed the existence of clustered MD for the first time.
>>> It is a major new user of the dlm, and I have some doubts about it.
>>> When did this appear on the mailing list for review?
>>
>> It first appeared in December, 2014 on the RAID mailing list.
>> http://marc.info/?l=linux-raid&m=141891941330336&w=2
>
> I don't read that mailing list. Searching my archives of linux-kernel, it
> has never been mentioned. I can't even find an email for the md pull
> request that included it.

Is this what you are looking for?
http://marc.info/?l=linux-kernel&m=142976971510061&w=2

>
> The merge commit states:
>
> - "experimental" code for managing md/raid1 across a cluster using
> DLM. Code is not ready for general use and triggers a WARNING if
> used. However it is looking good and mostly done and having in
> mainline will help co-ordinate development.
>
> That falls far short of the bar for adding it to the kernel. It not only
> needs to work, it needs to be reviewed and justified, usually by showing

Why do you say it does not work? It did go through its round of reviews
on the RAID mailing list. I understand that you missed it because you
are not subscribed to the raid mailing list.

> some real world utility to warrant the potential maintenance effort.

We do have a valid real world utility. It is to provide
high-availability of RAID1 storage over the cluster. The distributed
locking is required only during cases of error and superblock updates
and is not required during normal operations, which makes it fast enough
for usual case scenarios.

What are the doubts you have about it?

--
Goldwyn


2015-06-09 20:31:07

by David Teigland

Subject: Re: clustered MD

On Tue, Jun 09, 2015 at 03:08:11PM -0500, Goldwyn Rodrigues wrote:
> Hi David,
>
> On 06/09/2015 02:45 PM, David Teigland wrote:
> >On Tue, Jun 09, 2015 at 02:26:25PM -0500, Goldwyn Rodrigues wrote:
> >>On 06/09/2015 01:22 PM, David Teigland wrote:
> >>>I've just noticed the existence of clustered MD for the first time.
> >>>It is a major new user of the dlm, and I have some doubts about it.
> >>>When did this appear on the mailing list for review?
> >>
> >>It first appeared in December, 2014 on the RAID mailing list.
> >>http://marc.info/?l=linux-raid&m=141891941330336&w=2
> >
> >I don't read that mailing list. Searching my archives of linux-kernel, it
> >has never been mentioned. I can't even find an email for the md pull
> >request that included it.
>
> Is this what you are looking for?
> http://marc.info/?l=linux-kernel&m=142976971510061&w=2

Yes, I guess gmail lost it, or put it in spam.

> >- "experimental" code for managing md/raid1 across a cluster using
> > DLM. Code is not ready for general use and triggers a WARNING if
> > used. However it is looking good and mostly done and having in
> > mainline will help co-ordinate development.
> >
> >That falls far short of the bar for adding it to the kernel. It not only
> >needs to work, it needs to be reviewed and justified, usually by showing
>
> Why do you say it does not work?

It's just my abbreviation of that summary paragraph.

> It did go through its round of reviews on the RAID mailing list. I
> understand that you missed it because you are not subscribed to the raid
> mailing list.

I will look for that.

> >some real world utility to warrant the potential maintenance effort.
>
> We do have a valid real world utility. It is to provide
> high-availability of RAID1 storage over the cluster. The
> distributed locking is required only during cases of error and
> superblock updates and is not required during normal operations,
> which makes it fast enough for usual case scenarios.

That's the theory, how much evidence do you have of that in practice?

> What are the doubts you have about it?

Before I begin reviewing the implementation, I'd like to better understand
what it is about the existing raid1 that doesn't work correctly for what
you'd like to do with it, i.e. I don't know what the problem is.

2015-06-09 20:33:21

by David Lang

Subject: Re: clustered MD

On Tue, 9 Jun 2015, David Teigland wrote:

>> We do have a valid real world utility. It is to provide
>> high-availability of RAID1 storage over the cluster. The
>> distributed locking is required only during cases of error and
>> superblock updates and is not required during normal operations,
>> which makes it fast enough for usual case scenarios.
>
> That's the theory, how much evidence do you have of that in practice?
>
>> What are the doubts you have about it?
>
> Before I begin reviewing the implementation, I'd like to better understand
> what it is about the existing raid1 that doesn't work correctly for what
> you'd like to do with it, i.e. I don't know what the problem is.

As I understand things, the problem is providing RAID across multiple machines,
not just across the disks in one machine.

David Lang

2015-06-10 03:33:21

by Goldwyn Rodrigues

Subject: Re: clustered MD



On 06/09/2015 03:30 PM, David Teigland wrote:
> On Tue, Jun 09, 2015 at 03:08:11PM -0500, Goldwyn Rodrigues wrote:
>> Hi David,
>>
>> On 06/09/2015 02:45 PM, David Teigland wrote:
>>> On Tue, Jun 09, 2015 at 02:26:25PM -0500, Goldwyn Rodrigues wrote:
>>>> On 06/09/2015 01:22 PM, David Teigland wrote:
>>>>> I've just noticed the existence of clustered MD for the first time.
>>>>> It is a major new user of the dlm, and I have some doubts about it.
>>>>> When did this appear on the mailing list for review?
>>>>
>>>> It first appeared in December, 2014 on the RAID mailing list.
>>>> http://marc.info/?l=linux-raid&m=141891941330336&w=2
>>>
>>> I don't read that mailing list. Searching my archives of linux-kernel, it
>>> has never been mentioned. I can't even find an email for the md pull
>>> request that included it.
>>
>> Is this what you are looking for?
>> http://marc.info/?l=linux-kernel&m=142976971510061&w=2
>
> Yes, I guess gmail lost it, or put it in spam.
>
>>> - "experimental" code for managing md/raid1 across a cluster using
>>> DLM. Code is not ready for general use and triggers a WARNING if
>>> used. However it is looking good and mostly done and having in
>>> mainline will help co-ordinate development.
>>>
>>> That falls far short of the bar for adding it to the kernel. It not only
>>> needs to work, it needs to be reviewed and justified, usually by showing
>>
>> Why do you say it does not work?
>
> It's just my abbreviation of that summary paragraph.
>
>> It did go through its round of reviews on the RAID mailing list. I
>> understand that you missed it because you are not subscribed to the raid
>> mailing list.
>
> I will look for that.
>
>>> some real world utility to warrant the potential maintenance effort.
>>
>> We do have a valid real world utility. It is to provide
>> high-availability of RAID1 storage over the cluster. The
>> distributed locking is required only during cases of error and
>> superblock updates and is not required during normal operations,
>> which makes it fast enough for usual case scenarios.
>
> That's the theory, how much evidence do you have of that in practice?

We wanted to develop a solution which is lock-free (or at least uses
minimal locking) for the most common/frequent usage scenario. Also, we
compared it using iozone on top of ocfs2 and found that it is very close
to local device performance numbers. We compared it with cLVM mirroring
and found it better as well. However, in the future we would want to use
it with other RAID levels (10?), support for which is missing now.

>
>> What are the doubts you have about it?
>
> Before I begin reviewing the implementation, I'd like to better understand
> what it is about the existing raid1 that doesn't work correctly for what
> you'd like to do with it, i.e. I don't know what the problem is.
>

David Lang has already responded: The idea is to use a RAID device
(currently only level 1 mirroring is supported) with multiple nodes of
the cluster.

Here is a description on how to use it:
http://marc.info/?l=linux-raid&m=141935561418770&w=2

--
Goldwyn

2015-06-10 08:00:11

by Richard Weinberger

Subject: Re: clustered MD

On Wed, Jun 10, 2015 at 5:33 AM, Goldwyn Rodrigues <[email protected]> wrote:
> David Lang has already responded: The idea is to use a RAID device
> (currently only level 1 mirroring is supported) with multiple nodes of the
> cluster.
>
> Here is a description on how to use it:
> http://marc.info/?l=linux-raid&m=141935561418770&w=2

Sorry if this is a stupid question, but how does this compare to DRBD?

--
Thanks,
//richard

2015-06-10 14:00:00

by Goldwyn Rodrigues

Subject: Re: clustered MD



On 06/10/2015 03:00 AM, Richard Weinberger wrote:
> On Wed, Jun 10, 2015 at 5:33 AM, Goldwyn Rodrigues <[email protected]> wrote:
>> David Lang has already responded: The idea is to use a RAID device
>> (currently only level 1 mirroring is supported) with multiple nodes of the
>> cluster.
>>
>> Here is a description on how to use it:
>> http://marc.info/?l=linux-raid&m=141935561418770&w=2
>
> Sorry if this is a stupid question, but how does this compare to DRBD?
>

No, it is not.

DRBD is for local devices synced over the network. Cluster-md is for
shared devices such as a SAN, so all syncs happen over the FC fabric.
DRBD works for only two nodes (until recently); cluster-md works for
multiple nodes.

--
Goldwyn

2015-06-10 15:02:04

by David Teigland

Subject: Re: clustered MD

On Tue, Jun 09, 2015 at 10:33:08PM -0500, Goldwyn Rodrigues wrote:
> >>>some real world utility to warrant the potential maintenance effort.
> >>
> >>We do have a valid real world utility. It is to provide
> >>high-availability of RAID1 storage over the cluster. The
> >>distributed locking is required only during cases of error and
> >>superblock updates and is not required during normal operations,
> >>which makes it fast enough for usual case scenarios.
> >
> >That's the theory, how much evidence do you have of that in practice?
>
> We wanted to develop a solution which is lock-free (or at least uses
> minimal locking) for the most common/frequent usage scenario. Also, we
> compared it using iozone on top of ocfs2 and found that it is very
> close to local device performance numbers. We compared it with cLVM
> mirroring and found it better as well. However, in the future we would
> want to use it with other RAID levels (10?), support for which is
> missing now.

OK, but that's the second time you've missed the question I asked about
examples of real world usage. Given the early stage of development, I'm
supposing there is none, which also implies it's too early for merging.

> >>What are the doubts you have about it?
> >
> >Before I begin reviewing the implementation, I'd like to better understand
> >what it is about the existing raid1 that doesn't work correctly for what
> >you'd like to do with it, i.e. I don't know what the problem is.
>
> David Lang has already responded: The idea is to use a RAID device
> (currently only level 1 mirroring is supported) with multiple nodes
> of the cluster.

That doesn't come close to answering the question: exactly how do you want
to use raid1 (I have no idea from the statements you've made), and exactly
what breaks when you use raid1 in that way? Once we've established the
technical problem, then I can fairly evaluate your solution for it.

Isn't this process what staging is for?

Dave

2015-06-10 16:08:51

by Goldwyn Rodrigues

Subject: Re: clustered MD


On 06/10/2015 10:01 AM, David Teigland wrote:
> On Tue, Jun 09, 2015 at 10:33:08PM -0500, Goldwyn Rodrigues wrote:
>>>>> some real world utility to warrant the potential maintenance effort.
>>>>
>>>> We do have a valid real world utility. It is to provide
>>>> high-availability of RAID1 storage over the cluster. The
>>>> distributed locking is required only during cases of error and
>>>> superblock updates and is not required during normal operations,
>>>> which makes it fast enough for usual case scenarios.
>>>
>>> That's the theory, how much evidence do you have of that in practice?
>>
>> We wanted to develop a solution which is lock-free (or at least uses
>> minimal locking) for the most common/frequent usage scenario. Also, we
>> compared it using iozone on top of ocfs2 and found that it is very
>> close to local device performance numbers. We compared it with cLVM
>> mirroring and found it better as well. However, in the future we would
>> want to use it with other RAID levels (10?), support for which is
>> missing now.
>
> OK, but that's the second time you've missed the question I asked about
> examples of real world usage. Given the early stage of development, I'm
> supposing there is none, which also implies it's too early for merging.
>

I thought I answered that:
To use a software RAID1 across multiple nodes of a cluster. Let me
explain in more detail.

In a cluster with multiple nodes and shared storage, such as a SAN,
the shared device becomes a single point of failure. If the device loses
power, you will lose everything. A solution proposed is to use software
RAID, say with two SAN switches with different devices, and create a
RAID1 on them. So if you lose power on one switch, or one of the devices
fails, the other is still available. Once you get the other switch/device
back up, it would resync the devices.

>>>> What are the doubts you have about it?
>>>
>>> Before I begin reviewing the implementation, I'd like to better understand
>>> what it is about the existing raid1 that doesn't work correctly for what
>>> you'd like to do with it, i.e. I don't know what the problem is.
>>
>> David Lang has already responded: The idea is to use a RAID device
>> (currently only level 1 mirroring is supported) with multiple nodes
>> of the cluster.
>
> That doesn't come close to answering the question: exactly how do you want
> to use raid1 (I have no idea from the statements you've made)

Using software RAID1 on a cluster with shared devices.


>, and exactly
> what breaks when you use raid1 in that way? Once we've established the
> technical problem, then I can fairly evaluate your solution for it.
>

Data consistency breaks. If node 1 is writing to the RAID1 device, you
have to make sure the data between the two RAID devices is consistent.
With software raid, this is performed with bitmaps. The DLM is used to
maintain data consistency.

Device failure can be partial. Say, only node 1 sees that one of the
devices has failed (link break). You need to "tell" other nodes not to
use the device and that the array is degraded.

In case of node failure, the blocks of the failed nodes must be synced
before the cluster can continue operation.

Does that explain the situation?


--
Goldwyn

2015-06-10 15:53:21

by David Teigland

Subject: Re: clustered MD

On Wed, Jun 10, 2015 at 10:27:27AM -0500, Goldwyn Rodrigues wrote:
> I thought I answered that:
> To use a software RAID1 across multiple nodes of a cluster. Let me
> explain in more words..
>
> In a cluster with multiple nodes and shared storage, such as a
> SAN, the shared device becomes a single point of failure.

OK, shared storage, that's an important starting point that was never
clear.

> If the
> device loses power, you will lose everything. A solution proposed is
> to use software RAID, say with two SAN switches with different
> devices and create a RAID1 on it. So if you lose power on one switch
> or one of the devices fails, the other is still available. Once you
> get the other switch/device back up, it would resync the devices.

OK, MD RAID1 on shared disks.

> >, and exactly
> >what breaks when you use raid1 in that way? Once we've established the
> >technical problem, then I can fairly evaluate your solution for it.
>
> Data consistency breaks. If node 1 is writing to the RAID1 device,
> you have to make sure the data between the two RAID devices is
> consistent. With software raid, this is performed with bitmaps. The
> DLM is used to maintain data consistency.

What's different about disks being on SAN that breaks data consistency vs
disks being locally attached? Where did the dlm come into the picture?

> Device failure can be partial. Say, only node 1 sees that one of the
> devices has failed (link break). You need to "tell" other nodes not
> to use the device and that the array is degraded.

Why?

> In case of node failure, the blocks of the failed nodes must be
> synced before the cluster can continue operation.

What do cluster/node failures have to do with syncing mirror copies?

> Does that explain the situation?

No. I don't see what clusters have to do with MD RAID1 devices, they seem
like completely orthogonal concepts.

2015-06-10 16:23:41

by Goldwyn Rodrigues

Subject: Re: clustered MD

To start with, the goal of (basic) MD RAID1 is to keep the two mirrored
devices consistent _all_ of the time. In case of a device failure, it
should degrade the array, marking the failed device so it can be
(hot)removed/replaced. Now, take the same concepts to multiple nodes
using the same MD-RAID1 device...

On 06/10/2015 10:48 AM, David Teigland wrote:
> On Wed, Jun 10, 2015 at 10:27:27AM -0500, Goldwyn Rodrigues wrote:
>> I thought I answered that:
>> To use a software RAID1 across multiple nodes of a cluster. Let me
>> explain in more words..
>>
>> In a cluster with multiple nodes and shared storage, such as a
>> SAN, the shared device becomes a single point of failure.
>
> OK, shared storage, that's an important starting point that was never
> clear.
>
>> If the
>> device loses power, you will lose everything. A solution proposed is
>> to use software RAID, say with two SAN switches with different
>> devices and create a RAID1 on it. So if you lose power on one switch
>> or one of the devices fails, the other is still available. Once you
>> get the other switch/device back up, it would resync the devices.
>
> OK, MD RAID1 on shared disks.
>
>>> , and exactly
>>> what breaks when you use raid1 in that way? Once we've established the
>>> technical problem, then I can fairly evaluate your solution for it.
>>
>> Data consistency breaks. If node 1 is writing to the RAID1 device,
>> you have to make sure the data between the two RAID devices is
>> consistent. With software raid, this is performed with bitmaps. The
>> DLM is used to maintain data consistency.
>
> What's different about disks being on SAN that breaks data consistency vs
> disks being locally attached? Where did the dlm come into the picture?

There are multiple nodes using the same shared device. Different nodes
would be writing their own data to the shared device possibly using a
shared filesystem such as ocfs2 on top of it. Each node maintains a
bitmap to co-ordinate syncs between the two devices of the RAID. Since
there are two devices, writes on the two devices can end at different
times and must be co-ordinated.
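
As a rough illustration (plain user-space C, not the md-cluster code;
every name in it is made up), this is the ordering that each node's
write-intent bitmap enforces for a single mirrored write:

/* Sketch only -- not the md-cluster implementation; all names invented.
 * The region is marked dirty in this node's bitmap before the data
 * reaches either device, and cleared only after both writes complete. */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

struct mirror {
        int leg[2];          /* fds of the two RAID1 legs               */
        int bitmap_fd;       /* fd of this node's on-disk bitmap        */
        uint8_t bits[4096];  /* in-memory copy of the bitmap            */
};

static void bitmap_sync(struct mirror *m)
{
        pwrite(m->bitmap_fd, m->bits, sizeof(m->bits), 0);
        fsync(m->bitmap_fd);
}

static void mirrored_write(struct mirror *m, unsigned region,
                           const void *buf, size_t len, off_t off)
{
        /* 1. Record the intent durably: if this node dies now, another
         *    node can find this bit and resync exactly this region.    */
        m->bits[region / 8] |= 1u << (region % 8);
        bitmap_sync(m);

        /* 2. Write both legs; they may complete at different times.    */
        pwrite(m->leg[0], buf, len, off);
        pwrite(m->leg[1], buf, len, off);
        fsync(m->leg[0]);
        fsync(m->leg[1]);

        /* 3. Only once both legs are known to match may the bit be
         *    cleared (the real code batches and lazily clears this).   */
        m->bits[region / 8] &= ~(1u << (region % 8));
        bitmap_sync(m);
}

If the node dies between steps 1 and 3, the persisted bitmap tells the
survivors exactly which regions may differ between the two devices.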

>
>> Device failure can be partial. Say, only node 1 sees that one of the
>> devices has failed (link break). You need to "tell" other nodes not
>> to use the device and that the array is degraded.
>
> Why?

Data consistency. Because a node which continues to "see" the device
(which has failed from another node's point of view) as working will
read stale data from it.

>
>> In case of node failure, the blocks of the failed nodes must be
>> synced before the cluster can continue operation.
>
> What do cluster/node failures have to do with syncing mirror copies?
>

Data consistency. Different nodes will be writing to different blocks.
So, if a node fails, you need to make sure that whatever the failed node
had not yet synced between the two devices is completed by the node
performing recovery. You need to provide a consistent view to all nodes.
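
Continuing the sketch above (again purely illustrative, with made-up
names), the recovery a surviving node performs is essentially a walk of
the dead node's persisted bitmap:

/* Sketch only: complete whatever the failed node had marked dirty but
 * may not have finished syncing; for simplicity copy leg 0 over leg 1. */
static void resync_region(struct mirror *m, unsigned region,
                          size_t region_size)
{
        char buf[4096];
        off_t off = (off_t)region * region_size;
        size_t done;

        for (done = 0; done < region_size; done += sizeof(buf)) {
                pread(m->leg[0], buf, sizeof(buf), off + done);
                pwrite(m->leg[1], buf, sizeof(buf), off + done);
        }
        fsync(m->leg[1]);
}

static void recover_failed_node(struct mirror *m, const uint8_t *dead_bits,
                                unsigned nregions, size_t region_size)
{
        unsigned r;

        for (r = 0; r < nregions; r++)
                if (dead_bits[r / 8] & (1u << (r % 8)))
                        resync_region(m, r, region_size);
}

Until that walk finishes, the other nodes must treat the dead node's
dirty regions with care, which is where the DLM coordination comes in.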

>> Does that explain the situation?
>
> No. I don't see what clusters have to do with MD RAID1 devices, they seem
> like completely orthogonal concepts.

If you need an analogy: cLVM, but with less overhead ;)

Also, may I point you to linux/Documentation/md-cluster.txt?

HTH,

--
Goldwyn

2015-06-10 17:05:43

by David Teigland

Subject: Re: clustered MD

On Wed, Jun 10, 2015 at 11:23:25AM -0500, Goldwyn Rodrigues wrote:
> To start with, the goal of (basic) MD RAID1 is to keep the two
> mirrored devices consistent _all_ of the time. In case of a device
> failure, it should degrade the array pointing to the failed device,
> so it can be (hot)removed/replaced. Now, take the same concepts to
> multiple nodes using the same MD-RAID1 device..

"multiple nodes using the same MD-RAID1 device" concurrently!? That's a
crucial piece of information that really frames the entire topic. That needs
to be your very first point defining the purpose of this.

How would you use the same MD-RAID1 device concurrently on multiple nodes
without a cluster file system? Does this imply that your work is only
useful for the tiny segment of people who could use MD-RAID1 under a
cluster file system? There was a previous implementation of this in user
space called "cmirror", built on dm, which turned out to be quite useless,
and is being deprecated. Did you talk to cluster file system developers
and users to find out if this is worth doing? Or are you just hoping it
turns out to be worthwhile? That's might be answered by examples of
successful real world usage that I asked about. We don't want to be tied
down with long term maintenance of something that isn't worth it.


> >What's different about disks being on SAN that breaks data consistency vs
> >disks being locally attached? Where did the dlm come into the picture?
>
> There are multiple nodes using the same shared device. Different
> nodes would be writing their own data to the shared device possibly
> using a shared filesystem such as ocfs2 on top of it. Each node
> maintains a bitmap to co-ordinate syncs between the two devices of
> the RAID. Since there are two devices, writes on the two devices can
> end at different times and must be co-ordinated.

Thank you, this is the kind of technical detail that I'm looking for.
Separate bitmaps for each node sounds like a much better design than the
cmirror design which used a single shared bitmap (I argued for using a
single bitmap when cmirror was being designed.)

Given that the cluster file system does locking to prevent concurrent
writes to the same blocks, you shouldn't need any locking in raid1 for
that. Could you elaborate on exactly when inter-node locking is needed,
i.e. what specific steps need to be coordinated?


> >>Device failure can be partial. Say, only node 1 sees that one of the
> >>devices has failed (link break). You need to "tell" other nodes not
> >>to use the device and that the array is degraded.
> >
> >Why?
>
> Data consistency. Because the node which continues to "see" the
> failed device (on another node) as working will read stale data.

I still don't understand, but I suspect this will become clear from other
examples.


> Different nodes will be writing to different
> blocks. So, if a node fails, you need to make sure that what the
> other node has not synced between the two devices is completed by
> the one performing recovery. You need to provide a consistent view
> to all nodes.

This is getting closer to the kind of detail we need, but it's not quite
there yet. I think a full-blown example is probably required, e.g. in
terms of specific reads and writes

1. node1 writes to block X
2. node2 ...


> Also, may I point you to linux/Documentation/md-cluster.txt?

That looks like it will be very helpful when I get to the point of
reviewing the implementation.

2015-06-10 19:22:40

by David Teigland

Subject: Re: clustered MD

On Wed, Jun 10, 2015 at 12:05:33PM -0500, David Teigland wrote:
> Separate bitmaps for each node sounds like a much better design than the
> cmirror design which used a single shared bitmap (I argued for using a
> single bitmap when cmirror was being designed.)

Sorry, I misspoke: I argued for one bitmap per node, like you're doing, so in
general I think you're starting off in a much better direction than I saw
before. (I still doubt there's enough value in this to do it at all,
which is another reason I'm particularly interested to see some real world
success with this.)

2015-06-10 20:31:52

by NeilBrown

Subject: Re: clustered MD

On Wed, 10 Jun 2015 10:01:51 -0500
David Teigland <[email protected]> wrote:

> Isn't this process what staging is for?

No it isn't.
Staging is useful for code drops, i.e. multiple other developers want to
collaborate to improve some code that the maintainer doesn't want to accept.
So it goes into staging, "the community" gets a chance to use the kernel
development workflow to fix it up, and then it either is accepted by the
maintainer, or is eventually discarded.

cluster-MD has followed a very different process. The relevant maintainer (me)
has been involved from the start, has provided input to the design, and
reviewed various initial implementations, and now thinks that the code is close
enough to ready for it to be included in the kernel (with suitable warnings).
That way I only need to review incremental changes, not the whole set of
patches from scratch each time.

What is your interest in this? I'm always happy for open discussion and
varied input, but it would help to know to what extent you are a stakeholder.
Also a slightly less adversarial tone would make me feel more comfortable,
though maybe I'm misreading your intent.

Thanks,
NeilBrown

2015-06-10 21:07:56

by David Teigland

Subject: Re: clustered MD

On Thu, Jun 11, 2015 at 06:31:31AM +1000, Neil Brown wrote:
> What is your interest in this? I'm always happy for open discussion and
> varied input, but it would help to know to what extent you are a stake
> holder?

Using the dlm correctly is non-trivial and should be reviewed.
If the dlm is misused, some part of that may fall in my lap, if
only insofar as having to debug problems to distinguish between dlm
bugs and md-cluster bugs. This has been learned the hard way.

I have yet to find time to look up the previous review discussion.
I will be more than happy if I find the dlm usage has already been
thoroughly reviewed.

> Also a slightly less adversarial tone would make me feel more
> comfortable, though maybe I'm misreading your intent.

You're probably misreading "concerned".

The initial responses to my inquiry were severely lacking in any
substance, even dismissive, which raised "concerned" to "troubled".

2015-06-10 22:11:30

by David Teigland

Subject: Re: clustered MD

On Wed, Jun 10, 2015 at 04:07:44PM -0500, David Teigland wrote:
> > Also a slightly less adversarial tone would make me feel more
> > comfortable, though maybe I'm misreading your intent.
>
> You're probably misreading "concerned".
>
> The initial responses to my inquiry were severely lacking in any
> substance, even dismissive, which raised "concerned" to "troubled".

Reading those messages again I see what you mean, they don't sound very
nice, so sorry about that. I'll repeat the one positive note, which is
that the brief things I've noticed make it look much better than the dm
approach from several years ago.

2015-06-10 22:50:55

by NeilBrown

Subject: Re: clustered MD

On Wed, 10 Jun 2015 16:07:44 -0500
David Teigland <[email protected]> wrote:

> On Thu, Jun 11, 2015 at 06:31:31AM +1000, Neil Brown wrote:
> > What is your interest in this? I'm always happy for open discussion and
> > varied input, but it would help to know to what extent you are a stake
> > holder?
>
> Using the dlm correctly is non-trivial and should be reviewed.
> If the dlm is misused, some part of that may fall in my lap, if
> only so far as having to debug problems to distinguish between dlm
> bugs or md-cluster bugs. This has been learned the hard way.
>
> I have yet to find time to look up the previous review discussion.
> I will be more than happy if I find the dlm usage has already been
> thoroughly reviewed.

The DLM usage is the part that I am least comfortable with and I would
certainly welcome review. There was a recent discussion of some issue that I
haven't had a chance to go over yet, but apart from that it has mostly just
been a few developers trying to figure out what we need and how that can be
implemented.

There are (as I recall) two main aspects of the DLM usage.
One is fairly idiomatic locking of the multiple write-intent bitmaps.
Each bitmap can be "active" or "idle". When "idle", all bits are clear.
When "active", one node will usually have an exclusive lock. If/when that node
dies, all other nodes must find out and at least one takes remedial action.
Once the remedial action is taken, the bitmap becomes idle. In that state
a new node can claim it. When that happens, all other nodes must find out so
they transition to the "watching an active bitmap" state.
This seems to fit well with the shared/exclusive reclaimable locks of DLM.
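
For the curious, the idiomatic part is roughly the following shape (a
sketch against the in-kernel DLM API, not the actual md-cluster code;
the struct, helper and resource names are invented): a synchronous
wrapper around dlm_lock() with an ast/bast pair, one resource per bitmap.

/* Sketch only -- not md-cluster code.  One DLM resource per write-intent
 * bitmap; the node that owns an active bitmap holds the lock in EX, and
 * when that node dies the lock is recovered so another node can take EX
 * and do the remedial resync before the bitmap goes idle again.         */
#include <linux/completion.h>
#include <linux/dlm.h>
#include <linux/string.h>

struct bitmap_lock {
        dlm_lockspace_t *ls;
        struct dlm_lksb lksb;
        char name[16];                  /* e.g. "bitmap0001" (invented)  */
        struct completion done;
};

static void bitmap_ast(void *arg)       /* lock request has completed    */
{
        struct bitmap_lock *bl = arg;

        complete(&bl->done);
}

static void bitmap_bast(void *arg, int mode)
{
        /* Another node asked for a conflicting mode: schedule whatever
         * "this bitmap is changing state" handling is needed here.      */
}

/* Acquire (or convert, with DLM_LKF_CONVERT) the bitmap lock and wait.  */
static int bitmap_lock_sync(struct bitmap_lock *bl, int mode, uint32_t flags)
{
        int ret;

        init_completion(&bl->done);
        ret = dlm_lock(bl->ls, mode, &bl->lksb, flags, bl->name,
                       strlen(bl->name), 0, bitmap_ast, bl, bitmap_bast);
        if (ret)
                return ret;
        wait_for_completion(&bl->done);
        return bl->lksb.sb_status;      /* 0, or e.g. -EAGAIN w/ NOQUEUE */
}

The owner of an active bitmap would hold its resource in DLM_LOCK_EX;
when that node dies the DLM recovers the lock, so another node's EX
request is granted and it can take the remedial action.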

The other usage is to provide synchronous broadcast message passing between
nodes. When one node makes a configuration change, it needs to tell all other
nodes and wait for them to acknowledge before the change (such as adding a
spare) is committed. There is a small collection of locks which represent
different states in a broadcast/acknowledge protocol.
This is the part I'm least confident of, but it seems to make sense and seems
to work.


Separately:

> Reading those messages again I see what you mean, they don't sound very
> nice, so sorry about that. I'll repeat the one positive note, which is
> that the brief things I've noticed make it look much better than the dm
> approach from several years ago.

Thanks :-)
In part this effort is a response to "clvm" - which is a completely adequate
solution for clustering when you just need volume management (growing and
shrinking and striping volumes) but doesn't extend very well to RAID.

Looking forward to any review comments you find time for :-)

Thanks,
NeilBrown

2015-06-12 18:46:29

by David Teigland

Subject: Re: clustered MD

When a node fails, its dirty areas get special treatment from other nodes
using the area_resyncing() function. Should the suspend_list be created
before any reads or writes from the file system are processed by md? It
seems to me that gfs journal recovery could read/write to dirty regions
(from the failed node) before md was finished setting up the suspend_list.
md could probably prevent that by using the recover_prep() dlm callback to
set a flag that would block any i/o that arrived before the suspend_list
was ready.
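
Concretely, I'm imagining something along these lines (just a sketch,
not md-cluster code; the struct and helper names are made up) using the
lockspace ops that get passed to dlm_new_lockspace():

/* Sketch only: recover_prep() marks the start of the window in which a
 * failed node's dirty regions are not yet known, and a hypothetical
 * helper run after the suspend_list has been built ends it.  I/O that
 * arrives inside the window waits.                                      */
#include <linux/dlm.h>
#include <linux/types.h>
#include <linux/wait.h>

struct cluster_info_sketch {
        dlm_lockspace_t *ls;
        bool recovery_pending;          /* true between the two points   */
        wait_queue_head_t wait;
};

static void sketch_recover_prep(void *arg)
{
        struct cluster_info_sketch *ci = arg;

        ci->recovery_pending = true;    /* park i/o arriving from now on */
}

/* Hypothetical: called by md once the failed node's bitmap has been read
 * and the suspend_list covering its dirty regions is in place.          */
static void sketch_suspend_list_ready(struct cluster_info_sketch *ci)
{
        ci->recovery_pending = false;
        wake_up(&ci->wait);
}

static const struct dlm_lockspace_ops sketch_ops = {
        .recover_prep = sketch_recover_prep,
        /* .recover_slot / .recover_done would drive reading the bitmap  */
};

/* The ops are handed over when the lockspace is created, e.g.
 *     dlm_new_lockspace(name, cluster, flags, lvblen,
 *                       &sketch_ops, ci, &ops_result, &ci->ls);
 * and the i/o paths would do
 *     wait_event(ci->wait, !ci->recovery_pending);
 * before touching regions that might belong to the failed node.         */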

2015-06-14 22:19:42

by Goldwyn Rodrigues

Subject: Re: clustered MD



On 06/12/2015 01:46 PM, David Teigland wrote:
> When a node fails, its dirty areas get special treatment from other nodes
> using the area_resyncing() function. Should the suspend_list be created
> before any reads or writes from the file system are processed by md? It
> seems to me that gfs journal recovery could read/write to dirty regions
> (from the failed node) before md was finished setting up the suspend_list.
> md could probably prevent that by using the recover_prep() dlm callback to
> set a flag that would block any i/o that arrived before the suspend_list
> was ready.
>
> .

Yes, we should call mddev_suspend() in recover_prep() and mddev_resume()
after suspend_list is created. Thanks for pointing it out.

--
Goldwyn

2015-06-23 01:35:01

by NeilBrown

Subject: Re: clustered MD

On Sun, 14 Jun 2015 17:19:31 -0500
Goldwyn Rodrigues <[email protected]> wrote:

>
>
> On 06/12/2015 01:46 PM, David Teigland wrote:
> > When a node fails, its dirty areas get special treatment from other nodes
> > using the area_resyncing() function. Should the suspend_list be created
> > before any reads or writes from the file system are processed by md? It
> > seems to me that gfs journal recovery could read/write to dirty regions
> > (from the failed node) before md was finished setting up the suspend_list.
> > md could probably prevent that by using the recover_prep() dlm callback to
> > set a flag that would block any i/o that arrived before the suspend_list
> > was ready.
> >
> > .
>
> Yes, we should call mddev_suspend() in recover_prep() and mddev_resume()
> after suspend_list is created. Thanks for pointing it out.
>

The only thing that nodes need to be careful of between the time when
some other node disappears and when that disappearance has been
completely handled is reads.
md/raid1 must ensure that if/when the filesystem reads from a region
that the missing node was writing to, the filesystem sees consistent
data - on all nodes.

So it needs to suspend read-balancing while it is uncertain.

Once the bitmap from the missing node has been loaded, the normal protection
against read-balancing in a "dirty" region is sufficient. While
waiting for the bitmap to be loaded, the safe thing to do would be to
disable read-balancing completely.

So I think that recover_prep() should set a flag which disables all
read balancing, and recover_done() (or similar) should clear that flag.
Probably there should be one flag for each other node.

Calling mddev_suspend to suspend all IO is overkill. Suspending all
read balancing is all that is needed.
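
Roughly this shape (a sketch only, not a patch; all names are invented,
and in reality this would live in the raid1/md-cluster structures rather
than in globals):

/* Sketch only: one "bitmap not yet loaded" flag per other node, set when
 * the DLM reports that node's slot as failed and cleared once its
 * write-intent bitmap has been read in.  While any flag is set, raid1
 * reads stick to the first in-sync mirror instead of read-balancing.    */
#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/dlm.h>

#define SKETCH_MAX_NODES 64

static unsigned long pending_bitmaps[BITS_TO_LONGS(SKETCH_MAX_NODES)];

/* This would be the .recover_slot callback in the dlm_lockspace_ops.    */
static void sketch_recover_slot(void *arg, struct dlm_slot *slot)
{
        /* That node's bitmap is "unknown" until we have loaded it.      */
        set_bit(slot->slot, pending_bitmaps);
}

/* Hypothetical: called once the failed node's bitmap has been loaded and
 * its dirty regions are covered by the normal bitmap-based protection.  */
static void sketch_bitmap_loaded(int slot)
{
        clear_bit(slot, pending_bitmaps);
}

/* Checked by the read path: read-balance only when nothing is pending.  */
static bool sketch_may_read_balance(void)
{
        return bitmap_empty(pending_bitmaps, SKETCH_MAX_NODES);
}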

Thanks,
NeilBrown