Date: Wed, 10 Jun 2015 10:27:27 -0500
From: Goldwyn Rodrigues
To: David Teigland
Cc: linux-kernel@vger.kernel.org, NeilBrown
Subject: Re: clustered MD
Message-ID: <5578575F.1040006@suse.com>
In-Reply-To: <20150610150151.GA333@redhat.com>

On 06/10/2015 10:01 AM, David Teigland wrote:
> On Tue, Jun 09, 2015 at 10:33:08PM -0500, Goldwyn Rodrigues wrote:
>>>>> some real world utility to warrant the potential maintenance effort.
>>>>
>>>> We do have a valid real-world utility: providing high availability
>>>> of RAID1 storage over the cluster. The distributed locking is
>>>> required only for error cases and superblock updates, not during
>>>> normal operation, which makes it fast enough for the common case.
>>>
>>> That's the theory, how much evidence do you have of that in practice?
>>
>> We wanted to develop a solution which is lock-free (or at least
>> needs minimal locking) for the most common/frequent usage scenarios.
>> We compared it with iozone on top of ocfs2 and found it very close
>> to local-device performance numbers, and we compared it with cLVM
>> mirroring and found it better as well. In the future we would want
>> to use it with other RAID levels (RAID10?), which is missing now.
>
> OK, but that's the second time you've missed the question I asked about
> examples of real world usage. Given the early stage of development, I'm
> supposing there is none, which also implies it's too early for merging.

I thought I answered that: to use a software RAID1 across multiple nodes
of a cluster. Let me explain in more words.

Consider a cluster of multiple nodes with shared storage, such as a SAN.
The shared device becomes a single point of failure: if the device loses
power, you lose everything. The proposed solution is to use software
RAID, say with two SAN switches fronting different devices, and to
create a RAID1 across them. If you lose power on one switch, or one of
the devices fails, the other is still available. Once the failed
switch/device comes back up, the array resyncs the devices.

>>>> What are the doubts you have about it?
>>>
>>> Before I begin reviewing the implementation, I'd like to better
>>> understand what it is about the existing raid1 that doesn't work
>>> correctly for what you'd like to do with it, i.e. I don't know what
>>> the problem is.
>>
>> David Lang has already responded: The idea is to use a RAID device
>> (currently only level 1 mirroring is supported) with multiple nodes
>> of the cluster.
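To make that usage concrete: each node sees the same two shared devices
(one behind each switch), and the mirror is created once and then
assembled on every node. Assuming an mdadm build with clustered-bitmap
support (option names may vary by version; the device names are just
placeholders), it would look roughly like:

  # On the first node: create a mirror across the two SAN paths.
  # /dev/sda and /dev/sdb stand in for the two shared devices.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=clustered /dev/sda /dev/sdb

  # On every other node: assemble the same array from the same devices.
  mdadm --assemble /dev/md0 /dev/sda /dev/sdb

The point is that the array is active on all the nodes at once rather
than failed over, which is what plain RAID1 cannot do safely.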
> That doesn't come close to answering the question: exactly how do you
> want to use raid1 (I have no idea from the statements you've made)

Using software RAID1 on a cluster with shared devices.

> , and exactly what breaks when you use raid1 in that way? Once we've
> established the technical problem, then I can fairly evaluate your
> solution for it.

Data consistency breaks. If node 1 is writing to the RAID1 device, you
have to make sure the data on the two mirrored devices stays
consistent. With software RAID this is done with write-intent bitmaps,
and the DLM is used to keep that state consistent across the nodes.

Device failure can be partial: say only node 1 sees that one of the
devices has failed (a link break, for example). You need to "tell" the
other nodes not to use that device and that the array is degraded.

In case of node failure, the blocks the failed node was writing (the
dirty regions in its bitmap) must be resynced before the cluster can
continue operation.

Does that explain the situation?

-- 
Goldwyn
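To make the DLM part above concrete, here is a rough sketch (not the
actual md-cluster code; the structure and function names are invented
for illustration) of how a node can take a cluster-wide exclusive lock
around something like a superblock update. The in-kernel dlm_lock() and
dlm_unlock() calls complete asynchronously through an AST callback, so
a completion is used to wait:

#include <linux/completion.h>
#include <linux/dlm.h>
#include <linux/string.h>

/* Invented for the example; res->ls would come from dlm_new_lockspace(). */
struct sketch_lockres {
	dlm_lockspace_t *ls;
	struct dlm_lksb lksb;
	struct completion done;
};

/* Completion AST: the DLM calls this when a lock or unlock finishes. */
static void sketch_ast(void *arg)
{
	struct sketch_lockres *res = arg;

	complete(&res->done);
}

/* Take an exclusive (EX) lock on a named resource and wait for the grant. */
static int sketch_lock_ex(struct sketch_lockres *res, char *name)
{
	int ret;

	init_completion(&res->done);
	ret = dlm_lock(res->ls, DLM_LOCK_EX, &res->lksb, 0,
		       name, strlen(name), 0,
		       sketch_ast, res, NULL);
	if (ret)
		return ret;
	wait_for_completion(&res->done);
	return res->lksb.sb_status;	/* 0 when granted */
}

/* Drop the lock; the same AST fires when the unlock completes. */
static int sketch_unlock(struct sketch_lockres *res)
{
	int ret;

	init_completion(&res->done);
	ret = dlm_unlock(res->ls, res->lksb.sb_lkid, 0, &res->lksb, res);
	if (ret)
		return ret;
	wait_for_completion(&res->done);
	return 0;
}

A superblock update would then be bracketed by sketch_lock_ex() and
sketch_unlock(), with a message to the other nodes in between telling
them to re-read the metadata; the same pattern can serialize bitmap and
resync handling. The real code differs in detail, but the locking
pattern is the point.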