Date: Wed, 10 Jun 2015 10:27:27 -0500
From: Goldwyn Rodrigues
To: David Teigland
Cc: linux-kernel@vger.kernel.org, NeilBrown
Subject: Re: clustered MD
Message-ID: <5578575F.1040006@suse.com>
In-Reply-To: <20150610150151.GA333@redhat.com>

On 06/10/2015 10:01 AM, David Teigland wrote:
> On Tue, Jun 09, 2015 at 10:33:08PM -0500, Goldwyn Rodrigues wrote:
>>>>> some real world utility to warrant the potential maintenance effort.
>>>>
>>>> We do have a valid real-world utility: providing high availability
>>>> of RAID1 storage over the cluster. The distributed locking is
>>>> required only for error cases and superblock updates, not during
>>>> normal operation, which makes it fast enough for the common case.
>>>
>>> That's the theory, how much evidence do you have of that in practice?
>>
>> We wanted to develop a solution which is lock-free (or at least
>> needs minimal locking) for the most common/frequent usage scenarios.
>> We compared it with iozone on top of ocfs2 and found it very close
>> to local-device performance numbers, and we compared it with cLVM
>> mirroring and found it better as well. In the future we would want
>> to use it with other RAID levels (RAID10?), which is missing now.
>
> OK, but that's the second time you've missed the question I asked about
> examples of real world usage. Given the early stage of development, I'm
> supposing there is none, which also implies it's too early for merging.

I thought I answered that: to use a software RAID1 across multiple nodes
of a cluster. Let me explain in more words.

Consider a cluster of multiple nodes with shared storage, such as a SAN.
The shared device becomes a single point of failure: if the device loses
power, you lose everything. The proposed solution is to use software
RAID, say with two SAN switches fronting different devices, and to
create a RAID1 across them. If you lose power on one switch, or one of
the devices fails, the other is still available. Once the failed
switch/device comes back up, the array resyncs the devices.

>>>> What are the doubts you have about it?
>>>
>>> Before I begin reviewing the implementation, I'd like to better
>>> understand what it is about the existing raid1 that doesn't work
>>> correctly for what you'd like to do with it, i.e. I don't know what
>>> the problem is.
>>
>> David Lang has already responded: The idea is to use a RAID device
>> (currently only level 1 mirroring is supported) with multiple nodes
>> of the cluster.
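To make that usage concrete: each node sees the same two shared devices
(one behind each switch), and the mirror is created once and then
assembled on every node. Assuming an mdadm build with clustered-bitmap
support (option names may vary by version; the device names are just
placeholders), it would look roughly like:

  # On the first node: create a mirror across the two SAN paths.
  # /dev/sda and /dev/sdb stand in for the two shared devices.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=clustered /dev/sda /dev/sdb

  # On every other node: assemble the same array from the same devices.
  mdadm --assemble /dev/md0 /dev/sda /dev/sdb

The point is that the array is active on all the nodes at once rather
than failed over, which is what plain RAID1 cannot do safely.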
> That doesn't come close to answering the question: exactly how do you
> want to use raid1 (I have no idea from the statements you've made)

Using software RAID1 on a cluster with shared devices.

> , and exactly what breaks when you use raid1 in that way? Once we've
> established the technical problem, then I can fairly evaluate your
> solution for it.

Data consistency breaks. If node 1 is writing to the RAID1 device, you
have to make sure the data on the two mirrored devices stays
consistent. With software RAID this is done with write-intent bitmaps,
and the DLM is used to keep that state consistent across the nodes.

Device failure can be partial: say only node 1 sees that one of the
devices has failed (a link break, for example). You need to "tell" the
other nodes not to use that device and that the array is degraded.

In case of node failure, the blocks the failed node was writing (the
dirty regions in its bitmap) must be resynced before the cluster can
continue operation.

Does that explain the situation?

-- 
Goldwyn
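To make the DLM part above concrete, here is a rough sketch (not the
actual md-cluster code; the structure and function names are invented
for illustration) of how a node can take a cluster-wide exclusive lock
around something like a superblock update. The in-kernel dlm_lock() and
dlm_unlock() calls complete asynchronously through an AST callback, so
a completion is used to wait:

#include <linux/completion.h>
#include <linux/dlm.h>
#include <linux/string.h>

/* Invented for the example; res->ls would come from dlm_new_lockspace(). */
struct sketch_lockres {
	dlm_lockspace_t *ls;
	struct dlm_lksb lksb;
	struct completion done;
};

/* Completion AST: the DLM calls this when a lock or unlock finishes. */
static void sketch_ast(void *arg)
{
	struct sketch_lockres *res = arg;

	complete(&res->done);
}

/* Take an exclusive (EX) lock on a named resource and wait for the grant. */
static int sketch_lock_ex(struct sketch_lockres *res, char *name)
{
	int ret;

	init_completion(&res->done);
	ret = dlm_lock(res->ls, DLM_LOCK_EX, &res->lksb, 0,
		       name, strlen(name), 0,
		       sketch_ast, res, NULL);
	if (ret)
		return ret;
	wait_for_completion(&res->done);
	return res->lksb.sb_status;	/* 0 when granted */
}

/* Drop the lock; the same AST fires when the unlock completes. */
static int sketch_unlock(struct sketch_lockres *res)
{
	int ret;

	init_completion(&res->done);
	ret = dlm_unlock(res->ls, res->lksb.sb_lkid, 0, &res->lksb, res);
	if (ret)
		return ret;
	wait_for_completion(&res->done);
	return 0;
}

A superblock update would then be bracketed by sketch_lock_ex() and
sketch_unlock(), with a message to the other nodes in between telling
them to re-read the metadata; the same pattern can serialize bitmap and
resync handling. The real code differs in detail, but the locking
pattern is the point.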