Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934061AbbFJRFn (ORCPT ); Wed, 10 Jun 2015 13:05:43 -0400 Received: from mx1.redhat.com ([209.132.183.28]:46668 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754284AbbFJRFf (ORCPT ); Wed, 10 Jun 2015 13:05:35 -0400 Date: Wed, 10 Jun 2015 12:05:33 -0500 From: David Teigland To: Goldwyn Rodrigues Cc: linux-kernel@vger.kernel.org, NeilBrown Subject: Re: clustered MD Message-ID: <20150610170533.GD333@redhat.com> References: <20150609182102.GA4305@redhat.com> <55773DE1.7080107@suse.com> <20150609194505.GA17536@redhat.com> <557747AB.7080706@suse.com> <20150609203056.GB17536@redhat.com> <5577AFF4.6020505@suse.com> <20150610150151.GA333@redhat.com> <5578575F.1040006@suse.com> <20150610154834.GC333@redhat.com> <5578647D.102@suse.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5578647D.102@suse.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3722 Lines: 81 On Wed, Jun 10, 2015 at 11:23:25AM -0500, Goldwyn Rodrigues wrote: > To start with, the goal of (basic) MD RAID1 is to keep the two > mirrored device consistent _all_ of the time. In case of a device > failure, it should degrade the array pointing to the failed device, > so it can be (hot)removed/replaced. Now, take the same concepts to > multiple nodes using the same MD-RAID1 device.. "multiple nodes using the same MD-RAID1 device" concurrently!? That's a crucial piece information that really frames the entire topic. That needs to be your very first point defining the purpose of this. How would you use the same MD-RAID1 device concurrently on multiple nodes without a cluster file system? Does this imply that your work is only useful for the tiny segment of people who could use MD-RAID1 under a cluster file system? There was a previous implementation of this in user space called "cmirror", built on dm, which turned out to be quite useless, and is being deprecated. Did you talk to cluster file system developers and users to find out if this is worth doing? Or are you just hoping it turns out to be worthwhile? That's might be answered by examples of successful real world usage that I asked about. We don't want to be tied down with long term maintenance of something that isn't worth it. > >What's different about disks being on SAN that breaks data consistency vs > >disks being locally attached? Where did the dlm come into the picture? > > There are multiple nodes using the same shared device. Different > nodes would be writing their own data to the shared device possibly > using a shared filesystem such as ocfs2 on top of it. Each node > maintains a bitmap to co-ordinate syncs between the two devices of > the RAID. Since there are two devices, writes on the two devices can > end at different times and must be co-ordinated. Thank you, this is the kind of technical detail that I'm looking for. Separate bitmaps for each node sounds like a much better design than the cmirror design which used a single shared bitmap (I argued for using a single bitmap when cmirror was being designed.) Given that the cluster file system does locking to prevent concurrent writes to the same blocks, you shouldn't need any locking in raid1 for that. Could elaborate on exactly when inter-node locking is needed, i.e. what specific steps need to be coordinated? > >>Device failure can be partial. Say, only node 1 sees that one of the > >>device has failed (link break). You need to "tell" other nodes not > >>to use the device and that the array is degraded. > > > >Why? > > Data consistency. Because the node which continues to "see" the > failed device (on another node) as working will read stale data. I still don't understand, but I suspect this will become clear from other examples. > Different nodes will be writing to different > blocks. So, if a node fails, you need to make sure that what the > other node has not synced between the two devices is completed by > the one performing recovery. You need to provide a consistent view > to all nodes. This is getting closer to the kind of detail we need, but it's not quite there yet. I think a full-blown example is probably required, e.g. in terms of specific reads and writes 1. node1 writes to block X 2. node2 ... > Also, may I point you to linux/Documentation/md-cluster.txt? That looks like it will be very helpful when I get to the point of reviewing the implementation. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/