Date: Tue, 23 Jun 2015 11:34:43 +1000
From: NeilBrown
To: Goldwyn Rodrigues
Cc: David Teigland, linux-kernel@vger.kernel.org
Subject: Re: clustered MD

On Sun, 14 Jun 2015 17:19:31 -0500 Goldwyn Rodrigues wrote:

> On 06/12/2015 01:46 PM, David Teigland wrote:
> > When a node fails, its dirty areas get special treatment from other nodes
> > using the area_resyncing() function. Should the suspend_list be created
> > before any reads or writes from the file system are processed by md? It
> > seems to me that gfs journal recovery could read/write to dirty regions
> > (from the failed node) before md was finished setting up the suspend_list.
> > md could probably prevent that by using the recover_prep() dlm callback to
> > set a flag that would block any i/o that arrived before the suspend_list
> > was ready.
> >
>
> Yes, we should call mddev_suspend() in recover_prep() and mddev_resume()
> after suspend_list is created. Thanks for pointing it out.

The only thing that nodes need to be careful of, between the time when
some other node disappears and the time when that disappearance has been
completely handled, is reads.

md/raid1 must ensure that if/when the filesystem reads from a region
that the missing node was writing to, the filesystem sees consistent
data on all nodes. So it needs to suspend read-balancing while it is
uncertain.

Once the bitmap from the node has been loaded, the normal protection
against read-balancing in a "dirty" region is sufficient. While waiting
for the bitmap to be loaded, the safe thing to do would be to disable
read-balancing completely.

So I think that recover_prep() should set a flag which disables all
read balancing, and recover_done() (or similar) should clear that flag.
Probably there should be one flag for each other node.

Calling mddev_suspend() to suspend all IO is overkill. Suspending all
read balancing is all that is needed.

Thanks,
NeilBrown
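
A minimal userspace sketch of the per-node flag idea described above, for illustration only. It is not code from the md tree: MAX_NODES, struct cluster_state, model_recover_prep(), model_recover_done(), read_balancing_allowed() and choose_mirror() are all hypothetical names, and the real dlm callbacks may not carry a per-node slot argument in this form. The sketch also assumes that "no read balancing" means every node reads from the first mirror.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not md/raid1 code. */
#define MAX_NODES 8

struct cluster_state {
	/* One flag per other node: true while that node has disappeared
	 * and its bitmap has not yet been loaded. */
	bool recovering[MAX_NODES];
};

/* Models the recover_prep() step: a node vanished, so stop balancing. */
static void model_recover_prep(struct cluster_state *cs, int slot)
{
	assert(slot >= 0 && slot < MAX_NODES);
	cs->recovering[slot] = true;
}

/* Models recover_done() (or similar): the bitmap is loaded, so the
 * normal dirty-region protection is sufficient again. */
static void model_recover_done(struct cluster_state *cs, int slot)
{
	assert(slot >= 0 && slot < MAX_NODES);
	cs->recovering[slot] = false;
}

/* Read balancing is allowed only when no node's flag is set. */
static bool read_balancing_allowed(const struct cluster_state *cs)
{
	for (size_t i = 0; i < MAX_NODES; i++)
		if (cs->recovering[i])
			return false;
	return true;
}

/* Assumption: with balancing disabled, always read from mirror 0 so
 * that all nodes see the same copy. Writes and other reads proceed. */
static int choose_mirror(const struct cluster_state *cs, int balanced_choice)
{
	return read_balancing_allowed(cs) ? balanced_choice : 0;
}
```

In this model IO is never blocked; only the choice of mirror changes while any flag is set, which is the difference from suspending the whole array.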