Date: Tue, 26 Apr 2016 17:31:18 +0200
From: Jan Kara
To: Dan Williams
Cc: Dave Chinner, "Verma, Vishal L", "linux-block@vger.kernel.org",
	"jack@suse.cz", "axboe@fb.com", "linux-nvdimm@ml01.01.org",
	"linux-kernel@vger.kernel.org", "xfs@oss.sgi.com",
	"hch@infradead.org", "linux-mm@kvack.org", "Wilcox, Matthew R",
	"linux-fsdevel@vger.kernel.org", "akpm@linux-foundation.org",
	"linux-ext4@vger.kernel.org", "viro@zeniv.linux.org.uk"
Subject: Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io
Message-ID: <20160426153118.GI27612@quack2.suse.cz>
References: <20160425083114.GA27556@infradead.org>
	<1461604476.3106.12.camel@intel.com>
	<20160425232552.GD18496@dastard>
	<20160426001157.GE18496@dastard>
	<20160426025645.GG18496@dastard>
	<20160426082711.GC26977@dastard>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 26-04-16 07:59:10, Dan Williams wrote:
> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner wrote:
> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
> [..]
> > It seems to me you are focussing on code/technologies that exist
> > today instead of trying to define an architecture that is more
> > optimal for pmem storage systems. Yes, working code is great, but if
> > you can't tell people how things like robust error handling and
> > redundancy are going to work in future then it's going to take
> > forever for everyone else to handle such errors robustly through the
> > storage stack...
>
> Precisely because higher order redundancy is built on top of this baseline.
>
> MD-RAID can't do its error recovery if we don't have -EIO and
> clear-error-on-write. On the other hand, you're absolutely right that
> we have a gaping hole on top of the SIGBUS recovery model, and don't
> have a kernel layer we can interpose on top of DAX to provide some
> semblance of redundancy.
>
> In the meantime, a handful of applications with a team of full-time
> site-reliability-engineers may be able to plug in external redundancy
> infrastructure on top of what is defined in these patches. For
> everyone else, the hard problem, we need to do a lot more thinking
> about a trap and recover solution.

So we could actually implement some kind of redundancy with DAX with
reasonable effort. We already do track dirty storage PFNs in the radix
tree. After DAX locking patches get merged we also have a reliable way to
write-protect them when we decide to do 'writeback' (translates to flushing
CPU caches) for them. When we do that, we have all the infrastructure in
place to provide 'stable pages' while some mirroring or other redundancy
mechanism in kernel works with the data.

But as Dave said, we should do some writeup of how this is all supposed to
work and e.g. which layer is going to be responsible for the redundancy. Do
we want to have that in DAX code? Or just provide stable page guarantees
from DAX and do the redundancy from device mapper? This needs more
thought...

								Honza
-- 
Jan Kara
SUSE Labs, CR
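
The sequence described in the reply above (track dirty PFNs, write-protect them at 'writeback', mirror the data while the pages are stable) can be sketched as a small user-space toy model. This is only an illustration under loud assumptions: `ToyDaxMapping`, its fields, and the `during` hook are hypothetical names invented here, a Python set stands in for the dirty tags in the radix tree, and a plain copy stands in for both the CPU cache flush and whatever redundancy mechanism would eventually be chosen. It is not kernel code and does not reflect any existing DAX interface.

```python
class ToyDaxMapping:
    """Toy model of 'stable pages' for DAX; all names are hypothetical."""

    def __init__(self, npages):
        self.primary = [b""] * npages   # stands in for the pmem pages
        self.mirror = [b""] * npages    # hypothetical redundant copy
        self.dirty = set()              # stands in for dirty PFN tags in the radix tree
        self.write_protected = set()    # PFNs that are currently write-protected

    def write(self, pfn, data):
        if pfn in self.write_protected:
            # In a kernel the write fault would wait for writeback to finish;
            # this toy model simply refuses the write.
            raise PermissionError(f"pfn {pfn} is under writeback")
        self.primary[pfn] = data
        self.dirty.add(pfn)

    def writeback(self, during=None):
        """Write-protect the dirty PFNs, mirror them, then unprotect them."""
        batch = sorted(self.dirty)
        self.write_protected.update(batch)      # pages are now 'stable'
        try:
            for pfn in batch:
                if during is not None:
                    during(pfn)                 # hook to simulate a racing writer
                # The copy stands in for the cache flush plus mirroring.
                self.mirror[pfn] = self.primary[pfn]
            self.dirty.difference_update(batch)
        finally:
            self.write_protected.difference_update(batch)
        return batch


# Usage: two dirty pages, with a writer racing against writeback.
mapping = ToyDaxMapping(4)
mapping.write(1, b"a")
mapping.write(3, b"b")

blocked = []

def racing_writer(pfn):
    try:
        mapping.write(pfn, b"x")
    except PermissionError:
        blocked.append(pfn)

flushed = mapping.writeback(during=racing_writer)
```

In this model the racing writes are rejected while the pages are protected, so the mirror sees exactly the data that was present when writeback began. Which layer would own the `mirror` step (DAX itself or device mapper) is the question the reply leaves open; the sketch takes no position on it.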