Date: Mon, 21 Nov 2016 18:02:38 -0800
From: Shaohua Li
To: NeilBrown
Cc: linux-raid@vger.kernel.org, linux-block@vger.kernel.org,
    Christoph Hellwig, linux-kernel@vger.kernel.org, hare@suse.de
Subject: Re: [PATCH/RFC] add "failfast" support for raid1/raid10.
Message-ID: <20161122020238.qtuxwo5etcwmts4r@kernel.org>
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

On Fri, Nov 18, 2016 at 04:16:11PM +1100, Neil Brown wrote:
> Hi,
>
> I've been sitting on these patches for a while because although they
> solve a real problem, it is a fairly limited use-case, and I don't
> really like some of the details.
>
> So I'm posting them as RFC in the hope that a different perspective
> might help me like them better, or find a better approach.
>
> The core idea is that when you have multiple copies of data
> (i.e. mirrored drives) it doesn't make sense to wait for a read from
> a drive that seems to be having problems. It will probably be faster
> to just cancel that read, and read from the other device.
> Similarly, in some circumstances, it might be better to fail a drive
> that is being slow to respond to writes, rather than cause all writes
> to be very slow.
>
> The particular context where this comes up is when mirroring across
> storage arrays, where the storage arrays can temporarily take an
> unusually long time to respond to requests (firmware updates have
> been mentioned). As the array will have redundancy internally, there
> is little risk to the data. The mirrored pair is really only for
> disaster recovery, and it is deemed better to lose the last few
> minutes of updates in the case of a serious disaster, rather than
> occasionally having latency issues because one array needs to do some
> maintenance for a few minutes. The particular storage arrays in
> question are DASD devices which are part of the s390 ecosystem.
>
> The Linux block layer has "failfast" flags to direct drivers to fail
> more quickly. These patches allow devices in an md array to be given
> a "failfast" flag, which will cause IO requests to be marked as
> "failfast" provided there is another device available. Once the
> array becomes degraded, we stop using failfast, as that could result
> in data loss.
>
> I don't like the whole "failfast" concept because it is not at all
> clear how fast "fast" is. In fact, these block-layer flags are
> really a misnomer. They should be "noretry" flags.
> REQ_FAILFAST_DEV means "don't retry requests which reported an error
> which seems to come from the device".
> REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
> indicate a problem with the transport, rather than the device".
> REQ_FAILFAST_DRIVER means .... I'm not exactly sure.
> I think it means whatever a particular driver wants it to mean,
> basically "I cannot seem to handle this right now, just resend and
> I'll probably be more in control next time". It seems to be for
> internal use only.
>
> Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
> btrfs uses REQ_FAILFAST_DEV only (for read-ahead), which doesn't seem
> to make sense ... why would you ever use _DEV without _TRANSPORT?
>
> None of these actually change the timeouts in the driver or in the
> device, which is what I would expect for "failfast", so to get real
> "fast failure" you need to enable failfast and adjust the timeouts.
> That is what we do for our customers with DASD.
>
> Anyway, it seems to make sense to use _TRANSPORT and _DEV for
> requests from md where there is somewhere to fall back on.
> If we get an error from a "failfast" request, and the array is still
> non-degraded, we just fail the device. We don't try to repair read
> errors (which is pointless on storage arrays).
>
> It is assumed that some user-space code will notice the failure,
> monitor the device to see when it becomes available again, and then
> --re-add it. Assuming the array has a bitmap, the --re-add should be
> fast and the array will become optimal again without experiencing
> excessive latencies.
>
> My two main concerns are:
> - does this functionality have any use-case outside of mirrored
>   storage arrays, and are there other storage arrays which
>   occasionally insert excessive latency (seems like a serious
>   misfeature to me, but I know few of the details)?
> - would it be at all possible to have "real" failfast functionality
>   in the block layer? I.e. something that is based on time rather
>   than retry count. Maybe in some cases a retry would be
>   appropriate if the first failure was very fast.
>   I.e. it would reduce timeouts and decide on retries based on
>   elapsed time rather than number of attempts.
>   With this would come the question of "how fast is fast", and I
>   don't have a really good answer. Maybe md would need to set a
>   timeout, which it would double whenever it got failures on all
>   drives. Otherwise the timeout would drift towards (say) 10 times
>   the typical response time.
>
> So: comments most welcome. As I say, this does address a genuine
> need. I just find it hard to like it :-(

The patches look good. As long as they don't break a normal raid
array (and they don't if the superblock bit isn't set), I have no
objection to applying them even though they are for a special use
case. I'll add the series to the next tree.

Just curious: will FAILFAST increase the chance of IO failure?

Thanks,
Shaohua
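
A minimal user-space sketch of the time-based idea in Neil's second
concern above: the failfast timeout doubles whenever a request fails on
all drives, and otherwise drifts toward roughly 10x the typical
(smoothed) response time. All names and constants below are hypothetical
illustrations of that proposal, not code from the posted patches or from
md itself.

	/*
	 * Toy model of a time-based "failfast" timeout: double it when a
	 * request failed on every mirror, otherwise let it decay toward
	 * ~10x the smoothed typical response time.
	 */
	#include <stdio.h>

	struct adaptive_timeout {
		double typical_ms;	/* smoothed typical response time */
		double timeout_ms;	/* current failfast timeout */
	};

	/* Record a successful I/O; drift the timeout toward 10x typical. */
	static void record_success(struct adaptive_timeout *at, double latency_ms)
	{
		double target;

		/* exponentially weighted moving average of response time */
		at->typical_ms = 0.9 * at->typical_ms + 0.1 * latency_ms;

		target = 10.0 * at->typical_ms;
		at->timeout_ms = 0.9 * at->timeout_ms + 0.1 * target;
	}

	/* Called when a request failed on every mirror: back off aggressively. */
	static void record_all_failed(struct adaptive_timeout *at)
	{
		at->timeout_ms *= 2.0;
	}

	int main(void)
	{
		struct adaptive_timeout at = { .typical_ms = 5.0, .timeout_ms = 50.0 };
		int i;

		for (i = 0; i < 20; i++)
			record_success(&at, 5.0);	/* healthy array */
		printf("steady-state timeout: %.1f ms\n", at.timeout_ms);

		record_all_failed(&at);			/* all mirrors struggled */
		printf("after total failure:  %.1f ms\n", at.timeout_ms);

		return 0;
	}

How to seed the initial timeout, and whether such logic would belong in
md or in the block layer, are exactly the open questions raised above.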