Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1422869AbaGRPjk (ORCPT ); Fri, 18 Jul 2014 11:39:40 -0400 Received: from mx2.parallels.com ([199.115.105.18]:55628 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1422840AbaGRPjh (ORCPT ); Fri, 18 Jul 2014 11:39:37 -0400 From: James Bottomley To: "Elliott@hp.com" CC: "linux-kernel@vger.kernel.org" , "hch@infradead.org" , "apw@canonical.com" , "devel@linuxdriverproject.org" , "michaelc@cs.wisc.edu" , "kys@microsoft.com" , "axboe@kernel.dk" , "linux-scsi@vger.kernel.org" , "ohering@suse.com" , "gregkh@linuxfoundation.org" , "jasowang@redhat.com" Subject: Re: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout Thread-Topic: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout Thread-Index: AQHPop5zpnJziWUcaUmHYJv2pf72HA== Date: Fri, 18 Jul 2014 15:39:25 +0000 Message-ID: <1405697964.605.27.camel@jarvis.lan> References: <1401899623-24194-1-git-send-email-kys@microsoft.com> <1401901323.17510.23.camel@dabdike> <53911A35.7010805@cs.wisc.edu> <5391F801.4010107@cs.wisc.edu> <1402077167.2207.89.camel@dabdike.int.hansenpartnership.com> <539206FA.1020001@kernel.dk> <5b926a0a9f264edda91c7c2ab0acb7d1@BY2PR03MB299.namprd03.prod.outlook.com> <13807d2cc8744ae1bc374f20d8f9caec@BY2PR0301MB0711.namprd03.prod.outlook.com> <94D0CD8314A33A4D9D801C0FE68B402958BAC6DC@G9W0745.americas.hpqcorp.net> In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B402958BAC6DC@G9W0745.americas.hpqcorp.net> Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [50.46.103.107] Content-Type: text/plain; charset="utf-8" Content-ID: <2F1C3D1BA5238844975A36DC2D078972@sw.swsoft.com> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id s6IFdjOn016988 On Fri, 2014-07-18 at 00:51 +0000, Elliott, Robert (Server Storage) wrote: > In sd_sync_cache: > rq->timeout *= SD_FLUSH_TIMEOUT_MULTIPLIER; > > Regardless of the baseline for the multiplication, a magic > number of 2 is too arbitrary. That might work for an > individual drive, but could be far too short for a RAID > controller that runs into worst case error handling for > the drives to which it is flushing (e.g., if its cache > is volatile and the drives all have recoverable errors > during writes). That time goes up with a bigger cache and > with more fragmented write data in the cache requiring > more individual WRITE commands. RAID devices with gigabytes of cache are usually battery backed and lie about their cache type (they usually claim write through). This behaviour is exactly what we want because the flush is used to enforce write barriers for filesystem consistency, so we only want to flush volatile caches. The rare case you cite, where the RAID device is volatile and having difficulty flushing, we do want a failure because otherwise the filesystem will become unusable and the system will live lock ... barriers are sent down frequently under normal filesystem operation. The SCSI standards committees have been struggling with defining and detecting the difference between volatile and non-volatile cache and the heuristics we'd have to use to avoid annoying USB devices with detecting this don't look pretty. We always zero the SYNC_NV bit anyway, so even if the devices stopped lying, we'd be safe. > A better value would be the Recommended Command Timeout field > value reported in the REPORT SUPPORTED OPERATION CODES command, > if reported by the device server. That is supposed to account > for the worst case. > > For cases where that is not reported, exposing the multiplier > in sysfs would let the timeout for simple devices be set to > smaller values than complex devices. > > Also, in both sd_setup_flush_cmnd and sd_sync_cache: > cmd->cmnd[0] = SYNCHRONIZE_CACHE; > cmd->cmd_len = 10; > > SYNCHRONIZE CACHE (16) should be favored over SYNCHRONIZE > CACHE (10) unless SYNCHRONIZE CACHE (10) is not supported. For what reason. We usually go for the safe alternatives, which is 10 byte commands because they have the widest testing and greatest level of support. We don't do range flushes currently, so there doesn't seem to be a practical different. If we did support range flushes, we'd likely only use SC(16) on >2TB devices. James ????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?