Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753587AbcD2MRT (ORCPT ); Fri, 29 Apr 2016 08:17:19 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:34960 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753325AbcD2MRR (ORCPT ); Fri, 29 Apr 2016 08:17:17 -0400 To: Rafael David Tinoco Cc: Ben Hutchings , linux-kernel@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, Paolo Bonzini , "Martin K. Petersen" , Christoph Hellwig Subject: Re: [PATCH 3.16 106/217] sd: disable discard_zeroes_data for UNMAP From: "Martin K. Petersen" Organization: Oracle Corporation References: <1461859893.1169.49.camel@decadent.org.uk> Date: Fri, 29 Apr 2016 08:16:54 -0400 In-Reply-To: (Rafael David Tinoco's message of "Fri, 29 Apr 2016 01:00:37 -0300") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Source-IP: userv0022.oracle.com [156.151.31.74] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2753 Lines: 64 >>>>> "Rafael" == Rafael David Tinoco writes: Rafael, Rafael> And this happened because the storage in question didn't set Rafael> properly "max_ws_blocks" (it was 0) from VPD 0xb0 (max_ws_blocks Rafael> is calculated using it). We appear to be talking about a device that has an internal limit but does not advertise it in the dedicated field. Please send me the output of: sg_inq sg_readcap -l sg_vpd -p lbpv sg_vpd -p bl and we'll quirk it. Or feel free to submit a patch. Rafael> Anyway, 2 examples of disk servers that had problems after this Rafael> change. IMHO the change is good for regular kernel development, Rafael> and it does guarantee further READ CMDs to read zeros from LBAs, Rafael> but, it jeopardises already functioning storage servers. ... as opposed to jeopardizing the integrity of people's data? Rafael> Without properly setting NDOB bit in WRITESAME(16), the data Rafael> buffer will be read on every SCSI WRITESAME(16) command and that Rafael> will impact the "discard method" performance (will probably be Rafael> slower than regular UNMAP command). NDOB is a recent performance optimization for high performance SSD devices that allows us to skip sending the zeroed data buffer. However, there is no way to tell whether a device supports NDOB without sending a WRITE SAME command which has the nasty side effect of being destructive. I am also not aware of any devices that actually support it yet. Whether to set NDOB or not is completely orthogonal to the issue at hand which is whether to use the UNMAP command or WRITE SAME with the UNMAP bit set. WRITE SAME w/ UNMAP has been around for a long, long time. WRITE SAME was used for thin provisioned devices before the UNMAP command even existed. Currently the filesystem code relies on being able to get predictable results for zeroing metadata blocks etc. So if a device advertises that it supports LBPRZ we'll use WRITE SAME with the UNMAP bit set since it provides hard guarantees. Unlike the UNMAP command which by definition is advisory. As I mentioned earlier, I have some changes in the pipeline that will separate the ioctl implementation from the library functions. That will allow us to use WRITE SAME w/ UNMAP for zeroouts and UNMAP for discards on the same device. However, this is still not going to solve the problem with your device that fails WRITE SAME w/ UNMAP. Because we are not going to discontinue using that combination. Quite the contrary, we are increasingly depending on it. Consequently, if you have a device that has problems in that area we need to quirk it rather than try to work around it in the core code heuristics. -- Martin K. Petersen Oracle Linux Engineering