MIME-Version: 1.0
In-Reply-To: <1461859893.1169.49.camel@decadent.org.uk>
References: <lsq.1461711740.434915673@decadent.org.uk> <lsq.1461711741.56727123@decadent.org.uk>
 <CAJE_dJxxKVdKy7E=zb3byA9pqJ5v7H+qdx9CTaZ_yof-9n3NgA@mail.gmail.com> <1461859893.1169.49.camel@decadent.org.uk>
From: Rafael David Tinoco <rafael.tinoco@canonical.com>
Date: Fri, 29 Apr 2016 01:00:37 -0300
Message-ID: <CAJE_dJzD6CxJXjUwXnAZHOdA__3Sx-E5dTWcDb0+CGqO2-7HgQ@mail.gmail.com>
Subject: Re: [PATCH 3.16 106/217] sd: disable discard_zeroes_data for UNMAP
To: Ben Hutchings <ben@decadent.org.uk>
Cc: linux-kernel@vger.kernel.org, stable@vger.kernel.org,
        akpm@linux-foundation.org, Paolo Bonzini <pbonzini@redhat.com>,
        "Martin K. Petersen" <martin.petersen@oracle.com>,
        Christoph Hellwig <hch@lst.de>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3152
Lines: 73

Actually, It was an objection.

Knowing that WRITESAME(16), used as the discard mechanism, can cause
storage servers to misbehave (like QEMU's SCSI WRITESAME
implementation, workaround-ed by commit e461338b6cd4) and those
storage servers can't  rely on LBPRZ flag to opt out from WRITESAME as
discard mechanism (like QEMU does) since it is out of spec...

I have also seen storage servers miss-behaving with this specific
change (when changing from kernel 3.13 to 3.19, for example):

[21354.827291] Write same(16): 93 08 00 00 00 00 00 00 80 00 00 40 00 00 00 00
...
[21420.471648] sd 0:0:2:1: [sdw] FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[21420.471665] sd 0:0:2:1: [sdw] Sense Key : Illegal Request [current]
[21420.471670] sd 0:0:2:1: [sdw] Add. Sense: Invalid field in cdb

And this happened because the storage in question didn't set properly
"max_ws_blocks" (it was 0) from VPD 0xb0 (max_ws_blocks is calculated
using it).

Anyway, 2 examples of disk servers that had problems after this
change. IMHO the change is good for regular kernel development, and it
does guarantee further READ CMDs to read zeros from LBAs, but, it
jeopardises already functioning storage servers.

If that argument isn't enough,

Without properly setting NDOB bit in WRITESAME(16), the data buffer
will be read on every SCSI WRITESAME(16) command and that will impact
the "discard method" performance (will probably be slower than regular
UNMAP command).

So far, I'm seeing 2 motives why it shouldn't be on older kernels.

On Thu, Apr 28, 2016 at 1:11 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Wed, 2016-04-27 at 17:43 -0300, Rafael David Tinoco wrote:
>> It seems that changing discard method from UNMAP to WRITE SAME(16)
>> without using NDOB bit (as first described in sbc3r35b.pdf) can cause
>> performance problems on big discards (since data-out buffer will be
>> checked for every WRITE SAME command). I think this is happening after
>> this commit, since NDOB bit wasn't implemented with this change
>> (afaik, iirc).
>
> Is that an objection, or just a comment?
>
> I only picked this commit for backporting because it was referenced by
> later fixes (commits 397737223c59, f4327a95dd08) and I read the commit
> message as saying that it fixes data corruption (sd claims to be
> writing zeroes but the whole area might not read back as zeroes).  Is
> my understanding correct?
>
> Ben.
>
>> From the spec:
>> """
>> To ensure that subsequent read operations return all zeros in a
>> logical block, use the WRITE SAME (16)
>> command with the NDOB bit set to one. If the UNMAP bit is set to one,
>> then the device server may unmap the logical blocks specified by the
>> WRITE SAME (16)
>> """
>>
>> And there were some problems with this change (specifically QEMU SCSI
>> WRITE SAME implementation). So the change (commit e461338b6cd4) was
>> made to guarantee that if LBPRZ=0, after VPD 0xB2, UNMAP is still
>> picked. WRITESAME(16) is picked only if LBPRZ=1. This last commit
>> violated spec in favor of a WRITE SAME "optout" approach for QEMU.
>>
>> I wonder if this should be taken to previous versions ...
>>
>> -Rafael Tinoco