Date: Tue, 16 Jan 2018 10:49:07 +0100
From: Lars Ellenberg <lars.ellenberg@linbit.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Eric Wheeler <drbd-dev@lists.ewheeler.net>,
        drbd-dev@lists.linbit.com, Eric Wheeler <git@linux.ewheeler.net>,
        Eric Wheeler <drbd@linux.ewheeler.net>, stable@vger.kernel.org,
        Philipp Reisner <philipp.reisner@linbit.com>,
        linux-kernel@vger.kernel.org
Subject: Re: [PATCH] drbd: fix discard_zeroes_if_aligned regression
Message-ID: <20180116094907.GD4107@soda.linbit>
Mail-Followup-To: Christoph Hellwig <hch@infradead.org>,
        Eric Wheeler <drbd-dev@lists.ewheeler.net>,
        drbd-dev@lists.linbit.com, Eric Wheeler <git@linux.ewheeler.net>,
        Eric Wheeler <drbd@linux.ewheeler.net>, stable@vger.kernel.org,
        Philipp Reisner <philipp.reisner@linbit.com>,
        linux-kernel@vger.kernel.org
References: <15124635.GA4107@soda.linbit>
 <1516057231-21756-1-git-send-email-drbd-dev@lists.ewheeler.net>
 <20180116072615.GA3940@infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180116072615.GA3940@infradead.org>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org

On Mon, Jan 15, 2018 at 11:26:15PM -0800, Christoph Hellwig wrote:
> NAK.  Calling a discard and expecting zeroing is simply buggy.

What he said.

The bug/misunderstanding was that we now use zeroout even for discards,
with the assumption that it would try to do discards where it can.

Which is even true.

But our expectations of when zeroout "should" use unmap,
and where it actually can do that safely
based on the information it has,
don't really match:
our expectations were wrong, we assumed too much
(in the general case).

Which means in DRBD we have to stop using zeroout for discards,
and again pass down normal discard for discards.

In the specific case where the backend to DRBD is lvm-thin,
AND it does de-alloc blocks on discard,
AND either it does NOT have skip_block_zeroing set, or its backend
does zero on discard and it does discard passdown, and we tell
DRBD about all of that (using the discard_zeroes_if_aligned flag),
then we can do this "zero head and tail, discard full blocks",
and expect them to be zero.
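
To make the "zero head and tail, discard full blocks" split concrete,
here is a minimal sketch in Python. It is not DRBD code: the function
name split_discard, the granularity parameter, and the choice of
sectors as the unit are assumptions made for illustration only.

```python
def split_discard(start, length, granularity):
    """Split [start, start + length) into (head, middle, tail).

    Each part is a (start, length) tuple in the same unit (assumed here
    to be sectors). head and tail are the unaligned partial chunks that
    would be zeroed explicitly; middle covers only whole, aligned chunks
    and is the part that would be passed down as a discard.
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    end = start + length
    # Round the start up and the end down to the chunk granularity.
    aligned_start = ((start + granularity - 1) // granularity) * granularity
    aligned_end = (end // granularity) * granularity
    if aligned_start >= aligned_end:
        # No full chunk inside the range: the whole request is zeroed,
        # nothing is discarded.
        return (start, length), (end, 0), (end, 0)
    head = (start, aligned_start - start)
    middle = (aligned_start, aligned_end - aligned_start)
    tail = (aligned_end, end - aligned_end)
    return head, middle, tail


head, middle, tail = split_discard(100, 1000, 128)
print("zero   ", head)    # unaligned start of the request
print("discard", middle)  # whole chunks only
print("zero   ", tail)    # unaligned end of the request
```

Only the middle part relies on the backend returning zeroes after a
discard, which is why the conditions above matter.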

If skip_block_zeroing is set, however, and there is no discard
passdown in thin, or the backend of thin does not zero on discard,
DRBD can still pass down discards to thin, and expect them to be
de-allocated, but cannot expect discarded ranges to remain
"zero": any later partial write to an unallocated area could pull
in different "undefined" garbage on different replicas for the
not-written-to part of a newly allocated block.

The end result is that you have areas of the block device
that return different data depending on which replica you
read from.
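
A toy model of that effect, as a Python sketch. It is an illustration
under assumptions, not how dm-thin is implemented: the function
provision_and_write, the 8-byte "chunk", and the stale contents are all
made up for the example.

```python
def provision_and_write(stale_chunk, offset, data, zero_new_blocks):
    """Model a partial write into a freshly allocated thin chunk.

    stale_chunk is whatever happened to be on the physical chunk before
    it was allocated. If zero_new_blocks is False (the analogue of
    skip_block_zeroing being set), that stale content stays visible in
    the part of the chunk that is not written.
    """
    if zero_new_blocks:
        chunk = bytearray(len(stale_chunk))
    else:
        chunk = bytearray(stale_chunk)
    chunk[offset:offset + len(data)] = data
    return bytes(chunk)


# The same 2-byte write arrives on two replicas whose newly allocated
# chunks happen to hold different leftovers.
replica_a = provision_and_write(b"AAAAAAAA", 2, b"xy", zero_new_blocks=False)
replica_b = provision_and_write(b"BBBBBBBB", 2, b"xy", zero_new_blocks=False)
print(replica_a != replica_b)  # the replicas now disagree outside the write

# With zeroing of new blocks, both replicas end up identical.
zeroed_a = provision_and_write(b"AAAAAAAA", 2, b"xy", zero_new_blocks=True)
zeroed_b = provision_and_write(b"BBBBBBBB", 2, b"xy", zero_new_blocks=True)
print(zeroed_a == zeroed_b)
```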

But that is the case even without discard in that setup:
don't do that then, or live with it.

"undefined data" is undefined, you have that directly on thin
in that setup already, with DRBD on top you now have several
versions of "undefined".

Anything on top of such a setup must not do "read-modify-write"
of "undefined" data and expect a defined result, adding DRBD
on top does not change that.

DRBD can deal with that just fine, but our "online verify" will
report loads of boring "mismatches" for those areas.

TL;DR: we'll stop doing "discard-is-zeroout"
(our assumptions were wrong).
We still won't do exactly "discard-is-discard", but re-enable our
"discard-is-discard plus zeroout on head and tail", because in
some relevant setups, this gives us the best result, and avoids
the false positives in our online-verify.

    Lars