From:   Benjamin Coddington <bcodding@redhat.com>
To:     Trond Myklebust <trondmy@kernel.org>
Cc:     Anna Schumaker <anna@kernel.org>, linux-nfs@vger.kernel.org
Subject: Re: [PATCH] NFSv4: Ensure we revalidate data after OPEN expired
Date:   Tue, 14 Feb 2023 16:00:29 -0500
Message-ID: <EF0BC155-418B-4047-B162-6A829C693974@redhat.com>
In-Reply-To: <a44dfa9367526593f1be28dad281e2b6d50aaa2e.camel@kernel.org>
References: <7e97897a29878a56236ef8e15bce7a295d5e8a41.1676403514.git.bcodding@redhat.com>
 <a44dfa9367526593f1be28dad281e2b6d50aaa2e.camel@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk

On 14 Feb 2023, at 15:30, Trond Myklebust wrote:

> On Tue, 2023-02-14 at 14:39 -0500, Benjamin Coddington wrote:
>> We've observed that if the NFS client experiences a network partition
>> and
>> the server revokes the client's state, the client may not revalidate
>> cached
>> data for an open file during recovery.  If the file is extended by a
>> second
>> client during this network partition, the first client will correctly
>> update the file's size and attributes during recovery, but another
>> extending write will discard the second client's data.
>
> I'm having trouble fully understanding your problem description. Is the
> issue that both clients are opening the same file with something like
> O_WRONLY|O_DSYNC|O_APPEND?

Yes.

> If so, what if the network partition happens during the write() system
> call of client 1, so that the page cache is updated but the flush of
> the write data ends up being delayed by the partition?
> In that case, client 2 doesn't know that client 1 has writes
> outstanding so it may write its data to where the server thinks the eof
> offset is. However once client 1 is able to recover its open state, it
> will still have dirty page cache data that is going to overwrite that
> same offset.

Ah, yes.  :(   In this case we might be safe saying that close-to-open
consistency is preserved, though, if the second client has closed the file.
At least client 1 will have the data laid out in the file the way it
expected.

>> In the case where another client opened the file during the network
>> partition and the server revoked the first client's state, the
>> recovery can
>> forego optimizations and instead attempt to avoid corruption.
>>
>> It's a little tricky to solve this in a per-file way during recovery
>> without plumbing arguments or introducing new flags.  This patch
>> side-steps
>> the per-file complexity by simply checking if the client is within a
>> NOGRACE recovery window, and if so, invalidates data during the open
>> recovery.
>>
>
> I don't see how this scenario can ever be made fully safe. If people
> care, then we should probably have the open recovery of client 1 fail
> altogether in this case (subject to some switch similar to the existing
> 'recover_lost_locks' kernel option).

Its quite a corner case.

I actually don't have a broken setup yet.  We noticed this behavior appear
in a regression test now that knfsd will both: 1) give a read delegation for
RW open and 2) forget client state if a delegation callback times out.
Otherwise, without the delegation, the server doesn't send the client back
through recovery and the appending open/write happens at the correct eof.

Maybe that server behavior will trigger some real problem reports, and I
should look into making the server a little nicer to the client for a short
partition.

Ben