From: Andreas Gruenbacher
Date: Wed, 27 Apr 2022 14:29:18 +0200
Subject: Re: [GIT PULL] gfs2 fix
To: Linus Torvalds
Cc: cluster-devel, Linux Kernel Mailing List
References: <20220426145445.2282274-1-agruenba@redhat.com>

On Wed, Apr 27, 2022 at 1:33 AM Linus Torvalds wrote:
> On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher wrote:
> >
> > Btrfs has a comment in that place that reads:
> >
> >     /* No increment (+=) because iomap returns a cumulative value. */
>
> What a truly horrid interface. But you are triggering repressed
> childhood memories:
>
> > That's so that we can complete the tail of an asynchronous write
> > asynchronously after a failed page fault and subsequent fault-in.
>
> yeah, that makes me go "I remember something like that".
>
> > That would be great, but applications don't seem to be able to cope
> > with short direct writes, so we must turn partial failure into total
> > failure here. There's at least one xfstest that checks for that as
> > well.
>
> What a complete crock. You're telling me that you did some writes, but
> then you can't tell user space that writes happened because that would
> confuse things.
>
> Direct-IO is some truly hot disgusting garbage.
>
> Happily it's only used for things like databases that nobody sane
> would care about anyway.
>
> Anyway, none of that makes any sense, since you do this:
>
>         ret = gfs2_file_direct_write(iocb, from, &gh);
>         if (ret < 0 || !iov_iter_count(from))
>                 goto out_unlock;
>
>         iocb->ki_flags |= IOCB_DSYNC;
>         buffered = gfs2_file_buffered_write(iocb, from, &gh);
>
> so you're saying that the direct write will never partially succeed,
> but then in gfs2_file_write_iter() it very much looks like it's
> falling back to buffered writes for that case.
>
> Hmm. Very odd.
>
> I assume this is all due to that
>
>         /* Silently fall back to buffered I/O when writing beyond EOF */
>         if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
>
> thing, so this all makes some perverse kind of sense, but it still
> looks like this code just needs some serious, serious commentary.

Yes, that, as well as writing into holes. During direct writes, gfs2
holds the inode glock in a special "shared writable" mode which works
like the usual "shared readable" mode as far as metadata goes, but can
be held by multiple "shared writers" at the same time. This allows
multiple nodes to write to the storage device concurrently as long as
the file is preallocated (e.g., databases). When a write requires
allocations, gfs2 falls back to "exclusive" locking and buffered
writes.
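To make that decision concrete, here's a minimal stand-alone sketch of
the lock-mode choice described above. This is hypothetical illustration
code, not the actual fs/gfs2/file.c logic; all names in it are invented:

    #include <stdbool.h>
    #include <stdio.h>

    enum lock_mode { LM_SHARED_WRITABLE, LM_EXCLUSIVE };

    /*
     * A direct write may stay under the "shared writable" glock only
     * while it overwrites already-allocated blocks; extending the file
     * or filling a hole changes metadata, so it needs the "exclusive"
     * mode and falls back to buffered I/O.
     */
    static enum lock_mode choose_lock_mode(long long pos, long long count,
                                           long long i_size, bool hits_hole)
    {
        if (pos + count > i_size || hits_hole)
            return LM_EXCLUSIVE;       /* allocating: buffered fallback */
        return LM_SHARED_WRITABLE;     /* concurrent writers allowed */
    }

    int main(void)
    {
        /* Overwrite within a preallocated 1 MiB file: shared writable (0). */
        printf("%d\n", choose_lock_mode(4096, 4096, 1 << 20, false));
        /* Write beyond EOF: exclusive (1). */
        printf("%d\n", choose_lock_mode(1 << 20, 4096, 1 << 20, false));
        return 0;
    }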
> So you *can* have a partial write if you hit the end of the file, and
> then you'll continue that partial write with the buffered code.
>
> But an actual _failure_ will not do that, and instead return an error
> even if the write was partially done.
>
> But then *some* failures aren't failures at all, and without any
> comments do this
>
>         if (ret == -ENOTBLK)
>                 ret = 0;
>
> And remind me again - this all is meant for applications that
> supposedly care about consistency on disk?
>
> And the xfs tests actually have a *test* for that case, to make sure
> that nobody can sanely *really* know how much of the write succeeded
> if it was a DIO write?
>
> Gotcha.

I agree that it's pretty sad.

> > > The reason I think I'm full of sh*t is that you say that the problem
> > > occurs in gfs2_file_buffered_write(), not in that
> > > gfs2_file_direct_write() case.
> >
> > Right, we're having that issue with buffered writes.
>
> I have to say, compared to all the crazy things I see in the DIO path,
> the buffered write path actually looks almost entirely sane.
>
> Of course, gfs2_file_read_iter() counts how many bytes it has read in
> a variable called 'written', and gfs2_file_buffered_write() counts the
> bytes it has written in a variable called 'read', so "entirely sane"
> is all very very relative.
>
> I'm sure there's some good reason (job security?) for all this insanity.

Point taken; I'll fix this up.

> But I now have to go dig my eyes out with a rusty spoon.
>
> But before I do that, I have one more question (I'm going to regret
> this, aren't I?):
>
> In gfs2_file_buffered_write(), when it has done that
> fault_in_iov_iter_readable(), and it decides that that succeeded, it
> does
>
>         if (gfs2_holder_queued(gh))
>                 goto retry_under_glock;
>         if (read && !(iocb->ki_flags & IOCB_DIRECT))
>                 goto out_uninit;
>         goto retry;
>
> so if it still has that lock (if I understand correctly), it will
> always retry.
>
> But if it *lost* the lock, it will retry only if it was a direct
> write, and return a partial success for a regular write() rather than
> continue the write.
>
> Whaa? I'm assuming that this is more of that "direct writes have to
> succeed fully or not at all", but according to POSIX *regular* writes
> should also succeed fully, unless some error happens.
>
> Losing the lock doesn't sound like an error to me.

Regular (buffered) reads and writes are expected to be atomic with
respect to each other. That atomicity comes from holding the lock.
When we lose the lock, we can preserve atomicity and return a short
result, ignore atomicity and return the full result, or retry the
entire operation (there's a rough sketch of the three choices at the
end of this mail). Which of those three options would you prefer?

For what it's worth, none of this matters as long as there's no lock
contention across the cluster.

> And there are a lot of applications that do assume "write to a regular
> file didn't complete fully means that the disk is full" etc. Because
> that's the traditional meaning.
>
> This doesn't seem to be related to any data corruption issue, but it
> does smell.
>
>                  Linus

Thanks,
Andreas
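P.S.: For reference, here's a toy model of the three choices mentioned
above. This is hypothetical code, not what fs/gfs2/file.c actually
does; the names are invented for illustration:

    #include <stdio.h>

    enum lost_lock_policy {
        POLICY_SHORT,   /* preserve atomicity: report what we wrote */
        POLICY_FULL,    /* drop atomicity: finish the write, report it all */
        POLICY_RETRY,   /* re-take the lock, redo the whole operation */
    };

    /*
     * 'done' is the number of bytes written before the lock was lost,
     * 'total' the size of the original request.  Returns the count to
     * report to user space, or -1 meaning "start over under the lock".
     */
    static long after_lock_loss(enum lost_lock_policy policy,
                                long done, long total)
    {
        switch (policy) {
        case POLICY_SHORT:
            return done;    /* short write: atomic, but partial */
        case POLICY_FULL:
            return total;   /* full count, but no longer one atomic op */
        case POLICY_RETRY:
        default:
            return -1;      /* caller loops back to its 'retry' label */
        }
    }

    int main(void)
    {
        /* Lost the lock after 4096 of 8192 bytes: */
        printf("%ld\n", after_lock_loss(POLICY_SHORT, 4096, 8192));
        printf("%ld\n", after_lock_loss(POLICY_RETRY, 4096, 8192));
        return 0;
    }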