2022-04-27 09:23:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher <[email protected]> wrote:
>
> Btrfs has a comment in that place that reads:
>
> /* No increment (+=) because iomap returns a cumulative value. */

What a truly horrid interface. But you are triggering repressed
childhood memories:

> That's so that we can complete the tail of an asynchronous write
> asynchronously after a failed page fault and subsequent fault-in.

yeah, that makes me go "I remember something like that".

> That would be great, but applications don't seem to be able to cope
> with short direct writes, so we must turn partial failure into total
> failure here. There's at least one xfstest that checks for that as
> well.

What a complete crock. You're telling me that you did some writes, but
then you can't tell user space that writes happened because that would
confuse things.

Direct-IO is some truly hot disgusting garbage.

Happily it's only used for things like databases that nobody sane
would care about anyway.

Anyway, none of that makes any sense, since you do this:

ret = gfs2_file_direct_write(iocb, from, &gh);
if (ret < 0 || !iov_iter_count(from))
goto out_unlock;

iocb->ki_flags |= IOCB_DSYNC;
buffered = gfs2_file_buffered_write(iocb, from, &gh);

so you're saying that the direct write will never partially succeed,
but then in gfs2_file_write_iter() it very much looks like it's
falling back to buffered writes for that case.

Hmm. Very odd.

I assume this is all due to that

/* Silently fall back to buffered I/O when writing beyond EOF */
if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))

thing, so this all makes some perverse kind of sense, but it still
looks like this code just needs some serious serious commentary.

So you *can* have a partial write if you hit the end of the file, and
then you'll continue that partial write with the buffered code.

But an actual _failure_ will not do that, and instead return an error
even if the write was partially done.

But then *some* failures aren't failures at all, and without any
comments do this

if (ret == -ENOTBLK)
ret = 0;


And remind me again - this all is meant for applications that
supposedly care about consistency on disk?

And the xfs tests actually have a *test* for that case, to make sure
that nobody can sanely *really* know how much of the write succeeded
if it was a DIO write?

Gotcha.

> > The reason I think I'm full of sh*t is that you say that the problem
> > occurs in gfs2_file_buffered_write(), not in that
> > gfs2_file_direct_write() case.
>
> Right, we're having that issue with buffered writes.

I have to say, compared to all the crazy things I see in the DIO path,
the buffered write path actually looks almost entirely sane.

Of course, gfs2_file_read_iter() counts how many bytes it has read in
a variable called 'written', and gfs2_file_buffered_write() counts the
bytes it has written in a variable called 'read', so "entirely sane"
is all very very relative.

I'm sure there's some good reason (job security?) for all this insanity.

But I now have to go dig my eyes out with a rusty spoon.

But before I do that, I have one more question (I'm going to regret
this, aren't I?):

In gfs2_file_buffered_write(), when it has done that
fault_in_iov_iter_readable(), and it decides that that succeeded, it
does

if (gfs2_holder_queued(gh))
goto retry_under_glock;
if (read && !(iocb->ki_flags & IOCB_DIRECT))
goto out_uninit;
goto retry;

so if it still has that lock (if I understand correctly), it will always retry.

But if it *lost* the lock, it will retry only if was a direct write,
and return a partial success for a regular write() rather than
continue the write.

Whaa? I'm assuming that this is more of that "direct writes have to
succeed fully or not at all", but according to POSIX *regular* writes
should also succeed fully, unless some error happens.

Losing the lock doesn't sound like an error to me.

And there are a lot of applications that do assume "write to a regular
file didn't complete fully means that the disk is full" etc. Because
that's the traditional meaning.

This doesn't seem to be related to any data corruption issue, but it does smell.

Linus


2022-04-27 12:53:57

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 1:33 AM Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Btrfs has a comment in that place that reads:
> >
> > /* No increment (+=) because iomap returns a cumulative value. */
>
> What a truly horrid interface. But you are triggering repressed
> childhood memories:
>
> > That's so that we can complete the tail of an asynchronous write
> > asynchronously after a failed page fault and subsequent fault-in.
>
> yeah, that makes me go "I remember something like that".
>
> > That would be great, but applications don't seem to be able to cope
> > with short direct writes, so we must turn partial failure into total
> > failure here. There's at least one xfstest that checks for that as
> > well.
>
> What a complete crock. You're telling me that you did some writes, but
> then you can't tell user space that writes happened because that would
> confuse things.
>
> Direct-IO is some truly hot disgusting garbage.
>
> Happily it's only used for things like databases that nobody sane
> would care about anyway.
>
> Anyway, none of that makes any sense, since you do this:
>
> ret = gfs2_file_direct_write(iocb, from, &gh);
> if (ret < 0 || !iov_iter_count(from))
> goto out_unlock;
>
> iocb->ki_flags |= IOCB_DSYNC;
> buffered = gfs2_file_buffered_write(iocb, from, &gh);
>
> so you're saying that the direct write will never partially succeed,
> but then in gfs2_file_write_iter() it very much looks like it's
> falling back to buffered writes for that case.
>
> Hmm. Very odd.
>
> I assume this is all due to that
>
> /* Silently fall back to buffered I/O when writing beyond EOF */
> if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
>
> thing, so this all makes some perverse kind of sense, but it still
> looks like this code just needs some serious serious commentary.

Yes, that, as well as writing into holes.

During direct writes, gfs2 is holding the inode glock in a special
"shared writable" mode which works like the usual "shared readable"
mode as far as metadata goes, but can be held by multiple "shared
writers" at the same time. This allows multiple nodes to write to the
storage device concurrently as long as the file is preallocated (i.e.,
databases). When it comes to allocations, it falls back to "exclusive"
locking and buffered writes.

> So you *can* have a partial write if you hit the end of the file, and
> then you'll continue that partial write with the buffered code.
>
> But an actual _failure_ will not do that, and instead return an error
> even if the write was partially done.
>
> But then *some* failures aren't failures at all, and without any
> comments do this
>
> if (ret == -ENOTBLK)
> ret = 0;
>
>
> And remind me again - this all is meant for applications that
> supposedly care about consistency on disk?
>
> And the xfs tests actually have a *test* for that case, to make sure
> that nobody can sanely *really* know how much of the write succeeded
> if it was a DIO write?
>
> Gotcha.

I agree that it's pretty sad.

> > > The reason I think I'm full of sh*t is that you say that the problem
> > > occurs in gfs2_file_buffered_write(), not in that
> > > gfs2_file_direct_write() case.
> >
> > Right, we're having that issue with buffered writes.
>
> I have to say, compared to all the crazy things I see in the DIO path,
> the buffered write path actually looks almost entirely sane.
>
> Of course, gfs2_file_read_iter() counts how many bytes it has read in
> a variable called 'written', and gfs2_file_buffered_write() counts the
> bytes it has written in a variable called 'read', so "entirely sane"
> is all very very relative.
>
> I'm sure there's some good reason (job security?) for all this insanity.

Point taken; I'll fix this up.

> But I now have to go dig my eyes out with a rusty spoon.
>
> But before I do that, I have one more question (I'm going to regret
> this, aren't I?):
>
> In gfs2_file_buffered_write(), when it has done that
> fault_in_iov_iter_readable(), and it decides that that succeeded, it
> does
>
> if (gfs2_holder_queued(gh))
> goto retry_under_glock;
> if (read && !(iocb->ki_flags & IOCB_DIRECT))
> goto out_uninit;
> goto retry;
>
> so if it still has that lock (if I understand correctly), it will always retry.
>
> But if it *lost* the lock, it will retry only if was a direct write,
> and return a partial success for a regular write() rather than
> continue the write.
>
> Whaa? I'm assuming that this is more of that "direct writes have to
> succeed fully or not at all", but according to POSIX *regular* writes
> should also succeed fully, unless some error happens.
>
> Losing the lock doesn't sound like an error to me.

Regular (buffered) reads and writes are expected to be atomic with
respect to each other. That atomicity comes from holding the lock.
When we lose the lock, we can observe atomicity and return a short
result, ignore atomicity and return the full result, or retry the
entire operation. Which of those three options would you prefer?

For what it's worth, none of this matters as long as there's no lock
contention across the cluster.

> And there are a lot of applications that do assume "write to a regular
> file didn't complete fully means that the disk is full" etc. Because
> that's the traditional meaning.
>
> This doesn't seem to be related to any data corruption issue, but it does smell.
>
> Linus

Thanks,
Andreas

2022-04-27 17:40:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher <[email protected]> wrote:
>
> Regular (buffered) reads and writes are expected to be atomic with
> respect to each other.

Linux has actually never honored that completely broken POSIX
requirement, although I think some filesystems (notably XFS) have
tried.

It's a completely broken concept. It's not possible to honor atomicity
with mmap(), and nobody has *ever* cared.

And it causes huge amounts of problems and basically makes any sane
locking entirely impossible.

The fact that you literally broke regular file writes in ways that are
incompatible with (much MUCH more important) POSIX file behavior to
try to get that broken read/write atomicity is only one example among
many for why that alleged rule just has to be ignored.

We do honor the PIPE_BUF atomicity on pipes, which is a completely
different kind of atomicity wrt read/write, and doesn't have the
fundamental issues that arbitrary regular file reads/writes have.

There is absolutely no sane way to do that file atomicity wrt
arbitrary read/write calls (*), and you shouldn't even try.

That rule needs to be forgotten about, and buried 6ft deep.

So please scrub any mention of that idiotic rule from documentation,
and from your brain.

And please don't break "partial write means disk full or IO error" due
to trying to follow this broken rule, which was apparently what you
did.

Because that "regular file read/write is done in full" is a *MUCH*
more important rule, and there is a shitton of applications that most
definitely depend on *that* rule.

Just go to debian code search, and look for

"if (write("

and you'll get thousands of hits, and on the first page of hits 9 out
of 10 of the hits are literally about that "partial write is an
error", eg code like this:

if (write(fd,&triple,sizeof(triple)) != sizeof(triple))
reporterr(1,NULL);

from libreoffice.

Linus

(*) Yeah, if you never care about performance(**) of mixed read/write,
and you don't care about mmap, and you have no other locking issues,
it's certainly possible. The old rule came about from original UNIX
literally taking an inode lock around the whole IO access, because
that was simple, and back in the days you'd never have multiple
concurrent readers/writers anyway.

(**) It's also instructive how O_DIRECT literally throws that rule
away, and then some direct-IO people said for years that direct-IO is
superior and used this as one of their arguments. Probably the same
people who thought that "oh, don't report partial success", because we
can't deal with it.

2022-04-27 22:40:25

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 7:13 PM Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Regular (buffered) reads and writes are expected to be atomic with
> > respect to each other.
>
> Linux has actually never honored that completely broken POSIX
> requirement, although I think some filesystems (notably XFS) have
> tried.

Okay, I can happily live with that.

I wonder if this could be documented in the read and write manual
pages. Or would that be asking too much?

> It's a completely broken concept. It's not possible to honor atomicity
> with mmap(), and nobody has *ever* cared.
>
> And it causes huge amounts of problems and basically makes any sane
> locking entirely impossible.
>
> The fact that you literally broke regular file writes in ways that are
> incompatible with (much MUCH more important) POSIX file behavior to
> try to get that broken read/write atomicity is only one example among
> many for why that alleged rule just has to be ignored.
>
> We do honor the PIPE_BUF atomicity on pipes, which is a completely
> different kind of atomicity wrt read/write, and doesn't have the
> fundamental issues that arbitrary regular file reads/writes have.
>
> There is absolutely no sane way to do that file atomicity wrt
> arbitrary read/write calls (*), and you shouldn't even try.
>
> That rule needs to be forgotten about, and buried 6ft deep.
>
> So please scrub any mention of that idiotic rule from documentation,
> and from your brain.
>
> And please don't break "partial write means disk full or IO error" due
> to trying to follow this broken rule, which was apparently what you
> did.
>
> Because that "regular file read/write is done in full" is a *MUCH*
> more important rule, and there is a shitton of applications that most
> definitely depend on *that* rule.
>
> Just go to debian code search, and look for
>
> "if (write("
>
> and you'll get thousands of hits, and on the first page of hits 9 out
> of 10 of the hits are literally about that "partial write is an
> error", eg code like this:
>
> if (write(fd,&triple,sizeof(triple)) != sizeof(triple))
> reporterr(1,NULL);
>
> from libreoffice.
>
> Linus
>
> (*) Yeah, if you never care about performance(**) of mixed read/write,
> and you don't care about mmap, and you have no other locking issues,
> it's certainly possible. The old rule came about from original UNIX
> literally taking an inode lock around the whole IO access, because
> that was simple, and back in the days you'd never have multiple
> concurrent readers/writers anyway.
>
> (**) It's also instructive how O_DIRECT literally throws that rule
> away, and then some direct-IO people said for years that direct-IO is
> superior and used this as one of their arguments. Probably the same
> people who thought that "oh, don't report partial success", because we
> can't deal with it.
>

Thanks a lot,
Andreas