2022-04-27 11:17:27

by Andreas Gruenbacher

[permalink] [raw]
Subject: [GIT PULL] gfs2 fix

Hi Linus,

please consider pulling the following gfs2 fix for 5.18-rc5.

Also, note that we're fighting with a rare case of data corruption that
seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
page fault deadlocks for buffered I/O"; merged in v5.16). The
corruption seems to occur in gfs2_file_buffered_write() when
fault_in_iov_iter_readable() is involved. We do end up with data that's
written at an offset, as if after a fault-in, the position in the iocb
got out of sync with the position in the iov_iter. The user buffer the
iov_iter points at isn't page aligned, so the corruption also isn't page
aligned. The code all seems correct, and we haven't been able to figure
out what's going on so far.

Thanks,
Andreas

The following changes since commit af2d861d4cd2a4da5137f795ee3509e6f944a25b:

Linux 5.18-rc4 (2022-04-24 14:51:22 -0700)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix

for you to fetch changes up to e57f9af73d6b0ffb5f1aeaf6cec9a751dd8535c9:

gfs2: Don't re-check for write past EOF unnecessarily (2022-04-26 15:38:00 +0200)

----------------------------------------------------------------
gfs2 fix

- Only re-check for direct I/O writes past the end of the file after
re-acquiring the inode glock.

----------------------------------------------------------------
Andreas Gruenbacher (1):
gfs2: Don't re-check for write past EOF unnecessarily

fs/gfs2/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)


2022-04-27 11:21:58

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Tue, Apr 26, 2022 at 8:31 PM Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Also, note that we're fighting with a rare case of data corruption that
> > seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> > page fault deadlocks for buffered I/O"; merged in v5.16). The
> > corruption seems to occur in gfs2_file_buffered_write() when
> > fault_in_iov_iter_readable() is involved. We do end up with data that's
> > written at an offset, as if after a fault-in, the position in the iocb
> > got out of sync with the position in the iov_iter. The user buffer the
> > iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned. The code all seems correct, and we haven't been able to figure
> out what's going on so far.
>
> So this may be stupid, but looking at gfs2_file_direct_write(), I see
> what looks like two different obvious bugs.
>
> This looks entirely wrong:
>
> if (ret > 0)
> read = ret;
>
> because this code is *repeated*.
>
> I think it should use "read += ret;" so that if we have a few
> successful reads, they add up.

Btrfs has a comment in that place that reads:

/* No increment (+=) because iomap returns a cumulative value. */

That's so that we can complete the tail of an asynchronous write
asynchronously after a failed page fault and subsequent fault-in.
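The distinction can be sketched outside the kernel (a toy model with hypothetical names, not the actual gfs2 or iomap code): when the lower layer reports *cumulative* progress across retries, plain assignment is correct, and `read += ret` would double-count earlier progress.

```python
def do_copy(total, progress, step):
    """Model of an operation that returns *cumulative* progress
    across retries, the way the comment says iomap reports it."""
    return min(progress + step, total)

def retry_loop(total, step, accumulate):
    read = 0
    while read < total:
        ret = do_copy(total, read, step)
        if ret <= 0:
            break
        # 'ret' is already the running total, so assignment is right;
        # 'read += ret' counts the earlier progress again.
        read = read + ret if accumulate else ret
    return read

assert retry_loop(10, 3, accumulate=False) == 10   # correct total
assert retry_loop(10, 3, accumulate=True) == 19    # double-counted
```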

> And then at the end:
>
> if (ret < 0)
> return ret;
> return read;
>
> looks wrong too, since maybe the *last* iteration failed, but an
> earlier succeeded, so I think it should be
>
> if (read)
> return read;
> return ret;
>
> But hey, what do I know. I say "looks like obvious bugs", but I don't
> really know the code, and I may just be completely full of sh*t.

That would be great, but applications don't seem to be able to cope
with short direct writes, so we must turn partial failure into total
failure here. There's at least one xfstest that checks for that as
well.
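That convention can be modeled like this (plain Python, hypothetical names, for illustration only): once any chunk of a direct write fails, the whole call reports the error instead of a short byte count.

```python
def direct_write(chunks):
    """chunks: per-chunk results, positive byte counts or a negative
    errno-style error. Returns the total on full success, otherwise
    the error: partial failure becomes total failure, so callers
    never see a short direct write."""
    written = 0
    for ret in chunks:
        if ret < 0:
            return ret          # discard the partial progress
        written += ret
    return written

assert direct_write([4096, 4096]) == 8192
assert direct_write([4096, -5]) == -5   # error reported, not 4096
```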

> The reason I think I'm full of sh*t is that you say that the problem
> occurs in gfs2_file_buffered_write(), not in that
> gfs2_file_direct_write() case.

Right, we're having that issue with buffered writes.

> And the *buffered* case seems to get both of those "obvious bugs" right.
>
> So I must be wrong, but I have to say, that looks odd to me.
>
> Now hopefully somebody who knows the code will explain to me why I'm
> wrong, and in the process go "Duh, but.." and see what's up.

Thanks for pointing out the places that seem odd to you. You'll not be
the only one.

Andreas

2022-04-27 11:33:47

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <[email protected]> wrote:
>
> Also, note that we're fighting with a rare case of data corruption that
> seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> page fault deadlocks for buffered I/O"; merged in v5.16). The
> corruption seems to occur in gfs2_file_buffered_write() when
> fault_in_iov_iter_readable() is involved. We do end up with data that's
> written at an offset, as if after a fault-in, the position in the iocb
> got out of sync with the position in the iov_iter. The user buffer the
> iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned. The code all seems correct, and we haven't been able to figure
> out what's going on so far.

So this may be stupid, but looking at gfs2_file_direct_write(), I see
what looks like two different obvious bugs.

This looks entirely wrong:

if (ret > 0)
read = ret;

because this code is *repeated*.

I think it should use "read += ret;" so that if we have a few
successful reads, they add up.

And then at the end:

if (ret < 0)
return ret;
return read;

looks wrong too, since maybe the *last* iteration failed, but an
earlier succeeded, so I think it should be

if (read)
return read;
return ret;

But hey, what do I know. I say "looks like obvious bugs", but I don't
really know the code, and I may just be completely full of sh*t.

The reason I think I'm full of sh*t is that you say that the problem
occurs in gfs2_file_buffered_write(), not in that
gfs2_file_direct_write() case.

And the *buffered* case seems to get both of those "obvious bugs" right.

So I must be wrong, but I have to say, that looks odd to me.

Now hopefully somebody who knows the code will explain to me why I'm
wrong, and in the process go "Duh, but.." and see what's up.

Linus

2022-04-27 21:34:40

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher
<[email protected]> wrote:
>
> I wonder if this could be documented in the read and write manual
> pages. Or would that be asking too much?

I don't think it would be asking too much, since it's basically just
describing what Linux has always done in all the major filesystems.

Eg look at filemap_read(), which is basically the canonical read
function, and note how it doesn't take a single lock at that level.

We *do* have synchronization at a page level, though, ie we've always
had that page-level "uptodate" bit, of course (ok, so "always" isn't
true - back in the distant past it was the 'struct buffer_head' that
was the synchronization point).

That said, even that is not synchronizing against "new writes", but
only against "new creations" (which may, of course, be writers, but is
equally likely to be just reading the contents from disk).

That said:

(a) different filesystems can and will do different things.

Not all filesystems use filemap_read() at all, and even the ones that
do often have their own wrappers. Such wrappers *can* do extra
serialization, and have their own rules. But ext4 does not, for
example (see ext4_file_read_iter()).

And as mentioned, I *think* XFS honors that old POSIX rule for
historical reasons.

(b) we do have *different* locking

for example, these days we actually serialize properly on the
file->f_pos, which means that a certain *class* of read/write things
are atomic wrt each other, because we actually hold that f_pos lock
over the whole operation and so if you do file reads and writes using
the same file descriptor, they'll be disjoint.

That, btw, hasn't always been true. If you had multiple threads using
the same file pointer, I think we used to get basically random
results. So we have actually strengthened our locking in this area,
and made it much better.

But note how even if you have the same file descriptor open, and then
do pread/pwrite, those can and will happen concurrently.
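That split is easy to observe from user space: pread()/pwrite() take an explicit offset and leave the shared file offset alone (a small standalone demo, not taken from the thread):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")            # shared offset is now 11
    os.lseek(fd, 0, os.SEEK_SET)            # rewind the shared offset
    first = os.pread(fd, 5, 6)              # positioned read at offset 6
    os.pwrite(fd, b"W", 6)                  # positioned write at offset 6
    offset_after = os.lseek(fd, 0, os.SEEK_CUR)
    second = os.pread(fd, 5, 6)
finally:
    os.close(fd)
    os.unlink(path)

assert first == b"world"
assert second == b"World"
assert offset_after == 0    # pread/pwrite never moved the shared offset
```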

And mmap accesses and modifications are obviously *always* concurrent,
even if the fault itself - but not the accesses - might end up being
serialized due to some filesystem locking implementation detail.

End result: the exact serialization is complex, depends on the
filesystem, and is just not really something that should be described
or even relied on (eg that f_pos serialization is something we do
properly now, but didn't necessarily do in the past, so ..)

Is it then worth pointing out one odd POSIX rule that basically nobody
but some very low-level filesystem people have ever heard about, and
that no version of Linux has ever conformed to in the main default
filesystems, and that no user has ever cared about?

Linus

2022-04-27 23:40:16

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 10:26 PM Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > I wonder if this could be documented in the read and write manual
> > pages. Or would that be asking too much?
>
> I don't think it would be asking too much, since it's basically just
> describing what Linux has always done in all the major filesystems.
>
> Eg look at filemap_read(), which is basically the canonical read
> function, and note how it doesn't take a single lock at that level.
>
> We *do* have synchronization at a page level, though, ie we've always
> had that page-level "uptodate" bit, of course (ok, so "always" isn't
> true - back in the distant past it was the 'struct buffer_head' that
> was the synchronization point).
>
> That said, even that is not synchronizing against "new writes", but
> only against "new creations" (which may, of course, be writers, but is
> equally likely to be just reading the contents from disk).
>
> That said:
>
> (a) different filesystems can and will do different things.
>
> Not all filesystems use filemap_read() at all, and even the ones that
> do often have their own wrappers. Such wrappers *can* do extra
> serialization, and have their own rules. But ext4 does not, for
> example (see ext4_file_read_iter()).
>
> And as mentioned, I *think* XFS honors that old POSIX rule for
> historical reasons.
>
> (b) we do have *different* locking
>
> for example, these days we actually serialize properly on the
> file->f_pos, which means that a certain *class* of read/write things
> are atomic wrt each other, because we actually hold that f_pos lock
> over the whole operation and so if you do file reads and writes using
> the same file descriptor, they'll be disjoint.
>
> That, btw, hasn't always been true. If you had multiple threads using
> the same file pointer, I think we used to get basically random
> results. So we have actually strengthened our locking in this area,
> and made it much better.
>
> But note how even if you have the same file descriptor open, and then
> do pread/pwrite, those can and will happen concurrently.
>
> And mmap accesses and modifications are obviously *always* concurrent,
> even if the fault itself - but not the accesses - might end up being
> serialized due to some filesystem locking implementation detail.
>
> End result: the exact serialization is complex, depends on the
> filesystem, and is just not really something that should be described
> or even relied on (eg that f_pos serialization is something we do
> properly now, but didn't necessarily do in the past, so ..)
>
> Is it then worth pointing out one odd POSIX rule that basically nobody
> but some very low-level filesystem people have ever heard about, and
> that no version of Linux has ever conformed to in the main default
> filesystems, and that no user has ever cared about?

Well, POSIX explicitly mentions those atomicity expectations, e.g.,
for read [1]:

"I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
means that all the bytes from a single operation that started out together
end up together, without interleaving from other I/O operations."

Users who hear about it from POSIX are led to assume that this atomic
behavior is "real", and the Linux man pages do nothing to rob them of
that illusion. They do document that the file offset aspect has been
fixed though, which only makes things more confusing.

So from that point of view, I think it would be worthwhile to mention
that most if not all filesystems ignore the "non-interleaving" aspect.

[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/read.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/write.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap02.html#tag_15_09_07

Thanks,
Andreas

2022-04-28 02:15:55

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 2:26 PM Andreas Gruenbacher <[email protected]> wrote:
>
> Well, POSIX explicitly mentions those atomicity expectations, e.g.,
> for read [1]:

Yes. I'm aware. And my point is that we've never done _that_ kind of atomicity.

It's also somewhat ambiguous what it actually means, since what it
then talks about is "all bytes that started out together ends
together" and "interleaving".

That all implies that it's about the *position* of the reads and
writes being atomic, not the *data* of the reads and writes.

That, btw, was something we honored even before we had the locking
around f_pos accesses - a read or write system call would get its own
local *copy* of the file position, the read or write would then do that
IO based on that copied position - so that things that "started out
together ends together" - and then after the operation is done it
would *update* the file position atomically.

Note that that is exactly so that data would end up "together". But it
would mean that two concurrent reads using the same file position
might read the *same* area of the file.

Which still honors that "the read is atomic wrt the range", but
obviously the actual value of "f_pos" is basically random after the
read (ie is it the end of the first read, or the end of the second
read?).

The same paragraph also explicitly mentions pipes and FIFOs, despite
an earlier paragraph dismissing them, which is all just a sign of
things being very confused.

Anyway, I'm not objecting very strenuously to making it very clear in
some documentation that this "data atomicity" is not what Linux has
ever done. If you do overlapping IO, you get what you deserve.

But I do have objections.

On one hand, it's not all that different from some of the other notes
we have in the man-pages (ie documenting that whole "just under 2GB"
limit on the read size, although that's actually using the wrong
constant: it's not 0x7ffff000 bytes, it's MAX_RW_COUNT, which is
"INT_MAX & PAGE_MASK" and that constant in the man-page is as such
only true on a system with 4kB page sizes)

BUT! I'm 100% convinced that NOBODY HAS EVER given the kind of
atomicity guarantees that you would see from reading that document as
a language-lawyer.

For example, that section "2.9.7 Thread Interactions with Regular File
Operations" says that "fstat()" is atomic wrt "write()", and that you
should see "all or nothing".

I *GUARANTEE* that no operating system ever has done that, and I
further claim that reading it the way you read it is not only against
reality, it's against sanity.

Example: if I do a big write to a file that I just created, do you
really want "fstat()" in another thread or process to not even be able
to see how the file grows as the write happens?

It's not what anybody has *EVER* done, I'm pretty sure.
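The observable behavior is trivial to check: st_size grows as data lands, rather than jumping from zero to the final size in one atomic step. A sketch, using several write() calls to stand in for one internally-chunked big write:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    sizes = []
    for _ in range(4):
        os.write(fd, b"x" * 4096)            # one chunk of the "big write"
        sizes.append(os.fstat(fd).st_size)   # fstat sees each intermediate size
finally:
    os.close(fd)
    os.unlink(path)

# the file visibly grows; there is no all-or-nothing jump
assert sizes == [4096, 8192, 12288, 16384]
```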

So I really think

(a) you are mis-reading the standard by attributing too strong logic
to paperwork that is English prose and not so exact

(b) documenting Linux as not doing what you are mis-reading it for is
only encouraging others to mis-read it too

The whole "arbitrary writes have to be all-or-nothing wrt all other
system calls" is simply not realistic, and has never been. Not just
not in Linux, but in *ANY* operating system that POSIX was meant to
describe.

And equally importantly: if some crazy person were to actually try to
implement such "true atomicity" things, the end result would be
objectively worse. Because you literally *want* to see a big write()
updating the file length as the write happens.

The fact that the standard then doesn't take those kinds of details
into account is simply because the standard isn't meant to be read as
a language lawyer, but as a "realistically .."

Linus

2022-04-28 04:59:52

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds
<[email protected]> wrote:
>
> So I really think
>
> (a) you are mis-reading the standard by attributing too strong logic
> to paperwork that is English prose and not so exact
>
> (b) documenting Linux as not doing what you are mis-reading it for is
> only encouraging others to mis-read it too
>
> The whole "arbitrary writes have to be all-or-nothing wrt all other
> system calls" is simply not realistic, and has never been. Not just
> not in Linux, but in *ANY* operating system that POSIX was meant to
> describe.

Side note: a lot of those "atomic" things in that documentation have
come from a history of signal handling atomicity issues, and from all
the issues people had with (a) user-space threading implementations
and (b) emulation layers from non-Unixy environments.

So when they say that things like "rename()" has to be all-or-nothing,
it's to clarify that you can't emulate it as a "link and delete
original" kind of operation (which old UNIX *did* do) and claim to be
POSIX.

Because while the end result of rename() and link()+unlink() might be
similar, people did rely on that whole "use rename as a way to create
an atomic marker in the filesystem" (which is a very traditional UNIX
pattern).

So "rename()" has to be atomic, and the legacy behavior of link+unlink
is not valid in POSIX.

Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
because that doesn't work if somebody else is doing another "pread()"
on the same file descriptor concurrently.

Again, people *did* implement exactly those kinds of implementations
of "pread()", and yes, they were broken for both signals and for
threading.

So there's "atomicity" and then there is "atomicity".

That "all or nothing" can be a very practical thing to describe
*roughly* how it must work on a higher level, or it can be a
theoretical "transactional" thing that works literally like a database
where the operation happens in full and you must not see any
intermediate state.

And no, "write()" and friends have never ever been about some
transactional operation where you can't see how the file grows as it
is being written to. That kind of atomicity has simply never existed,
not even in theory.

So when you see POSIX saying that a "read()" system call is "atomic",
you should *not* see it as a transaction thing, but see it in the
historical context of "people used to do threading libraries in user
space, and since they didn't want a big read() to block all other
threads, they'd split it up into many smaller reads and now another
thread *also* doing 'read()' system calls would see the data it read
being not one contiguous region, but multiple regions where the file
position changed in the middle".

Similarly, a "read()" system call will not be interrupted by a signal
in the middle, where the signal handler would do a "lseek()" or
another "read()", and now the original "read()" data suddenly is
affected.

That's why things like that whole "f_pos is atomic" are a big deal.

Because there literally were threading libraries (and badly emulated
environments) where that *WASN'T* the case, and _that_ is why POSIX
then talks about it.

So think of POSIX not as some hard set of "this is exactly how things
work and we describe every detail".

Instead, treat it a bit like historians treat Herodotus - interpreting
his histories by taking the issues of the time into account. POSIX is
trying to clarify and document the problems of the time it was
written, and taking other things for granted.

Linus

2022-04-29 00:15:37

by pr-tracker-bot

Subject: Re: Re: [GIT PULL] gfs2 fix

The pull request you sent on Thu, 28 Apr 2022 15:26:58 +0200:

> https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4a2316a1eda4ef3ced18c7f08f7cb3726bcae44b

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2022-04-29 00:48:21

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds
<[email protected]> wrote:
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not somethign else" would be good.

Oh, and as I do that, it strikes me: What platform do you see the problem on?

Because the code for that function is very different if HIGHMEM is
enabled, so if you see this on x86-32 but not x86-64, for example,
that is relevant.

I *assume* nobody uses x86-32 any more, but just checking...

Linus

2022-04-29 01:05:36

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <[email protected]> wrote:
>
> The data corruption we've been getting unfortunately didn't have to do
> with lock contention (we already knew that); it still occurs. I'm
> running out of ideas on what to try there.

Hmm.

I don't see the bug, but I do have a suggestion on something to try.

In particular, you said the problem started with commit 00bfe02f4796
("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

And to me, I see two main things that are going on

(a) the obvious "calling generic IO functions with pagefault disabled" thing

(b) the "allow demotion" thing

And I wonder if you could at least pinpoint which of the cases it is
that triggers it.

So I'd love to see you try three things:

(1) just remove the "allow demotion" cases.

This will re-introduce the deadlock the commit is trying to fix,
but that's such a special case that I assume you can run your
test-suite that shows the problem even without that fix in place?

This would just pinpoint whether it's due to some odd locking issue or not.

Honestly, from how you describe the symptoms, I don't think (1) is the
cause, but I think making sure is good.

It sounds much more likely that it's one of those generic vfs
functions that screws up when a page fault happens and it gets a
partial result instead of handling the fault.

Which gets us to

(2) remove the pagefault_disable/enable() around just the
generic_file_read_iter() case in gfs2_file_read_iter().

and

(3) finally, remove the pagefault_disable/enable() around the
iomap_file_buffered_write() case in gfs2_file_buffered_write()

Yeah, yeah, you say it's just the read that fails, but humor me on
(3), just in case it's an earlier write in your test-suite and the
read just then uncovered it.

But I put it as (3) so that you'd do the obvious (2) case first, and
narrow it down (ie if (1) still shows the bug, then do (2), and if
that fixes the bug it will be fairly well pinpointed to
generic_file_read_iter().

Looking around, gfs2 is the only thing that obviously calls
generic_file_read_iter() with pagefaults disabled, so it does smell
like filemap_read() might have some issue, but the only thing that
does is basically that

copied = copy_folio_to_iter(folio, offset, bytes, iter);

which should just become copy_page_to_iter_iovec(), which you'd hope
would get things right.

But it would be good to just narrow things down a bit.

I'll look at that copy_page_to_iter_iovec() some more regardless, but
doing that "let's double-check it's not something else" would be good.

Linus

2022-04-29 04:15:26

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 2:00 AM Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds <[email protected]> wrote:
> >
> > So I really think
> >
> > (a) you are mis-reading the standard by attributing too strong logic
> > to paperwork that is English prose and not so exact
> >
> > (b) documenting Linux as not doing what you are mis-reading it for is
> > only encouraging others to mis-read it too
> >
> > The whole "arbitrary writes have to be all-or-nothing wrt all other
> > system calls" is simply not realistic, and has never been. Not just
> > not in Linux, but in *ANY* operating system that POSIX was meant to
> > describe.
>
> Side note: a lot of those "atomic" things in that documentation have
> come from a history of signal handling atomicity issues, and from all
> the issues people had with (a) user-space threading implementations
> and (b) emulation layers from non-Unixy environments.
>
> So when they say that things like "rename()" has to be all-or-nothing,
> it's to clarify that you can't emulate it as a "link and delete
> original" kind of operation (which old UNIX *did* do) and claim to be
> POSIX.
>
> Because while the end result of rename() and link()+unlink() might be
> similar, people did rely on that whole "use rename as a way to create
> an atomic marker in the filesystem" (which is a very traditional UNIX
> pattern).
>
> So "rename()" has to be atomic, and the legacy behavior of link+unlink
> is not valid in POSIX.
>
> Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
> because that doesn't work if somebody else is doing another "pread()"
> on the same file descriptor concurrently.
>
> Again, people *did* implement exactly those kinds of implementations
> of "pread()", and yes, they were broken for both signals and for
> threading.
>
> So there's "atomicity" and then there is "atomicity".
>
> That "all or nothing" can be a very practical thing to describe
> *roughly* how it must work on a higher level, or it can be a
> theoretical "transactional" thing that works literally like a database
> where the operation happens in full and you must not see any
> intermediate state.
>
> And no, "write()" and friends have never ever been about some
> transactional operation where you can't see how the file grows as it
> is being written to. That kind of atomicity has simply never existed,
> not even in theory.
>
> So when you see POSIX saying that a "read()" system call is "atomic",
> you should *not* see it as a transaction thing, but see it in the
> historical context of "people used to do threading libraries in user
> space, and since they didn't want a big read() to block all other
> threads, they'd split it up into many smaller reads and now another
> thread *also* doing 'read()' system calls would see the data it read
> being not one contiguous region, but multiple regions where the file
> position changed in the middle".
>
> Similarly, a "read()" system call will not be interrupted by a signal
> in the middle, where the signal handler would do a "lseek()" or
> another "read()", and now the original "read()" data suddenly is
> affected.
>
> That's why things like that whole "f_pos is atomic" are a big deal.
>
> Because there literally were threading libraries (and badly emulated
> environments) where that *WASN'T* the case, and _that_ is why POSIX
> then talks about it.
>
> So think of POSIX not as some hard set of "this is exactly how things
> work and we describe every detail".
>
> Instead, treat it a bit like historians treat Herodotus - interpreting
> his histories by taking the issues of the time into account. POSIX is
> trying to clarify and document the problems of the time it was
> written, and taking other things for granted.

Okay fine, thanks for elaborating.

Would you mind pulling the following fix to straighten this out?

The data corruption we've been getting unfortunately didn't have to do
with lock contention (we already knew that); it still occurs. I'm
running out of ideas on what to try there.

Thanks a lot,
Andreas

--

The following changes since commit 4fad37d595b9d9a2996467d780cb2e7a1b08b2c0:

Merge tag 'gfs2-v5.18-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2022-04-26 11:17:18 -0700)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix2

for you to fetch changes up to 296abc0d91d8b65d42224dd33452ace14491ad08:

gfs2: No short reads or writes upon glock contention (2022-04-28 15:14:48 +0200)

----------------------------------------------------------------
gfs2 fix

- No short reads or writes upon glock contention

----------------------------------------------------------------
Andreas Gruenbacher (1):
gfs2: No short reads or writes upon glock contention

fs/gfs2/file.c | 4 ----
1 file changed, 4 deletions(-)

2022-04-29 06:58:40

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 7:09 PM Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > The data corruption we've been getting unfortunately didn't have to do
> > with lock contention (we already knew that); it still occurs. I'm
> > running out of ideas on what to try there.
>
> Hmm.
>
> I don't see the bug, but I do have a suggestion on something to try.
>
> In particular, you said the problem started with commit 00bfe02f4796
> ("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

Yes, but note that it's gfs2_file_buffered_write() that fails. When
the pagefault_disable/enable() around iomap_file_buffered_write() is
removed, the corruption goes away.
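The suspected failure mode can be modeled in miniature (plain Python with hypothetical names, not the kernel code): when a copy stops short because page faults are disabled, the resume position and the iterator's consumed count must advance in lockstep, or the next chunk lands at the wrong offset.

```python
class Iter:
    """Toy stand-in for an iov_iter: a buffer plus a consumed count."""
    def __init__(self, buf):
        self.buf = buf
        self.consumed = 0

    def copy_in(self, limit):
        """Copy up to 'limit' bytes; limit 0 models a disabled page fault."""
        take = min(limit, len(self.buf) - self.consumed)
        chunk = self.buf[self.consumed:self.consumed + take]
        self.consumed += take
        return chunk

def buffered_write(dst, pos, it):
    limit = 0                       # the first copy "faults" immediately
    while it.consumed < len(it.buf):
        chunk = it.copy_in(limit)
        if not chunk:
            limit = len(it.buf)     # model fault_in succeeding; retry
            continue
        dst[pos:pos + len(chunk)] = chunk
        pos += len(chunk)           # must stay in sync with it.consumed
    return bytes(dst)

out = buffered_write(bytearray(8), 0, Iter(b"abcdefgh"))
assert out == b"abcdefgh"           # a desync here would shift the data
```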

> And to me, I see two main things that are going on
>
> (a) the obvious "calling generic IO functions with pagefault disabled" thing
>
> (b) the "allow demotion" thing
>
> And I wonder if you could at least pinpoint which of the cases it is
> that triggers it.
>
> So I'd love to see you try three things:
>
> (1) just remove the "allow demotion" cases.
>
> This will re-introduce the deadlock the commit is trying to fix,
> but that's such a special case that I assume you can run your
> test-suite that shows the problem even without that fix in place?
>
> This would just pinpoint whether it's due to some odd locking issue or not.
>
> Honestly, from how you describe the symptoms, I don't think (1) is the
> cause, but I think making sure is good.
>
> It sounds much more likely that it's one of those generic vfs
> functions that screws up when a page fault happens and it gets a
> partial result instead of handling the fault.

The test should run just fine without allowing demotion. I'll try (1),
but I don't expect the outcome to change.

> Which gets us to
>
> (2) remove the pagefault_disable/enable() around just the
> generic_file_read_iter() case in gfs2_file_read_iter().
>
> and
>
> (3) finally, remove the pagefault_disable/enable() around the
> iomap_file_buffered_write() case in gfs2_file_buffered_write()
>
> Yeah, yeah, you say it's just the read that fails, but humor me on
> (3), just in case it's an earlier write in your test-suite and the
> read just then uncovered it.
>
> But I put it as (3) so that you'd do the obvious (2) case first, and
> narrow it down (ie if (1) still shows the bug, then do (2), and if
> that fixes the bug it will be fairly well pinpointed to
> generic_file_read_iter().

As mentioned above, we already did (3) and it didn't help. I'll do (1)
now, and then (2).

> Looking around, gfs2 is the only thing that obviously calls
> generic_file_read_iter() with pagefaults disabled, so it does smell
> like filemap_read() might have some issue, but the only thing that
> does is basically that
>
> copied = copy_folio_to_iter(folio, offset, bytes, iter);
>
> which should just become copy_page_to_iter_iovec(), which you'd hope
> would get things right.
>
> But it would be good to just narrow things down a bit.
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not something else" would be good.

We've actually been running most of our experiments on a 5.14-based
kernel with a plethora of backports, so pre-folio. Sorry I forgot to
mention that. I'll reproduce with mainline as well.

Thanks,
Andreas

2022-04-29 16:15:19

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 7:17 PM Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds <[email protected]> wrote:
> >
> > I'll look at that copy_page_to_iter_iovec() some more regardless, but
> > doing that "let's double-check it's not something else" would be good.
>
> Oh, and as I do that, it strikes me: What platform do you see the problem on?
>
> Because the code for that function is very different if HIGHMEM is
> enabled, so if you see this on x86-32 but not x86-64, for example,
> that is relevant.
>
> I *assume* nobody uses x86-32 any more, but just checking...

This is on x86_64.

Andreas

2022-05-03 00:10:10

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Mon, May 2, 2022 at 8:58 PM Linus Torvalds
<[email protected]> wrote:
> On Mon, May 2, 2022 at 11:31 AM Linus Torvalds <[email protected]> wrote:
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different.
>
> Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
> as well that I brought in the iomap people.
>
> And the iomap code has many different cases. Some of them do
>
> if (unlikely(copied < len && !folio_test_uptodate(folio)))
> return 0;
>
> to force the whole IO to be re-done (looks sane - that's the "the
> underlying folio wasn't uptodate, because we expected the write to
> make it so").
>
> And that might not have happened before, but it looks like gfs2 does
> actually try to deal with that case.
>
> But since Andreas said originally that the IO wasn't aligned, I don't
> think that "not uptodate" case is what is going on, and it's more
> about some "partial write in the middle of a buffer succeeded"
>
> And the code also has things like
>
> if (ret < len)
> iomap_write_failed(iter->inode, pos, len);
>
> which looks very wrong - it's not that the write failed, it's just
> incomplete because it was done with page faults disabled. It seems to
> try to do some page cache truncation based on the original 'len', but
> not taking the successful part into account. Which all sounds
> horrifically wrong.
>
> But I don't know the code well enough to really judge. It just makes
> me uncomfortable, and I do suspect this code may be quite buggy if the
> copy of the full 'len' doesn't succeed.

This has thrown me off in the past as well; it should be changed to
iomap_write_failed(iter->inode, pos + ret, len - ret) for legibility.
However, iomap_write_failed() only truncates past EOF and is preceded
by i_size_write(iter->inode, pos + ret) here, so it's not strictly a
bug.

> Again, the patch I sent only _hides_ any issues and makes them
> practically impossible to see. It doesn't really _fix_ anything, since
> - as mentioned - regardless of fault_in_iov_iter_readable()
> succeeding, racing with page-out could then cause the later
> copy_page_from_iter_atomic() to have a partial copy anyway.

Indeed. Let's see what we'll get with it.

In the meantime, we've reproduced with 5.18-rc4 + commit 296abc0d91d8
("gfs2: No short reads or writes upon glock contention"), and it still
has the data corruption.

> And hey, maybe there's something entirely different going on, and my
> "Heureka! It might be explained by that partial write_end that
> generally didn't happen before" is only my shouting at windmills.
>
> Linus
>

2022-05-03 00:41:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Mon, May 2, 2022 at 11:31 AM Linus Torvalds
<[email protected]> wrote:
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different.

Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
as well that I brought in the iomap people.

And the iomap code has many different cases. Some of them do

if (unlikely(copied < len && !folio_test_uptodate(folio)))
return 0;

to force the whole IO to be re-done (looks sane - that's the "the
underlying folio wasn't uptodate, because we expected the write to
make it so").

And that might not have happened before, but it looks like gfs2 does
actually try to deal with that case.

But since Andreas said originally that the IO wasn't aligned, I don't
think that "not uptodate" case is what is going on, and it's more
about some "partial write in the middle of a buffer succeeded"

And the code also has things like

if (ret < len)
iomap_write_failed(iter->inode, pos, len);

which looks very wrong - it's not that the write failed, it's just
incomplete because it was done with page faults disabled. It seems to
try to do some page cache truncation based on the original 'len', but
not taking the successful part into account. Which all sounds
horrifically wrong.

But I don't know the code well enough to really judge. It just makes
me uncomfortable, and I do suspect this code may be quite buggy if the
copy of the full 'len' doesn't succeed.

Again, the patch I sent only _hides_ any issues and makes them
practically impossible to see. It doesn't really _fix_ anything, since
- as mentioned - regardless of fault_in_iov_iter_readable()
succeeding, racing with page-out could then cause the later
copy_page_from_iter_atomic() to have a partial copy anyway.

And hey, maybe there's something entirely different going on, and my
"Heureka! It might be explained by that partial write_end that
generally didn't happen before" is only my shouting at windmills.

Linus

2022-05-03 00:55:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher
<[email protected]> wrote:
>
> Yes, but note that it's gfs2_file_buffered_write() that fails. When
> the pagefault_disable/enable() around iomap_file_buffered_write() is
> removed, the corruption goes away.

I looked some more at this on and off, and ended up even more confused.

For some reason, I'd mostly looked at the read case, because I had
mis-read some of your emails and thought it was the buffered reads
that caused problems.

Then I went back more carefully, and realized you had always said
gfs2_file_buffered_write() was where the issues happened, and looked
at that path more, and that confused me even *MORE*.

Because that case has always done the copy from user space with page
faults disabled, because of the traditional deadlock with reading from
user space while holding the page lock on the target page cache page.

So that is not really about the new deadlock with filesystem locks,
that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
for buffered I/O").

So now that I'm looking at the right function (maybe) I'm going "huh",
because it's none of the complex cases that would seem to fail, it's
literally just the fault_in_iov_iter_readable() that we've always done
in iomap_write_iter() that presumably starts failing.

But *that* old code seems bogus too. It's doing

if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
status = -EFAULT;
break;
}

which on the face of it is sane: it's saying "if we can't fault in any
bytes, then stop trying".

And it's good, and correct, but it does leave one case open.

Because what if the result is "we can fault things in _partially_"?

The code blithely goes on and tries to do the whole 'bytes' range _anyway_.

Now, with a bug-free filesystem, this really shouldn't matter, since
the later copy_page_from_iter_atomic() thing should then DTRT anyway,
but this does mean that one fundamental thing that that commit
00bfe02f4796 changed is that it basically disabled that
fault_in_iov_iter_readable() that *used* to fault in the whole range,
and now potentially only faults in a small area.

That, in turn, means that in practice it *used* to do "write_end()"
with a fully successful range, ie when it did that

status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);

then "bytes" and "copied" were the same.

But now that commit 00bfe02f4796 added the "disable_pagefault()"
around the whole thing, fault_in_iov_iter_readable() will easily fail
half-way instead of bringing the next page in, and then that
->write_begin() to ->write_end() sequence will see the copy in the
middle failing half-way too, and you'll have that write_end()
condition with the write _partially_ succeeding.

Which is the complex case for write_end() that you practically
speaking never saw before (it *could* happen with a race with swap-out
or similar, but it was not really something you could trigger in real
life).

And I suspect this is what bites you with gfs2.

To *test* that hypothesis, how about you try this attached patch? The
generic_perform_write() function in mm/filemap.c has the same exact
pattern, but as mentioned, a filesystem really needs to be able to
handle the partial write_end() case, so it's not a *bug* in that code,
but it might be triggering a bug in gfs2.

And gfs2 only uses the iomap_write_iter() case, I think. So that's the
only case this attached patch changes.

Again - I think the unpatched iomap_write_iter() code is fine, but I
think it may be what then triggers the real bug in gfs2. So this patch
is not wrong per se, but this patch is basically a "hide the problem"
patch, and it would be very interesting to hear if it does indeed fix
your test-case.

Because that would pinpoint exactly what the bug is.

I'm adding Christoph and Darrick as iomap maintainers here to the
participants (and Dave Chinner in case he's also the temporary
maintainer because Darrick is doing reviews) not because they
necessarily care, but just because this test-patch obviously involves
the iomap code.

NOTE! This patch is entirely untested. I also didn't actually yet go
look at what gfs2 does when 'bytes' and 'copied' are different. But
since I finally think I figured out what might be going on, I decided
I'd send this out sooner rather than later.

Because this is the first thing that makes me go "Aaahh.. This might
explain it".

Linus


Attachments:
patch.diff (906.00 B)

2022-05-03 11:23:11

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Mon, May 2, 2022 at 8:32 PM Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > removed, the corruption goes away.
>
> I looked some more at this on and off, and ended up even more confused.
>
> For some reason, I'd mostly looked at the read case, because I had
> mis-read some of your emails and thought it was the buffered reads
> that caused problems.
>
> Then I went back more carefully, and realized you had always said
> gfs2_file_buffered_write() was where the issues happened, and looked
> at that path more, and that confused me even *MORE*.
>
> Because that case has always done the copy from user space with page
> faults disabled, because of the traditional deadlock with reading from
> user space while holding the page lock on the target page cache page.
>
> So that is not really about the new deadlock with filesystem locks,
> that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> for buffered I/O").
>
> So now that I'm looking at the right function (maybe) I'm going "huh",
> because it's none of the complex cases that would seem to fail, it's
> literally just the fault_in_iov_iter_readable() that we've always done
> in iomap_write_iter() that presumably starts failing.
>
> But *that* old code seems bogus too. It's doing
>
> if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> status = -EFAULT;
> break;
> }
>
> which on the face of it is sane: it's saying "if we can't fault in any
> bytes, then stop trying".
>
> And it's good, and correct, but it does leave one case open.
>
> Because what if the result is "we can fault things in _partially_"?
>
> The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
>
> Now, with a bug-free filesystem, this really shouldn't matter, since
> the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> but this does mean that one fundamental thing that that commit
> 00bfe02f4796 changed is that it basically disabled that
> fault_in_iov_iter_readable() that *used* to fault in the whole range,
> and now potentially only faults in a small area.
>
> That, in turn, means that in practice it *used* to do "write_end()"
> with a fully successful range, ie when it did that
>
> status = a_ops->write_end(file, mapping, pos, bytes, copied,
> page, fsdata);
>
> then "bytes" and "copied" were the same.
>
> But now that commit 00bfe02f4796 added the "disable_pagefault()"
> around the whole thing, fault_in_iov_iter_readable() will easily fail
> half-way instead of bringing the next page in, and then that
> ->write_begin() to ->write_end() sequence will see the copy in the
> middle failing half-way too, and you'll have that write_end()
> condition with the write _partially_ succeeding.
>
> Which is the complex case for write_end() that you practically
> speaking never saw before (it *could* happen with a race with swap-out
> or similar, but it was not really something you could trigger in real
> life).
>
> And I suspect this is what bites you with gfs2.
>
> To *test* that hypothesis, how about you try this attached patch? The
> generic_perform_write() function in mm/filemap.c has the same exact
> pattern, but as mentioned, a filesystem really needs to be able to
> handle the partial write_end() case, so it's not a *bug* in that code,
> but it might be triggering a bug in gfs2.
>
> And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> only case this attached patch changes.
>
> Again - I think the unpatched iomap_write_iter() code is fine, but I
> think it may be what then triggers the real bug in gfs2. So this patch
> is not wrong per se, but this patch is basically a "hide the problem"
> patch, and it would be very interesting to hear if it does indeed fix
> your test-case.

We still get data corruption with the patch applied. The
WARN_ON_ONCE(!bytes) doesn't trigger.

As an additional experiment, I've added code to check the iterator
position that iomap_file_buffered_write() returns, and it's all
looking good as well: an iov_iter_advance(orig_from, written) from the
original position always gets us to the same iterator.

This points at gfs2 getting things wrong after a short write, for
example, marking a page / folio uptodate that isn't. But the uptodate
handling happens at the iomap layer, so this doesn't leave me with an
immediate suspect.

We're on filesystems with block size == page size, so none of the
struct iomap_page uptodate handling should be involved, either.

> Because that would pinpoint exactly what the bug is.
>
> I'm adding Christoph and Darrick as iomap maintainers here to the
> participants (and Dave Chinner in case he's also the temporary
> maintainer because Darrick is doing reviews) not because they
> necessarily care, but just because this test-patch obviously involves
> the iomap code.
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different. But
> since I finally think I figured out what might be going on, I decided
> I'd send this out sooner rather than later.
>
> Because this is the first thing that makes me go "Aaahh.. This might
> explain it".
>
> Linus

Thanks,
Andreas

2022-05-03 17:34:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 9:41 AM Andreas Gruenbacher <[email protected]> wrote:
>
> That's happening in iomap_file_buffered_write() and iomap_iter():
>
> while ((ret = iomap_iter(&iter, ops)) > 0)
> iter.processed = iomap_write_iter(&iter, i);
>
> Here, iomap_write_iter() returns how much progress it has made, which
> is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
> then updates iter.pos and iter.len based on iter.processed.

Ahh. I even had that file open in my editor in another window, but missed it.

So much for that theory.

Linus

2022-05-03 21:26:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <[email protected]> wrote:
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.

Oh well. I was so sure that I'd finally found something.. That partial
write case has had bugs before.

> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.

Yeah, I've looked at the iterator parts (and iov_iter_revert() in
particular) multiple times, because that too is an area where we've
had bugs before.

That too may be easy to get wrong, but I couldn't for the life of me
see any issues there.

> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.

Yeah, the uptodate setting looked safe, particularly with that "if we
copied less than we thought we would, and it wasn't uptodate, just
claim we didn't do anything at all".

That said, I now have a *new* suspect: the 'iter->pos' handling in
iomap_write_iter().

In particular, let's look at iomap_file_buffered_write(), which does:

while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_write_iter(&iter, i);

and then look at what happens to iter.pos here.

iomap_write_iter() does this:

loff_t pos = iter->pos;
...
pos += status;

but it never seems to write the updated position back to the iterator.

So what happens next time iomap_write_iter() gets called?

This looks like such a huge bug that I'm probably missing something,
but I wonder if this is normally hidden by the fact that usually
iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
loop, it's actually effectively only called once.

Except if it gets a short write due to an unhandled page fault..

Am I entirely blind, and that 'iter.pos' is updated somewhere and I
just missed it?

Or is this maybe the reason for it all?

Linus

2022-05-03 23:17:04

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <[email protected]> wrote:
> On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <[email protected]> wrote:
> > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <[email protected]> wrote:
> > >
> > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > removed, the corruption goes away.
> >
> > I looked some more at this on and off, and ended up even more confused.
> >
> > For some reason, I'd mostly looked at the read case, because I had
> > mis-read some of your emails and thought it was the buffered reads
> > that caused problems.
> >
> > Then I went back more carefully, and realized you had always said
> > gfs2_file_buffered_write() was where the issues happened, and looked
> > at that path more, and that confused me even *MORE*.
> >
> > Because that case has always done the copy from user space with page
> > faults disabled, because of the traditional deadlock with reading from
> > user space while holding the page lock on the target page cache page.
> >
> > So that is not really about the new deadlock with filesystem locks,
> > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > for buffered I/O").
> >
> > So now that I'm looking at the right function (maybe) I'm going "huh",
> > because it's none of the complex cases that would seem to fail, it's
> > literally just the fault_in_iov_iter_readable() that we've always done
> > in iomap_write_iter() that presumably starts failing.
> >
> > But *that* old code seems bogus too. It's doing
> >
> > if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> > status = -EFAULT;
> > break;
> > }
> >
> > which on the face of it is sane: it's saying "if we can't fault in any
> > bytes, then stop trying".
> >
> > And it's good, and correct, but it does leave one case open.
> >
> > Because what if the result is "we can fault things in _partially_"?
> >
> > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
> >
> > Now, with a bug-free filesystem, this really shouldn't matter, since
> > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > but this does mean that one fundamental thing that that commit
> > 00bfe02f4796 changed is that it basically disabled that
> > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > and now potentially only faults in a small area.
> >
> > That, in turn, means that in practice it *used* to do "write_end()"
> > with a fully successful range, ie when it did that
> >
> > status = a_ops->write_end(file, mapping, pos, bytes, copied,
> > page, fsdata);
> >
> > then "bytes" and "copied" were the same.
> >
> > But now that commit 00bfe02f4796 added the "disable_pagefault()"
> > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > half-way instead of bringing the next page in, and then that
> > ->write_begin() to ->write_end() sequence will see the copy in the
> > middle failing half-way too, and you'll have that write_end()
> > condition with the write _partially_ succeeding.
> >
> > Which is the complex case for write_end() that you practically
> > speaking never saw before (it *could* happen with a race with swap-out
> > or similar, but it was not really something you could trigger in real
> > life).
> >
> > And I suspect this is what bites you with gfs2.
> >
> > To *test* that hypothesis, how about you try this attached patch? The
> > generic_perform_write() function in mm/filemap.c has the same exact
> > pattern, but as mentioned, a filesystem really needs to be able to
> > handle the partial write_end() case, so it's not a *bug* in that code,
> > but it might be triggering a bug in gfs2.
> >
> > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > only case this attached patch changes.
> >
> > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > think it may be what then triggers the real bug in gfs2. So this patch
> > is not wrong per se, but this patch is basically a "hide the problem"
> > patch, and it would be very interesting to hear if it does indeed fix
> > your test-case.
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.
>
> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.
>
> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.
>
> We're on filesystems with block size == page size, so none of the
> struct iomap_page uptodate handling should be involved, either.

The rounding around the hole punching in gfs2_iomap_end() looks wrong.
I'm trying a fix now.

> > Because that would pinpoint exactly what the bug is.
> >
> > I'm adding Christoph and Darrick as iomap maintainers here to the
> > participants (and Dave Chinner in case he's also the temporary
> > maintainer because Darrick is doing reviews) not because they
> > necessarily care, but just because this test-patch obviously involves
> > the iomap code.
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different. But
> > since I finally think I figured out what might be going on, I decided
> > I'd send this out sooner rather than later.
> >
> > Because this is the first thing that makes me go "Aaahh.. This might
> > explain it".
> >
> > Linus
>
> Thanks,
> Andreas

2022-05-04 08:50:12

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 3:30 PM Andreas Gruenbacher <[email protected]> wrote:
> On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <[email protected]> wrote:
> > On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <[email protected]> wrote:
> > > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <[email protected]> wrote:
> > > >
> > > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > > removed, the corruption goes away.
> > >
> > > [...]
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
> >
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
> >
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
> >
> > We're on filesystems with block size == page size, so none of the
> > struct iomap_page uptodate handling should be involved, either.
>
> The rounding around the hole punching in gfs2_iomap_end() looks wrong.
> I'm trying a fix now.

More testing still ongoing, but the following patch seems to fix the
data corruption. Provided that things go well, I'll send a pull request
tomorrow.

Thanks a lot for the huge amount of help!

Andreas

-

gfs2: Short write fix

When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.

To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation. It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.

In addition, use round_up() for computing the start of the unused range.

Signed-off-by: Andreas Gruenbacher <[email protected]>

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 39080b2d6cf8..6abd044a176d 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1153,14 +1153,12 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,

 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
 		/* Deallocate blocks that were just allocated. */
-		loff_t blockmask = i_blocksize(inode) - 1;
-		loff_t end = (pos + length) & ~blockmask;
+		loff_t hstart = round_up(pos + written, i_blocksize(inode));
+		loff_t hend = iomap->offset + iomap->length;

-		pos = (pos + written + blockmask) & ~blockmask;
-		if (pos < end) {
-			truncate_pagecache_range(inode, pos, end - 1);
-			punch_hole(ip, pos, end - pos);
-		}
+		truncate_pagecache_range(inode, hstart, hend - 1);
+		if (hstart < hend)
+			punch_hole(ip, hstart, hend - hstart);
 	}

 	if (unlikely(!written))



2022-05-04 13:30:08

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 6:19 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
>
> Oh well. I was so sure that I'd finally found something.. That partial
> write case has had bugs before.
>
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
>
> Yeah, I've looked at the iterator parts (and iov_iter_revert() in
> particular) multiple times, because that too is an area where we've
> had bugs before.
>
> That too may be easy to get wrong, but I couldn't for the life of me
> see any issues there.
>
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
>
> Yeah, the uptodate setting looked safe, particularly with that "if we
> copied less than we thought we would, and it wasn't uptodate, just
> claim we didn't do anything at all".
>
> That said, I now have a *new* suspect: the 'iter->pos' handling in
> iomap_write_iter().
>
> In particular, let's look at iomap_file_buffered_write(), which does:
>
>         while ((ret = iomap_iter(&iter, ops)) > 0)
>                 iter.processed = iomap_write_iter(&iter, i);
>
> and then look at what happens to iter.pos here.
>
> iomap_write_iter() does this:
>
>         loff_t pos = iter->pos;
>         ...
>         pos += status;
>
> but it never seems to write the updated position back to the iterator.
>
> So what happens next time iomap_write_iter() gets called?
>
> This looks like such a huge bug that I'm probably missing something,
> but I wonder if this is normally hidden by the fact that usually
> iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
> loop, it's actually effectively only called once.
>
> Except if it gets a short write due to an unhandled page fault..
>
> Am I entirely blind, and that 'iter.pos' is updated somewhere and I
> just missed it?

That's happening in iomap_file_buffered_write() and iomap_iter():

        while ((ret = iomap_iter(&iter, ops)) > 0)
                iter.processed = iomap_write_iter(&iter, i);

Here, iomap_write_iter() returns how much progress it has made, which
is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
then updates iter.pos and iter.len based on iter.processed.

Andreas


2022-05-04 13:44:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <[email protected]> wrote:
>
> More testing still ongoing, but the following patch seems to fix the
> data corruption.

Fingers crossed.

> + truncate_pagecache_range(inode, hstart, hend - 1);
> + if (hstart < hend)
> + punch_hole(ip, hstart, hend - hstart);

Why doesn't that "hstart < hend" condition cover both the truncate and
the hole punch?

Linus

2022-05-05 11:38:41

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Wed, May 4, 2022 at 12:42 AM Linus Torvalds
<[email protected]> wrote:
> On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > More testing still ongoing, but the following patch seems to fix the
> > data corruption.
>
> Fingers crossed.

It turns out that crossing fingers wasn't enough and we still get
corruption, but less frequently than before. We're going in the right
direction.

My working theory is that this is due to a subtle bug in the hole
punching done by gfs2_iomap_end() to get rid of unused blocks. With
the test case that fails, gfs2_iomap_end() is punching holes way more
often than I was expecting, and way more often than it should.
Remember that the test case is doing 32-MiB writes of a user buffer
that usually isn't entirely in memory. The first
iomap_file_buffered_write() allocates filesystem blocks for the entire
buffer, and when it finds that it could only do a partial write, it
frees a large part of those blocks. It will then call
fault_in_iov_iter_readable() for the next chunk, and the next call to
iomap_file_buffered_write() will then usually be able to write that
chunk entirely.

So it seems that we should always call fault_in_iov_iter_readable()
before calling into iomap_file_buffered_write(). This will probably
hide whatever is going wrong in punch_hole(), but we'll get to that
later ...

(Side note: the chunk size should be aligned to the page cache, not to
the iov_iter as in the current code.)

> > + truncate_pagecache_range(inode, hstart, hend - 1);
> > + if (hstart < hend)
> > + punch_hole(ip, hstart, hend - hstart);
>
> Why doesn't that "hstart < hend" condition cover both the truncate and
> the hole punch?

That was a leftover from a previous experiment in which I did the
truncate_pagecache_range() on the unaligned boundaries, which turned
out to be pointless. I'll clean that up.

Thanks,
Andreas