2022-04-27 11:17:27

by Andreas Gruenbacher

[permalink] [raw]
Subject: [GIT PULL] gfs2 fix

Hi Linus,

please consider pulling the following gfs2 fix for 5.18-rc5.

Also, note that we're fighting with a rare case of data corruption that
seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
page fault deadlocks for buffered I/O"; merged in v5.16). The
corruption seems to occur in gfs2_file_buffered_write() when
fault_in_iov_iter_readable() is involved. We do end up with data that's
written at an offset, as if after a fault-in, the position in the iocb
got out of sync with the position in the iov_iter. The user buffer the
iov_iter points at isn't page aligned, so the corruption also isn't page
aligned. The code all seems correct, and we haven't been able to figure
out what's going on so far.

Thanks,
Andreas

The following changes since commit af2d861d4cd2a4da5137f795ee3509e6f944a25b:

Linux 5.18-rc4 (2022-04-24 14:51:22 -0700)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix

for you to fetch changes up to e57f9af73d6b0ffb5f1aeaf6cec9a751dd8535c9:

gfs2: Don't re-check for write past EOF unnecessarily (2022-04-26 15:38:00 +0200)

----------------------------------------------------------------
gfs2 fix

- Only re-check for direct I/O writes past the end of the file after
re-acquiring the inode glock.

----------------------------------------------------------------
Andreas Gruenbacher (1):
gfs2: Don't re-check for write past EOF unnecessarily

fs/gfs2/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)


2022-04-27 11:21:58

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Tue, Apr 26, 2022 at 8:31 PM Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Also, note that we're fighting with a rare case of data corruption that
> > seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> > page fault deadlocks for buffered I/O"; merged in v5.16). The
> > corruption seems to occur in gfs2_file_buffered_write() when
> > fault_in_iov_iter_readable() is involved. We do end up with data that's
> > written at an offset, as if after a fault-in, the position in the iocb
> > got out of sync with the position in the iov_iter. The user buffer the
> > iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned. The code all seems correct, and we haven't been able to figure
> out what's going on so far.
>
> So this may be stupid, but looking at gfs2_file_direct_write(), I see
> what looks like two different obvious bugs.
>
> This looks entirely wrong:
>
> if (ret > 0)
> read = ret;
>
> because this code is *repeated*.
>
> I think it should use "read += ret;" so that if we have a few
> successful reads, they add up.

Btrfs has a comment in that place that reads:

/* No increment (+=) because iomap returns a cumulative value. */

That's so that we can complete the tail of an asynchronous write
asynchronously after a failed page fault and subsequent fault-in.
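The distinction can be sketched outside the kernel (a toy model with hypothetical names, not the actual gfs2 or iomap code): when the lower layer reports *cumulative* progress across retries, plain assignment is correct, and `read += ret` would double-count earlier progress.

```python
def do_copy(total, progress, step):
    """Model of an operation that returns *cumulative* progress
    across retries, the way the comment says iomap reports it."""
    return min(progress + step, total)

def retry_loop(total, step, accumulate):
    read = 0
    while read < total:
        ret = do_copy(total, read, step)
        if ret <= 0:
            break
        # 'ret' is already the running total, so assignment is right;
        # 'read += ret' counts the earlier progress again.
        read = read + ret if accumulate else ret
    return read

assert retry_loop(10, 3, accumulate=False) == 10   # correct total
assert retry_loop(10, 3, accumulate=True) == 19    # double-counted
```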

> And then at the end:
>
> if (ret < 0)
> return ret;
> return read;
>
> looks wrong too, since maybe the *last* iteration failed, but an
> earlier succeeded, so I think it should be
>
> if (read)
> return read;
> return ret;
>
> But hey, what do I know. I say "looks like obvious bugs", but I don't
> really know the code, and I may just be completely full of sh*t.

That would be great, but applications don't seem to be able to cope
with short direct writes, so we must turn partial failure into total
failure here. There's at least one xfstest that checks for that as
well.
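That convention can be modeled like this (plain Python, hypothetical names, for illustration only): once any chunk of a direct write fails, the whole call reports the error instead of a short byte count.

```python
def direct_write(chunks):
    """chunks: per-chunk results, positive byte counts or a negative
    errno-style error. Returns the total on full success, otherwise
    the error: partial failure becomes total failure, so callers
    never see a short direct write."""
    written = 0
    for ret in chunks:
        if ret < 0:
            return ret          # discard the partial progress
        written += ret
    return written

assert direct_write([4096, 4096]) == 8192
assert direct_write([4096, -5]) == -5   # error reported, not 4096
```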

> The reason I think I'm full of sh*t is that you say that the problem
> occurs in gfs2_file_buffered_write(), not in that
> gfs2_file_direct_write() case.

Right, we're having that issue with buffered writes.

> And the *buffered* case seems to get both of those "obvious bugs" right.
>
> So I must be wrong, but I have to say, that looks odd to me.
>
> Now hopefully somebody who knows the code will explain to me why I'm
> wrong, and in the process go "Duh, but.." and see what's up.

Thanks for pointing out the places that seem odd to you. You'll not be
the only one.

Andreas

2022-04-27 11:33:47

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <[email protected]> wrote:
>
> Also, note that we're fighting with a rare case of data corruption that
> seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> page fault deadlocks for buffered I/O"; merged in v5.16). The
> corruption seems to occur in gfs2_file_buffered_write() when
> fault_in_iov_iter_readable() is involved. We do end up with data that's
> written at an offset, as if after a fault-in, the position in the iocb
> got out of sync with the position in the iov_iter. The user buffer the
> iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned. The code all seems correct, and we haven't been able to figure
> out what's going on so far.

So this may be stupid, but looking at gfs2_file_direct_write(), I see
what looks like two different obvious bugs.

This looks entirely wrong:

if (ret > 0)
read = ret;

because this code is *repeated*.

I think it should use "read += ret;" so that if we have a few
successful reads, they add up.

And then at the end:

if (ret < 0)
return ret;
return read;

looks wrong too, since maybe the *last* iteration failed, but an
earlier succeeded, so I think it should be

if (read)
return read;
return ret;

But hey, what do I know. I say "looks like obvious bugs", but I don't
really know the code, and I may just be completely full of sh*t.

The reason I think I'm full of sh*t is that you say that the problem
occurs in gfs2_file_buffered_write(), not in that
gfs2_file_direct_write() case.

And the *buffered* case seems to get both of those "obvious bugs" right.

So I must be wrong, but I have to say, that looks odd to me.

Now hopefully somebody who knows the code will explain to me why I'm
wrong, and in the process go "Duh, but.." and see what's up.

Linus

2022-04-27 21:34:40

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher
<[email protected]> wrote:
>
> I wonder if this could be documented in the read and write manual
> pages. Or would that be asking too much?

I don't think it would be asking too much, since it's basically just
describing what Linux has always done in all the major filesystems.

Eg look at filemap_read(), which is basically the canonical read
function, and note how it doesn't take a single lock at that level.

We *do* have synchronization at a page level, though, ie we've always
had that page-level "uptodate" bit, of course (ok, so "always" isn't
true - back in the distant past it was the 'struct buffer_head' that
was the synchronization point).

That said, even that is not synchronizing against "new writes", but
only against "new creations" (which may, of course, be writers, but is
equally likely to be just reading the contents from disk).

That said:

(a) different filesystems can and will do different things.

Not all filesystems use filemap_read() at all, and even the ones that
do often have their own wrappers. Such wrappers *can* do extra
serialization, and have their own rules. But ext4 does not, for
example (see ext4_file_read_iter()).

And as mentioned, I *think* XFS honors that old POSIX rule for
historical reasons.

(b) we do have *different* locking

for example, these days we actually serialize properly on the
file->f_pos, which means that a certain *class* of read/write things
are atomic wrt each other, because we actually hold that f_pos lock
over the whole operation and so if you do file reads and writes using
the same file descriptor, they'll be disjoint.

That, btw, hasn't always been true. If you had multiple threads using
the same file pointer, I think we used to get basically random
results. So we have actually strengthened our locking in this area,
and made it much better.

But note how even if you have the same file descriptor open, and then
do pread/pwrite, those can and will happen concurrently.
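That split is easy to observe from user space: pread()/pwrite() take an explicit offset and leave the shared file offset alone (a small standalone demo, not taken from the thread):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")            # shared offset is now 11
    os.lseek(fd, 0, os.SEEK_SET)            # rewind the shared offset
    first = os.pread(fd, 5, 6)              # positioned read at offset 6
    os.pwrite(fd, b"W", 6)                  # positioned write at offset 6
    offset_after = os.lseek(fd, 0, os.SEEK_CUR)
    second = os.pread(fd, 5, 6)
finally:
    os.close(fd)
    os.unlink(path)

assert first == b"world"
assert second == b"World"
assert offset_after == 0    # pread/pwrite never moved the shared offset
```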

And mmap accesses and modifications are obviously *always* concurrent,
even if the fault itself - but not the accesses - might end up being
serialized due to some filesystem locking implementation detail.

End result: the exact serialization is complex, depends on the
filesystem, and is just not really something that should be described
or even relied on (eg that f_pos serialization is something we do
properly now, but didn't necessarily do in the past, so ..)

Is it then worth pointing out one odd POSIX rule that basically nobody
but some very low-level filesystem people have ever heard about, and
that no version of Linux has ever conformed to in the main default
filesystems, and that no user has ever cared about?

Linus

2022-04-27 23:40:16

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 10:26 PM Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > I wonder if this could be documented in the read and write manual
> > pages. Or would that be asking too much?
>
> I don't think it would be asking too much, since it's basically just
> describing what Linux has always done in all the major filesystems.
>
> Eg look at filemap_read(), which is basically the canonical read
> function, and note how it doesn't take a single lock at that level.
>
> We *do* have synchronization at a page level, though, ie we've always
> had that page-level "uptodate" bit, of course (ok, so "always" isn't
> true - back in the distant past it was the 'struct buffer_head' that
> was the synchronization point).
>
> That said, even that is not synchronizing against "new writes", but
> only against "new creations" (which may, of course, be writers, but is
> equally likely to be just reading the contents from disk).
>
> That said:
>
> (a) different filesystems can and will do different things.
>
> Not all filesystems use filemap_read() at all, and even the ones that
> do often have their own wrappers. Such wrappers *can* do extra
> serialization, and have their own rules. But ext4 does not, for
> example (see ext4_file_read_iter()).
>
> And as mentioned, I *think* XFS honors that old POSIX rule for
> historical reasons.
>
> (b) we do have *different* locking
>
> for example, these days we actually serialize properly on the
> file->f_pos, which means that a certain *class* of read/write things
> are atomic wrt each other, because we actually hold that f_pos lock
> over the whole operation and so if you do file reads and writes using
> the same file descriptor, they'll be disjoint.
>
> That, btw, hasn't always been true. If you had multiple threads using
> the same file pointer, I think we used to get basically random
> results. So we have actually strengthened our locking in this area,
> and made it much better.
>
> But note how even if you have the same file descriptor open, and then
> do pread/pwrite, those can and will happen concurrently.
>
> And mmap accesses and modifications are obviously *always* concurrent,
> even if the fault itself - but not the accesses - might end up being
> serialized due to some filesystem locking implementation detail.
>
> End result: the exact serialization is complex, depends on the
> filesystem, and is just not really something that should be described
> or even relied on (eg that f_pos serialization is something we do
> properly now, but didn't necessarily do in the past, so ..)
>
> Is it then worth pointing out one odd POSIX rule that basically nobody
> but some very low-level filesystem people have ever heard about, and
> that no version of Linux has ever conformed to in the main default
> filesystems, and that no user has ever cared about?

Well, POSIX explicitly mentions those atomicity expectations, e.g.,
for read [1]:

"I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
means that all the bytes from a single operation that started out together
end up together, without interleaving from other I/O operations."

Users who hear about it from POSIX are led to assume that this atomic
behavior is "real", and the Linux man pages do nothing to rob them of
that illusion. They do document that the file offset aspect has been
fixed though, which only makes things more confusing.

So from that point of view, I think it would be worthwhile to mention
that most if not all filesystems ignore the "non-interleaving" aspect.

[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/read.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/write.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap02.html#tag_15_09_07

Thanks,
Andreas

2022-04-28 02:15:55

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 2:26 PM Andreas Gruenbacher <[email protected]> wrote:
>
> Well, POSIX explicitly mentions those atomicity expectations, e.g.,
> for read [1]:

Yes. I'm aware. And my point is that we've never done _that_ kind of atomicity.

It's also somewhat ambiguous what it actually means, since what it
then talks about is "all bytes that started out together ends
together" and "interleaving".

That all implies that it's about the *position* of the reads and
writes being atomic, not the *data* of the reads and writes.

That, btw, was something we honored even before we had the locking
around f_pos accesses - a read or write system call would get its own
local *copy* of the file position, the read or write would then do that
IO based on that copied position - so that things that "started out
together ends together" - and then after the operation is done it
would *update* the file position atomically.

Note that that is exactly so that data would end up "together". But it
would mean that two concurrent reads using the same file position
might read the *same* area of the file.

Which still honors that "the read is atomic wrt the range", but
obviously the actual value of "f_pos" is basically random after the
read (ie is it the end of the first read, or the end of the second
read?).

The same paragraph also explicitly mentions pipes and FIFOs, despite
an earlier paragraph dismissing them, which is all just a sign of
things being very confused.

Anyway, I'm not objecting very strenuously to making it very clear in
some documentation that this "data atomicity" is not what Linux has
ever done. If you do overlapping IO, you get what you deserve.

But I do have objections.

On one hand, it's not all that different from some of the other notes
we have in the man-pages (ie documenting that whole "just under 2GB"
limit on the read size, although that's actually using the wrong
constant: it's not 0x7ffff000 bytes, it's MAX_RW_COUNT, which is
"INT_MAX & PAGE_MASK" and that constant in the man-page is as such
only true on a system with 4kB page sizes)

BUT! I'm 100% convinced that NOBODY HAS EVER given the kind of
atomicity guarantees that you would see from reading that document as
a language-lawyer.

For example, that section "2.9.7 Thread Interactions with Regular File
Operations" says that "fstat()" is atomic wrt "write()", and that you
should see "all or nothing".

I *GUARANTEE* that no operating system ever has done that, and I
further claim that reading it the way you read it is not only against
reality, it's against sanity.

Example: if I do a big write to a file that I just created, do you
really want "fstat()" in another thread or process to not even be able
to see how the file grows as the write happens?

It's not what anybody has *EVER* done, I'm pretty sure.
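The observable behavior is trivial to check: st_size grows as data lands, rather than jumping from zero to the final size in one atomic step. A sketch, using several write() calls to stand in for one internally-chunked big write:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    sizes = []
    for _ in range(4):
        os.write(fd, b"x" * 4096)            # one chunk of the "big write"
        sizes.append(os.fstat(fd).st_size)   # fstat sees each intermediate size
finally:
    os.close(fd)
    os.unlink(path)

# the file visibly grows; there is no all-or-nothing jump
assert sizes == [4096, 8192, 12288, 16384]
```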

So I really think

(a) you are mis-reading the standard by attributing too strong logic
to paperwork that is English prose and not so exact

(b) documenting Linux as not doing what you are mis-reading it for is
only encouraging others to mis-read it too

The whole "arbitrary writes have to be all-or-nothing wrt all other
system calls" is simply not realistic, and has never been. Not just
not in Linux, but in *ANY* operating system that POSIX was meant to
describe.

And equally importantly: if some crazy person were to actually try to
implement such "true atomicity" things, the end result would be
objectively worse. Because you literally *want* to see a big write()
updating the file length as the write happens.

The fact that the standard then doesn't take those kinds of details
into account is simply because the standard isn't meant to be read as
a language lawyer, but as a "realistically .."

Linus

2022-04-28 04:59:52

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds
<[email protected]> wrote:
>
> So I really think
>
> (a) you are mis-reading the standard by attributing too strong logic
> to paperwork that is English prose and not so exact
>
> (b) documenting Linux as not doing what you are mis-reading it for is
> only encouraging others to mis-read it too
>
> The whole "arbitrary writes have to be all-or-nothing wrt all other
> system calls" is simply not realistic, and has never been. Not just
> not in Linux, but in *ANY* operating system that POSIX was meant to
> describe.

Side note: a lot of those "atomic" things in that documentation have
come from a history of signal handling atomicity issues, and from all
the issues people had with (a) user-space threading implementations
and (b) emulation layers from non-Unixy environments.

So when they say that things like "rename()" has to be all-or-nothing,
it's to clarify that you can't emulate it as a "link and delete
original" kind of operation (which old UNIX *did* do) and claim to be
POSIX.

Because while the end result of rename() and link()+unlink() might be
similar, people did rely on that whole "use rename as a way to create
an atomic marker in the filesystem" (which is a very traditional UNIX
pattern).

So "rename()" has to be atomic, and the legacy behavior of link+unlink
is not valid in POSIX.

Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
because that doesn't work if somebody else is doing another "pread()"
on the same file descriptor concurrently.

Again, people *did* implement exactly those kinds of implementations
of "pread()", and yes, they were broken for both signals and for
threading.

So there's "atomicity" and then there is "atomicity".

That "all or nothing" can be a very practical thing to describe
*roughly* how it must work on a higher level, or it can be a
theoretical "transactional" thing that works literally like a database
where the operation happens in full and you must not see any
intermediate state.

And no, "write()" and friends have never ever been about some
transactional operation where you can't see how the file grows as it
is being written to. That kind of atomicity has simply never existed,
not even in theory.

So when you see POSIX saying that a "read()" system call is "atomic",
you should *not* see it as a transaction thing, but see it in the
historical context of "people used to do threading libraries in user
space, and since they didn't want a big read() to block all other
threads, they'd split it up into many smaller reads and now another
thread *also* doing 'read()' system calls would see the data it read
being not one contiguous region, but multiple regions where the file
position changed in the middle".

Similarly, a "read()" system call will not be interrupted by a signal
in the middle, where the signal handler would do a "lseek()" or
another "read()", and now the original "read()" data suddenly is
affected.

That's why things like that whole "f_pos is atomic" are a big deal.

Because there literally were threading libraries (and badly emulated
environments) where that *WASN'T* the case, and _that_ is why POSIX
then talks about it.

So think of POSIX not as some hard set of "this is exactly how things
work and we describe every detail".

Instead, treat it a bit like historians treat Herodotus - interpreting
his histories by taking the issues of the time into account. POSIX is
trying to clarify and document the problems of the time it was
written, and taking other things for granted.

Linus

2022-04-29 00:15:37

by pr-tracker-bot

Subject: Re: Re: [GIT PULL] gfs2 fix

The pull request you sent on Thu, 28 Apr 2022 15:26:58 +0200:

> https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4a2316a1eda4ef3ced18c7f08f7cb3726bcae44b

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2022-04-29 00:48:21

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds
<[email protected]> wrote:
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not somethign else" would be good.

Oh, and as I do that, it strikes me: What platform do you see the problem on?

Because the code for that function is very different if HIGHMEM is
enabled, so if you see this on x86-32 but not x86-64, for example,
that is relevant.

I *assume* nobody uses x86-32 any more, but just checking...

Linus

2022-04-29 01:05:36

by Linus Torvalds

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <[email protected]> wrote:
>
> The data corruption we've been getting unfortunately didn't have to do
> with lock contention (we already knew that); it still occurs. I'm
> running out of ideas on what to try there.

Hmm.

I don't see the bug, but I do have a suggestion on something to try.

In particular, you said the problem started with commit 00bfe02f4796
("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

And to me, I see two main things that are going on

(a) the obvious "calling generic IO functions with pagefault disabled" thing

(b) the "allow demotion" thing

And I wonder if you could at least pinpoint which of the cases it is
that triggers it.

So I'd love to see you try three things:

(1) just remove the "allow demotion" cases.

This will re-introduce the deadlock the commit is trying to fix,
but that's such a special case that I assume you can run your
test-suite that shows the problem even without that fix in place?

This would just pinpoint whether it's due to some odd locking issue or not.

Honestly, from how you describe the symptoms, I don't think (1) is the
cause, but I think making sure is good.

It sounds much more likely that it's one of those generic vfs
functions that screws up when a page fault happens and it gets a
partial result instead of handling the fault.

Which gets us to

(2) remove the pagefault_disable/enable() around just the
generic_file_read_iter() case in gfs2_file_read_iter().

and

(3) finally, remove the pagefault_disable/enable() around the
iomap_file_buffered_write() case in gfs2_file_buffered_write()

Yeah, yeah, you say it's just the read that fails, but humor me on
(3), just in case it's an earlier write in your test-suite and the
read just then uncovered it.

But I put it as (3) so that you'd do the obvious (2) case first, and
narrow it down (ie if (1) still shows the bug, then do (2), and if
that fixes the bug it will be fairly well pinpointed to
generic_file_read_iter().

Looking around, gfs2 is the only thing that obviously calls
generic_file_read_iter() with pagefaults disabled, so it does smell
like filemap_read() might have some issue, but the only thing that
does is basically that

copied = copy_folio_to_iter(folio, offset, bytes, iter);

which should just become copy_page_to_iter_iovec(), which you'd hope
would get things right.

But it would be good to just narrow things down a bit.

I'll look at that copy_page_to_iter_iovec() some more regardless, but
doing that "let's double-check it's not something else" would be good.

Linus

2022-04-29 04:15:26

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 2:00 AM Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds <[email protected]> wrote:
> >
> > So I really think
> >
> > (a) you are mis-reading the standard by attributing too strong logic
> > to paperwork that is English prose and not so exact
> >
> > (b) documenting Linux as not doing what you are mis-reading it for is
> > only encouraging others to mis-read it too
> >
> > The whole "arbitrary writes have to be all-or-nothing wrt all other
> > system calls" is simply not realistic, and has never been. Not just
> > not in Linux, but in *ANY* operating system that POSIX was meant to
> > describe.
>
> Side note: a lot of those "atomic" things in that documentation have
> come from a history of signal handling atomicity issues, and from all
> the issues people had with (a) user-space threading implementations
> and (b) emulation layers from non-Unixy environments.
>
> So when they say that things like "rename()" has to be all-or-nothing,
> it's to clarify that you can't emulate it as a "link and delete
> original" kind of operation (which old UNIX *did* do) and claim to be
> POSIX.
>
> Because while the end result of rename() and link()+unlink() might be
> similar, people did rely on that whole "use rename as a way to create
> an atomic marker in the filesystem" (which is a very traditional UNIX
> pattern).
>
> So "rename()" has to be atomic, and the legacy behavior of link+unlink
> is not valid in POSIX.
>
> Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
> because that doesn't work if somebody else is doing another "pread()"
> on the same file descriptor concurrently.
>
> Again, people *did* implement exactly those kinds of implementations
> of "pread()", and yes, they were broken for both signals and for
> threading.
>
> So there's "atomicity" and then there is "atomicity".
>
> That "all or nothing" can be a very practical thing to describe
> *roughly* how it must work on a higher level, or it can be a
> theoretical "transactional" thing that works literally like a database
> where the operation happens in full and you must not see any
> intermediate state.
>
> And no, "write()" and friends have never ever been about some
> transactional operation where you can't see how the file grows as it
> is being written to. That kind of atomicity has simply never existed,
> not even in theory.
>
> So when you see POSIX saying that a "read()" system call is "atomic",
> you should *not* see it as a transaction thing, but see it in the
> historical context of "people used to do threading libraries in user
> space, and since they didn't want a big read() to block all other
> threads, they'd split it up into many smaller reads and now another
> thread *also* doing 'read()' system calls would see the data it read
> being not one contiguous region, but multiple regions where the file
> position changed in the middle".
>
> Similarly, a "read()" system call will not be interrupted by a signal
> in the middle, where the signal handler would do a "lseek()" or
> another "read()", and now the original "read()" data suddenly is
> affected.
>
> That's why things like that whole "f_pos is atomic" are a big deal.
>
> Because there literally were threading libraries (and badly emulated
> environments) where that *WASN'T* the case, and _that_ is why POSIX
> then talks about it.
>
> So think of POSIX not as some hard set of "this is exactly how things
> work and we describe every detail".
>
> Instead, treat it a bit like historians treat Herodotus - interpreting
> his histories by taking the issues of the time into account. POSIX is
> trying to clarify and document the problems of the time it was
> written, and taking other things for granted.

Okay fine, thanks for elaborating.

Would you mind pulling the following fix to straighten this out?

The data corruption we've been getting unfortunately didn't have to do
with lock contention (we already knew that); it still occurs. I'm
running out of ideas on what to try there.

Thanks a lot,
Andreas

--

The following changes since commit 4fad37d595b9d9a2996467d780cb2e7a1b08b2c0:

Merge tag 'gfs2-v5.18-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2022-04-26 11:17:18 -0700)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix2

for you to fetch changes up to 296abc0d91d8b65d42224dd33452ace14491ad08:

gfs2: No short reads or writes upon glock contention (2022-04-28 15:14:48 +0200)

----------------------------------------------------------------
gfs2 fix

- No short reads or writes upon glock contention

----------------------------------------------------------------
Andreas Gruenbacher (1):
gfs2: No short reads or writes upon glock contention

fs/gfs2/file.c | 4 ----
1 file changed, 4 deletions(-)

2022-04-29 06:58:40

by Andreas Gruenbacher

Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 7:09 PM Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > The data corruption we've been getting unfortunately didn't have to do
> > with lock contention (we already knew that); it still occurs. I'm
> > running out of ideas on what to try there.
>
> Hmm.
>
> I don't see the bug, but I do have a suggestion on something to try.
>
> In particular, you said the problem started with commit 00bfe02f4796
> ("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

Yes, but note that it's gfs2_file_buffered_write() that fails. When
the pagefault_disable/enable() around iomap_file_buffered_write() is
removed, the corruption goes away.
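The suspected failure mode can be modeled in miniature (plain Python with hypothetical names, not the kernel code): when a copy stops short because page faults are disabled, the resume position and the iterator's consumed count must advance in lockstep, or the next chunk lands at the wrong offset.

```python
class Iter:
    """Toy stand-in for an iov_iter: a buffer plus a consumed count."""
    def __init__(self, buf):
        self.buf = buf
        self.consumed = 0

    def copy_in(self, limit):
        """Copy up to 'limit' bytes; limit 0 models a disabled page fault."""
        take = min(limit, len(self.buf) - self.consumed)
        chunk = self.buf[self.consumed:self.consumed + take]
        self.consumed += take
        return chunk

def buffered_write(dst, pos, it):
    limit = 0                       # the first copy "faults" immediately
    while it.consumed < len(it.buf):
        chunk = it.copy_in(limit)
        if not chunk:
            limit = len(it.buf)     # model fault_in succeeding; retry
            continue
        dst[pos:pos + len(chunk)] = chunk
        pos += len(chunk)           # must stay in sync with it.consumed
    return bytes(dst)

out = buffered_write(bytearray(8), 0, Iter(b"abcdefgh"))
assert out == b"abcdefgh"           # a desync here would shift the data
```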

> And to me, I see two main things that are going on
>
> (a) the obvious "calling generic IO functions with pagefault disabled" thing
>
> (b) the "allow demotion" thing
>
> And I wonder if you could at least pinpoint which of the cases it is
> that triggers it.
>
> So I'd love to see you try three things:
>
> (1) just remove the "allow demotion" cases.
>
> This will re-introduce the deadlock the commit is trying to fix,
> but that's such a special case that I assume you can run your
> test-suite that shows the problem even without that fix in place?
>
> This would just pinpoint whether it's due to some odd locking issue or not.
>
> Honestly, from how you describe the symptoms, I don't think (1) is the
> cause, but I think making sure is good.
>
> It sounds much more likely that it's one of those generic vfs
> functions that screws up when a page fault happens and it gets a
> partial result instead of handling the fault.

The test should run just fine without allowing demotion. I'll try (1),
but I don't expect the outcome to change.

> Which gets us to
>
> (2) remove the pagefault_disable/enable() around just the
> generic_file_read_iter() case in gfs2_file_read_iter().
>
> and
>
> (3) finally, remove the pagefault_disable/enable() around the
> iomap_file_buffered_write() case in gfs2_file_buffered_write()
>
> Yeah, yeah, you say it's just the read that fails, but humor me on
> (3), just in case it's an earlier write in your test-suite and the
> read just then uncovered it.
>
> But I put it as (3) so that you'd do the obvious (2) case first, and
> narrow it down (ie if (1) still shows the bug, then do (2), and if
> that fixes the bug it will be fairly well pinpointed to
> generic_file_read_iter().

As mentioned above, we already did (3) and it didn't help. I'll do (1)
now, and then (2).

> Looking around, gfs2 is the only thing that obviously calls
> generic_file_read_iter() with pagefaults disabled, so it does smell
> like filemap_read() might have some issue, but the only thing that
> does is basically that
>
> copied = copy_folio_to_iter(folio, offset, bytes, iter);
>
> which should just become copy_page_to_iter_iovec(), which you'd hope
> would get things right.
>
> But it would be good to just narrow things down a bit.
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not something else" would be good.

We've actually been running most of our experiments on a 5.14-based
kernel with a plethora of backports, so pre-folio. Sorry I forgot to
mention that. I'll reproduce with mainline as well.

Thanks,
Andreas

2022-04-29 16:15:19

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 7:17 PM Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds <[email protected]> wrote:
> >
> > I'll look at that copy_page_to_iter_iovec() some more regardless, but
> > doing that "let's double-check it's not something else" would be good.
>
> Oh, and as I do that, it strikes me: What platform do you see the problem on?
>
> Because the code for that function is very different if HIGHMEM is
> enabled, so if you see this on x86-32 but not x86-64, for example,
> that is relevant.
>
> I *assume* nobody uses x86-32 any more, but just checking...

This is on x86_64.

Andreas

2022-05-03 00:10:10

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Mon, May 2, 2022 at 8:58 PM Linus Torvalds
<[email protected]> wrote:
> On Mon, May 2, 2022 at 11:31 AM Linus Torvalds <[email protected]> wrote:
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different.
>
> Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
> as well that I brought in the iomap people.
>
> And the iomap code has many different cases. Some of them do
>
> if (unlikely(copied < len && !folio_test_uptodate(folio)))
> return 0;
>
> to force the whole IO to be re-done (looks sane - that's the "the
> underlying folio wasn't uptodate, because we expected the write to
> make it so").
>
> And that might not have happened before, but it looks like gfs2 does
> actually try to deal with that case.
>
> But since Andreas said originally that the IO wasn't aligned, I don't
> think that "not uptodate" case is what is going on, and it's more
> about some "partial write in the middle of a buffer succeeded"
>
> And the code also has things like
>
> if (ret < len)
> iomap_write_failed(iter->inode, pos, len);
>
> which looks very wrong - it's not that the write failed, it's just
> incomplete because it was done with page faults disabled. It seems to
> try to do some page cache truncation based on the original 'len', but
> not taking the successful part into account. Which all sounds
> horrifically wrong.
>
> But I don't know the code well enough to really judge. It just makes
> me uncomfortable, and I do suspect this code may be quite buggy if the
> copy of the full 'len' doesn't succeed.

This has thrown me off in the past as well; it should be changed to
iomap_write_failed(iter->inode, pos + ret, len - ret) for legibility.
However, iomap_write_failed() only truncates past EOF and is preceded
by i_size_write(iter->inode, pos + ret) here, so it's not strictly a
bug.

> Again, the patch I sent only _hides_ any issues and makes them
> practically impossible to see. It doesn't really _fix_ anything, since
> - as mentioned - regardless of fault_in_iov_iter_readable()
> succeeding, racing with page-out could then cause the later
> copy_page_from_iter_atomic() to have a partial copy anyway.

Indeed. Let's see what we'll get with it.

In the meantime, we've reproduced with 5.18-rc4 + commit 296abc0d91d8
("gfs2: No short reads or writes upon glock contention"), and it still
has the data corruption.

> And hey, maybe there's something entirely different going on, and my
> "Heureka! It might be explained by that partial write_end that
> generally didn't happen before" is only my shouting at windmills.
>
> Linus
>

2022-05-03 00:41:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Mon, May 2, 2022 at 11:31 AM Linus Torvalds
<[email protected]> wrote:
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different.

Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
as well that I brought in the iomap people.

And the iomap code has many different cases. Some of them do

if (unlikely(copied < len && !folio_test_uptodate(folio)))
return 0;

to force the whole IO to be re-done (looks sane - that's the "the
underlying folio wasn't uptodate, because we expected the write to
make it so").

And that might not have happened before, but it looks like gfs2 does
actually try to deal with that case.

But since Andreas said originally that the IO wasn't aligned, I don't
think that "not uptodate" case is what is going on, and it's more
about some "partial write in the middle of a buffer succeeded"

And the code also has things like

if (ret < len)
iomap_write_failed(iter->inode, pos, len);

which looks very wrong - it's not that the write failed, it's just
incomplete because it was done with page faults disabled. It seems to
try to do some page cache truncation based on the original 'len', but
not taking the successful part into account. Which all sounds
horrifically wrong.

But I don't know the code well enough to really judge. It just makes
me uncomfortable, and I do suspect this code may be quite buggy if the
copy of the full 'len' doesn't succeed.

Again, the patch I sent only _hides_ any issues and makes them
practically impossible to see. It doesn't really _fix_ anything, since
- as mentioned - regardless of fault_in_iov_iter_readable()
succeeding, racing with page-out could then cause the later
copy_page_from_iter_atomic() to have a partial copy anyway.

And hey, maybe there's something entirely different going on, and my
"Heureka! It might be explained by that partial write_end that
generally didn't happen before" is only my shouting at windmills.

Linus

2022-05-03 00:55:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher
<[email protected]> wrote:
>
> Yes, but note that it's gfs2_file_buffered_write() that fails. When
> the pagefault_disable/enable() around iomap_file_buffered_write() is
> removed, the corruption goes away.

I looked some more at this on and off, and ended up even more confused.

For some reason, I'd mostly looked at the read case, because I had
mis-read some of your emails and thought it was the buffered reads
that caused problems.

Then I went back more carefully, and realized you had always said
gfs2_file_buffered_write() was where the issues happened, and looked
at that path more, and that confused me even *MORE*.

Because that case has always done the copy from user space with page
faults disabled, because of the traditional deadlock with reading from
user space while holding the page lock on the target page cache page.

So that is not really about the new deadlock with filesystem locks,
that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
for buffered I/O").

So now that I'm looking at the right function (maybe) I'm going "huh",
because it's none of the complex cases that would seem to fail, it's
literally just the fault_in_iov_iter_readable() that we've always done
in iomap_write_iter() that presumably starts failing.

But *that* old code seems bogus too. It's doing

if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
status = -EFAULT;
break;
}

which on the face of it is sane: it's saying "if we can't fault in any
bytes, then stop trying".

And it's good, and correct, but it does leave one case open.

Because what if the result is "we can fault things in _partially_"?

The code blithely goes on and tries to do the whole 'bytes' range _anyway_.

Now, with a bug-free filesystem, this really shouldn't matter, since
the later copy_page_from_iter_atomic() thing should then DTRT anyway,
but this does mean that one fundamental thing that that commit
00bfe02f4796 changed is that it basically disabled that
fault_in_iov_iter_readable() that *used* to fault in the whole range,
and now potentially only faults in a small area.

That, in turn, means that in practice it *used* to do "write_end()"
with a fully successful range, ie when it did that

status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);

then "bytes" and "copied" were the same.

But now that commit 00bfe02f4796 added the "disable_pagefault()"
around the whole thing, fault_in_iov_iter_readable() will easily fail
half-way instead of bringing the next page in, and then that
->write_begin() to ->write_end() sequence will see the copy in the
middle failing half-way too, and you'll have that write_end()
condition with the write _partially_ succeeding.

Which is the complex case for write_end() that you practically
speaking never saw before (it *could* happen with a race with swap-out
or similar, but it was not really something you could trigger in real
life).

And I suspect this is what bites you with gfs2.

To *test* that hypothesis, how about you try this attached patch? The
generic_perform_write() function in mm/filemap.c has the same exact
pattern, but as mentioned, a filesystem really needs to be able to
handle the partial write_end() case, so it's not a *bug* in that code,
but it might be triggering a bug in gfs2.

And gfs2 only uses the iomap_write_iter() case, I think. So that's the
only case this attached patch changes.

Again - I think the unpatched iomap_write_iter() code is fine, but I
think it may be what then triggers the real bug in gfs2. So this patch
is not wrong per se, but this patch is basically a "hide the problem"
patch, and it would be very interesting to hear if it does indeed fix
your test-case.

Because that would pinpoint exactly what the bug is.

I'm adding Christoph and Darrick as iomap maintainers here to the
participants (and Dave Chinner in case he's also the temporary
maintainer because Darrick is doing reviews) not because they
necessarily care, but just because this test-patch obviously involves
the iomap code.

NOTE! This patch is entirely untested. I also didn't actually yet go
look at what gfs2 does when 'bytes' and 'copied' are different. But
since I finally think I figured out what might be going on, I decided
I'd send this out sooner rather than later.

Because this is the first thing that makes me go "Aaahh.. This might
explain it".

Linus


Attachments:
patch.diff (906.00 B)

2022-05-03 11:23:11

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Mon, May 2, 2022 at 8:32 PM Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > removed, the corruption goes away.
>
> I looked some more at this on and off, and ended up even more confused.
>
> For some reason, I'd mostly looked at the read case, because I had
> mis-read some of your emails and thought it was the buffered reads
> that caused problems.
>
> Then I went back more carefully, and realized you had always said
> gfs2_file_buffered_write() was where the issues happened, and looked
> at that path more, and that confused me even *MORE*.
>
> Because that case has always done the copy from user space with page
> faults disabled, because of the traditional deadlock with reading from
> user space while holding the page lock on the target page cache page.
>
> So that is not really about the new deadlock with filesystem locks,
> that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> for buffered I/O").
>
> So now that I'm looking at the right function (maybe) I'm going "huh",
> because it's none of the complex cases that would seem to fail, it's
> literally just the fault_in_iov_iter_readable() that we've always done
> in iomap_write_iter() that presumably starts failing.
>
> But *that* old code seems bogus too. It's doing
>
> if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> status = -EFAULT;
> break;
> }
>
> which on the face of it is sane: it's saying "if we can't fault in any
> bytes, then stop trying".
>
> And it's good, and correct, but it does leave one case open.
>
> Because what if the result is "we can fault things in _partially_"?
>
> The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
>
> Now, with a bug-free filesystem, this really shouldn't matter, since
> the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> but this does mean that one fundamental thing that that commit
> 00bfe02f4796 changed is that it basically disabled that
> fault_in_iov_iter_readable() that *used* to fault in the whole range,
> and now potentially only faults in a small area.
>
> That, in turn, means that in practice it *used* to do "write_end()"
> with a fully successful range, ie when it did that
>
> status = a_ops->write_end(file, mapping, pos, bytes, copied,
> page, fsdata);
>
> then "bytes" and "copied" were the same.
>
> But now that commit 00bfe02f4796 added the "disable_pagefault()"
> around the whole thing, fault_in_iov_iter_readable() will easily fail
> half-way instead of bringing the next page in, and then that
> ->write_begin() to ->write_end() sequence will see the copy in the
> middle failing half-way too, and you'll have that write_end()
> condition with the write _partially_ succeeding.
>
> Which is the complex case for write_end() that you practically
> speaking never saw before (it *could* happen with a race with swap-out
> or similar, but it was not really something you could trigger in real
> life).
>
> And I suspect this is what bites you with gfs2.
>
> To *test* that hypothesis, how about you try this attached patch? The
> generic_perform_write() function in mm/filemap.c has the same exact
> pattern, but as mentioned, a filesystem really needs to be able to
> handle the partial write_end() case, so it's not a *bug* in that code,
> but it might be triggering a bug in gfs2.
>
> And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> only case this attached patch changes.
>
> Again - I think the unpatched iomap_write_iter() code is fine, but I
> think it may be what then triggers the real bug in gfs2. So this patch
> is not wrong per se, but this patch is basically a "hide the problem"
> patch, and it would be very interesting to hear if it does indeed fix
> your test-case.

We still get data corruption with the patch applied. The
WARN_ON_ONCE(!bytes) doesn't trigger.

As an additional experiment, I've added code to check the iterator
position that iomap_file_buffered_write() returns, and it's all
looking good as well: an iov_iter_advance(orig_from, written) from the
original position always gets us to the same iterator.

This points at gfs2 getting things wrong after a short write, for
example, marking a page / folio uptodate that isn't. But the uptodate
handling happens at the iomap layer, so this doesn't leave me with an
immediate suspect.

We're on filesystems with block size == page size, so none of the
struct iomap_page uptodate handling should be involved, either.

> Because that would pinpoint exactly what the bug is.
>
> I'm adding Christoph and Darrick as iomap maintainers here to the
> participants (and Dave Chinner in case he's also the temporary
> maintainer because Darrick is doing reviews) not because they
> necessarily care, but just because this test-patch obviously involves
> the iomap code.
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different. But
> since I finally think I figured out what might be going on, I decided
> I'd send this out sooner rather than later.
>
> Because this is the first thing that makes me go "Aaahh.. This might
> explain it".
>
> Linus

Thanks,
Andreas

2022-05-03 17:34:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 9:41 AM Andreas Gruenbacher <[email protected]> wrote:
>
> That's happening in iomap_file_buffered_write() and iomap_iter():
>
> while ((ret = iomap_iter(&iter, ops)) > 0)
> iter.processed = iomap_write_iter(&iter, i);
>
> Here, iomap_write_iter() returns how much progress it has made, which
> is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
> then updates iter.pos and iter.len based on iter.processed.

Ahh. I even had that file open in my editor in another window, but missed it.

So much for that theory.

Linus

2022-05-03 21:26:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <[email protected]> wrote:
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.

Oh well. I was so sure that I'd finally found something.. That partial
write case has had bugs before.

> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.

Yeah, I've looked at the iterator parts (and iov_iter_revert() in
particular) multiple times, because that too is an area where we've
had bugs before.

That too may be easy to get wrong, but I couldn't for the life of me
see any issues there.

> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.

Yeah, the uptodate setting looked safe, particularly with that "if we
copied less than we thought we would, and it wasn't uptodate, just
claim we didn't do anything at all".

That said, I now have a *new* suspect: the 'iter->pos' handling in
iomap_write_iter().

In particular, let's look at iomap_file_buffered_write(), which does:

while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_write_iter(&iter, i);

and then look at what happens to iter.pos here.

iomap_write_iter() does this:

loff_t pos = iter->pos;
...
pos += status;

but it never seems to write the updated position back to the iterator.

So what happens next time iomap_write_iter() gets called?

This looks like such a huge bug that I'm probably missing something,
but I wonder if this is normally hidden by the fact that usually
iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
loop, it's actually effectively only called once.

Except if it gets a short write due to an unhandled page fault..

Am I entirely blind, and that 'iter.pos' is updated somewhere and I
just missed it?

Or is this maybe the reason for it all?

Linus

2022-05-03 23:17:04

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <[email protected]> wrote:
> On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <[email protected]> wrote:
> > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <[email protected]> wrote:
> > >
> > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > removed, the corruption goes away.
> >
> > I looked some more at this on and off, and ended up even more confused.
> >
> > For some reason, I'd mostly looked at the read case, because I had
> > mis-read some of your emails and thought it was the buffered reads
> > that caused problems.
> >
> > Then I went back more carefully, and realized you had always said
> > gfs2_file_buffered_write() was where the issues happened, and looked
> > at that path more, and that confused me even *MORE*.
> >
> > Because that case has always done the copy from user space with page
> > faults disabled, because of the traditional deadlock with reading from
> > user space while holding the page lock on the target page cache page.
> >
> > So that is not really about the new deadlock with filesystem locks,
> > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > for buffered I/O").
> >
> > So now that I'm looking at the right function (maybe) I'm going "huh",
> > because it's none of the complex cases that would seem to fail, it's
> > literally just the fault_in_iov_iter_readable() that we've always done
> > in iomap_write_iter() that presumably starts failing.
> >
> > But *that* old code seems bogus too. It's doing
> >
> > if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> > status = -EFAULT;
> > break;
> > }
> >
> > which on the face of it is sane: it's saying "if we can't fault in any
> > bytes, then stop trying".
> >
> > And it's good, and correct, but it does leave one case open.
> >
> > Because what if the result is "we can fault things in _partially_"?
> >
> > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
> >
> > Now, with a bug-free filesystem, this really shouldn't matter, since
> > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > but this does mean that one fundamental thing that that commit
> > 00bfe02f4796 changed is that it basically disabled that
> > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > and now potentially only faults in a small area.
> >
> > That, in turn, means that in practice it *used* to do "write_end()"
> > with a fully successful range, ie when it did that
> >
> > status = a_ops->write_end(file, mapping, pos, bytes, copied,
> > page, fsdata);
> >
> > then "bytes" and "copied" were the same.
> >
> > But now that commit 00bfe02f4796 added the "disable_pagefault()"
> > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > half-way instead of bringing the next page in, and then that
> > ->write_begin() to ->write_end() sequence will see the copy in the
> > middle failing half-way too, and you'll have that write_end()
> > condition with the write _partially_ succeeding.
> >
> > Which is the complex case for write_end() that you practically
> > speaking never saw before (it *could* happen with a race with swap-out
> > or similar, but it was not really something you could trigger in real
> > life).
> >
> > And I suspect this is what bites you with gfs2.
> >
> > To *test* that hypothesis, how about you try this attached patch? The
> > generic_perform_write() function in mm/filemap.c has the same exact
> > pattern, but as mentioned, a filesystem really needs to be able to
> > handle the partial write_end() case, so it's not a *bug* in that code,
> > but it might be triggering a bug in gfs2.
> >
> > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > only case this attached patch changes.
> >
> > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > think it may be what then triggers the real bug in gfs2. So this patch
> > is not wrong per se, but this patch is basically a "hide the problem"
> > patch, and it would be very interesting to hear if it does indeed fix
> > your test-case.
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.
>
> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.
>
> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.
>
> We're on filesystems with block size == page size, so none of the
> struct iomap_page uptodate handling should be involved, either.

The rounding around the hole punching in gfs2_iomap_end() looks wrong.
I'm trying a fix now.

> > Because that would pinpoint exactly what the bug is.
> >
> > I'm adding Christoph and Darrick as iomap maintainers here to the
> > participants (and Dave Chinner in case he's also the temporary
> > maintainer because Darrick is doing reviews) not because they
> > necessarily care, but just because this test-patch obviously involves
> > the iomap code.
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different. But
> > since I finally think I figured out what might be going on, I decided
> > I'd send this out sooner rather than later.
> >
> > Because this is the first thing that makes me go "Aaahh.. This might
> > explain it".
> >
> > Linus
>
> Thanks,
> Andreas

2022-05-04 08:50:12

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 3:30 PM Andreas Gruenbacher <[email protected]> wrote:
> On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <[email protected]> wrote:
> > On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <[email protected]> wrote:
> > > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <[email protected]> wrote:
> > > >
> > > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > > removed, the corruption goes away.
> > >
> > > [...]
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
> >
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
> >
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
> >
> > We're on filesystems with block size == page size, so none of the
> > struct iomap_page uptodate handling should be involved, either.
>
> The rounding around the hole punching in gfs2_iomap_end() looks wrong.
> I'm trying a fix now.

More testing still ongoing, but the following patch seems to fix the
data corruption. Provided that things go well, I'll send a pull request
tomorrow.

Thanks a lot for the huge amount of help!

Andreas

-

gfs2: Short write fix

When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.

To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation. It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.

In addition, use round_up() for computing the start of the unused range.

Signed-off-by: Andreas Gruenbacher <[email protected]>

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 39080b2d6cf8..6abd044a176d 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1153,14 +1153,12 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,

 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
 		/* Deallocate blocks that were just allocated. */
-		loff_t blockmask = i_blocksize(inode) - 1;
-		loff_t end = (pos + length) & ~blockmask;
+		loff_t hstart = round_up(pos + written, i_blocksize(inode));
+		loff_t hend = iomap->offset + iomap->length;

-		pos = (pos + written + blockmask) & ~blockmask;
-		if (pos < end) {
-			truncate_pagecache_range(inode, pos, end - 1);
-			punch_hole(ip, pos, end - pos);
-		}
+		truncate_pagecache_range(inode, hstart, hend - 1);
+		if (hstart < hend)
+			punch_hole(ip, hstart, hend - hstart);
 	}

 	if (unlikely(!written))



2022-05-04 13:30:08

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 6:19 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
>
> Oh well. I was so sure that I'd finally found something.. That partial
> write case has had bugs before.
>
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
>
> Yeah, I've looked at the iterator parts (and iov_iter_revert() in
> particular) multiple times, because that too is an area where we've
> had bugs before.
>
> That too may be easy to get wrong, but I couldn't for the life of me
> see any issues there.
>
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
>
> Yeah, the uptodate setting looked safe, particularly with that "if we
> copied less than we thought we would, and it wasn't uptodate, just
> claim we didn't do anything at all".
>
> That said, I now have a *new* suspect: the 'iter->pos' handling in
> iomap_write_iter().
>
> In particular, let's look at iomap_file_buffered_write(), which does:
>
>         while ((ret = iomap_iter(&iter, ops)) > 0)
>                 iter.processed = iomap_write_iter(&iter, i);
>
> and then look at what happens to iter.pos here.
>
> iomap_write_iter() does this:
>
>         loff_t pos = iter->pos;
>         ...
>         pos += status;
>
> but it never seems to write the updated position back to the iterator.
>
> So what happens next time iomap_write_iter() gets called?
>
> This looks like such a huge bug that I'm probably missing something,
> but I wonder if this is normally hidden by the fact that usually
> iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
> loop, it's actually effectively only called once.
>
> Except if it gets a short write due to an unhandled page fault..
>
> Am I entirely blind, and that 'iter.pos' is updated somewhere and I
> just missed it?

That's happening in iomap_file_buffered_write() and iomap_iter():

        while ((ret = iomap_iter(&iter, ops)) > 0)
                iter.processed = iomap_write_iter(&iter, i);

Here, iomap_write_iter() returns how much progress it has made, which
is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
then updates iter.pos and iter.len based on iter.processed.

Andreas


2022-05-04 13:44:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <[email protected]> wrote:
>
> More testing still ongoing, but the following patch seems to fix the
> data corruption.

Fingers crossed.

> + truncate_pagecache_range(inode, hstart, hend - 1);
> + if (hstart < hend)
> + punch_hole(ip, hstart, hend - hstart);

Why doesn't that "hstart < hend" condition cover both the truncate and
the hole punch?

Linus

2022-05-05 11:38:41

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] gfs2 fix

On Wed, May 4, 2022 at 12:42 AM Linus Torvalds
<[email protected]> wrote:
> On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > More testing still ongoing, but the following patch seems to fix the
> > data corruption.
>
> Fingers crossed.

It turns out that crossing fingers wasn't enough and we still get
corruption, but less frequently than before. We're going in the right
direction.

My working theory is that this is due to a subtle bug in the hole
punching done by gfs2_iomap_end() to get rid of unused blocks. With
the test case that fails, gfs2_iomap_end() is punching holes way more
often than I was expecting, and way more often than it should.
Remember that the test case is doing 32-MiB writes of a user buffer
that usually isn't entirely in memory. The first
iomap_file_buffered_write() allocates filesystem blocks for the entire
buffer, and when it finds that it could only do a partial write, it
frees a large part of those blocks. It will then call
fault_in_iov_iter_readable() for the next chunk, and the next call to
iomap_file_buffered_write() will then usually be able to write that
chunk entirely.

So it seems that we should always call fault_in_iov_iter_readable()
before calling into iomap_file_buffered_write(). This will probably
hide whatever is going wrong in punch_hole(), but we'll get to that
later ...

(Side note: the chunk size should be aligned to the page cache, not to
the iov_iter as in the current code.)

> > + truncate_pagecache_range(inode, hstart, hend - 1);
> > + if (hstart < hend)
> > + punch_hole(ip, hstart, hend - hstart);
>
> Why doesn't that "hstart < hend" condition cover both the truncate and
> the hole punch?

That was a leftover from a previous experiment in which I did the
truncate_pagecache_range() on the unaligned boundaries, which turned
out to be pointless. I'll clean that up.

Thanks,
Andreas