As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
running the 1k test case using kvm-xfstests. I was then able to bisect the
failure to a patch landed in the -rc1 merge window:
(bd8a1f3655a7) mm/filemap: support readpage splitting a page
Typical test output resulting from a failure looks like:
QA output created by 418
+cmpbuf: offset 0: Expected: 0x1, got 0x0
+[6:0] FAIL - comparison failed, offset 3072
+diotest -w -b 512 -n 8 -i 4 failed at loop 0
Silence is golden
...
I've also been able to reproduce the failure on -rc3 in the 4k test case as
well. The failure frequency there was 10 out of 100 runs. It was anywhere
from 2 to 8 failures out of 100 runs in the 1k case.
So, the failure isn't dependent upon block size less than page size.
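(For reference, the sort of command used to drive these runs - the exact
config names and repeat-count flag depend on the local xfstests-bld setup -
is something like:

kvm-xfstests -c ext4/1k generic/418
kvm-xfstests -c ext4/1k -C 100 generic/418

the latter to collect failure counts over 100 trials.)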
Eric
On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> running the 1k test case using kvm-xfstests. I was then able to bisect the
> failure to a patch landed in the -rc1 merge window:
>
> (bd8a1f3655a7) mm/filemap: support readpage splitting a page
>
> Typical test output resulting from a failure looks like:
>
> QA output created by 418
> +cmpbuf: offset 0: Expected: 0x1, got 0x0
> +[6:0] FAIL - comparison failed, offset 3072
> +diotest -w -b 512 -n 8 -i 4 failed at loop 0
> Silence is golden
> ...
>
> I've also been able to reproduce the failure on -rc3 in the 4k test case as
> well. The failure frequency there was 10 out of 100 runs. It was anywhere
> from 2 to 8 failures out of 100 runs in the 1k case.
FWIW, testing on an -rc2 based kernel (ext4.git's tip), I wasn't able
to see a failure with gce-xfstests using the ext4/4k, ext4/1k, and
xfs/1k test scenarios. This may be because of the I/O timing for the
persistent disk block device in GCE, or differences in the number of
CPUs or amount of memory available --- or in the kernel configuration
used to build it.
I'm currently retrying with -rc3, with and without the kernel debug
configs, to see if that makes any difference...
- Ted
On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> running the 1k test case using kvm-xfstests. I was then able to bisect the
> failure to a patch landed in the -rc1 merge window:
>
> (bd8a1f3655a7) mm/filemap: support readpage splitting a page
Thanks for letting me know. This failure is new to me.
I don't understand it; this patch changes the behaviour of buffered reads
from waiting on a page with a refcount held to waiting on a page without
the refcount held, then starting the lookup from scratch once the page
is unlocked. I find it hard to believe this introduces a /new/ failure.
Either it makes an existing failure easier to hit, or there's a subtle
bug in the retry logic that I'm not seeing.
> Typical test output resulting from a failure looks like:
>
> QA output created by 418
> +cmpbuf: offset 0: Expected: 0x1, got 0x0
> +[6:0] FAIL - comparison failed, offset 3072
> +diotest -w -b 512 -n 8 -i 4 failed at loop 0
> Silence is golden
> ...
>
> I've also been able to reproduce the failure on -rc3 in the 4k test case as
> well. The failure frequency there was 10 out of 100 runs. It was anywhere
> from 2 to 8 failures out of 100 runs in the 1k case.
>
> So, the failure isn't dependent upon block size less than page size.
That's a good data point. I'll take a look at g/418 and see if I can
figure out what race we're hitting. Nice that it happens so often.
I suppose I could get you to put some debugging in -- maybe dumping the
page if we hit a contended case, then again if we're retrying?
I presume it doesn't always happen at the same offset or anything
convenient like that.
* Matthew Wilcox <[email protected]>:
> On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > failure to a patch landed in the -rc1 merge window:
> >
> > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
>
> Thanks for letting me know. This failure is new to me.
Sure - it's useful to know that it's new to you. Ted said he's also going
to test XFS with a large number of generic/418 trials, which would be a
useful comparison. However, he's had no luck as yet reproducing what I've
seen on his Google Compute Engine test setup running ext4.
>
> I don't understand it; this patch changes the behaviour of buffered reads
> from waiting on a page with a refcount held to waiting on a page without
> the refcount held, then starting the lookup from scratch once the page
> is unlocked. I find it hard to believe this introduces a /new/ failure.
> Either it makes an existing failure easier to hit, or there's a subtle
> bug in the retry logic that I'm not seeing.
>
To keep Murphy at bay, I'm rerunning the bisection from scratch just
to make sure I come out at the same patch. The initial bisection looked
clean, but when dealing with a failure that occurs probabilistically it's
easy enough to get it wrong. Is this patch revertible in -rc1 or -rc3?
Ordinarily I like to do that for confirmation.
And there's always the chance that a latent ext4 bug is being hit.
> > Typical test output resulting from a failure looks like:
> >
> > QA output created by 418
> > +cmpbuf: offset 0: Expected: 0x1, got 0x0
> > +[6:0] FAIL - comparison failed, offset 3072
> > +diotest -w -b 512 -n 8 -i 4 failed at loop 0
> > Silence is golden
> > ...
> >
> > I've also been able to reproduce the failure on -rc3 in the 4k test case as
> > well. The failure frequency there was 10 out of 100 runs. It was anywhere
> > from 2 to 8 failures out of 100 runs in the 1k case.
> >
> > So, the failure isn't dependent upon block size less than page size.
>
> That's a good data point. I'll take a look at g/418 and see if i can
> figure out what race we're hitting. Nice that it happens so often.
> I suppose I could get you to put some debugging in -- maybe dumping the
> page if we hit a contended case, then again if we're retrying?
>
> I presume it doesn't always happen at the same offset or anything
> convenient like that.
I'd be very happy to run whatever debugging patches you might want, though
you might want to wait until I've reproduced the bisection result. The
offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
running a file system with 4k blocks.
Thanks,
Eric
On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> * Matthew Wilcox <[email protected]>:
> > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > > failure to a patch landed in the -rc1 merge window:
> > >
> > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> >
> > Thanks for letting me know. This failure is new to me.
>
> Sure - it's useful to know that it's new to you. Ted said he's also going
> to test XFS with a large number of generic/418 trials which would be a
> useful comparison. However, he's had no luck as yet reproducing what I've
> seen on his Google compute engine test setup running ext4.
>
> >
> > I don't understand it; this patch changes the behaviour of buffered reads
> > from waiting on a page with a refcount held to waiting on a page without
> > the refcount held, then starting the lookup from scratch once the page
> > is unlocked. I find it hard to believe this introduces a /new/ failure.
> > Either it makes an existing failure easier to hit, or there's a subtle
> > bug in the retry logic that I'm not seeing.
> >
>
> For keeping Murphy at bay I'm rerunning the bisection from scratch just
> to make sure I come out at the same patch. The initial bisection looked
> clean, but when dealing with a failure that occurs probabilistically it's
> easy enough to get it wrong. Is this patch revertable in -rc1 or -rc3?
> Ordinarily I like to do that for confirmation.
Alas, not easily. I've built a lot on top of it since then. I could
probably come up with a moral reversion (and will have to if we can't
figure out why it's causing a problem!).
> And there's always the chance that a latent ext4 bug is being hit.
That would also be valuable information to find out. If this
patch is exposing a latent bug, I can't think what it might be.
> I'd be very happy to run whatever debugging patches you might want, though
> you might want to wait until I've reproduced the bisection result. The
> offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> running a file system with 4k blocks.
As I expected, but thank you for being willing to run debug patches.
I'll wait for you to confirm the bisection and then work up something
that'll help figure out what's going on.
* Matthew Wilcox <[email protected]>:
> On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> > * Matthew Wilcox <[email protected]>:
> > > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > > > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > > > failure to a patch landed in the -rc1 merge window:
> > > >
> > > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> > >
> > > Thanks for letting me know. This failure is new to me.
> >
> > Sure - it's useful to know that it's new to you. Ted said he's also going
> > to test XFS with a large number of generic/418 trials which would be a
> > useful comparison. However, he's had no luck as yet reproducing what I've
> > seen on his Google compute engine test setup running ext4.
> >
> > >
> > > I don't understand it; this patch changes the behaviour of buffered reads
> > > from waiting on a page with a refcount held to waiting on a page without
> > > the refcount held, then starting the lookup from scratch once the page
> > > is unlocked. I find it hard to believe this introduces a /new/ failure.
> > > Either it makes an existing failure easier to hit, or there's a subtle
> > > bug in the retry logic that I'm not seeing.
> > >
> >
> > For keeping Murphy at bay I'm rerunning the bisection from scratch just
> > to make sure I come out at the same patch. The initial bisection looked
> > clean, but when dealing with a failure that occurs probabilistically it's
> > easy enough to get it wrong. Is this patch revertable in -rc1 or -rc3?
> > Ordinarily I like to do that for confirmation.
>
> Alas, not easily. I've built a lot on top of it since then. I could
> probably come up with a moral reversion (and will have to if we can't
> figure out why it's causing a problem!)
>
> > And there's always the chance that a latent ext4 bug is being hit.
>
> That would also be valuable information to find out. If this
> patch is exposing a latent bug, I can't think what it might be.
>
> > I'd be very happy to run whatever debugging patches you might want, though
> > you might want to wait until I've reproduced the bisection result. The
> > offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> > running a file system with 4k blocks.
>
> As I expected, but thank you for being willing to run debug patches.
> I'll wait for you to confirm the bisection and then work up something
> that'll help figure out what's going on.
An update:
I've been able to rerun the bisection on a freshly installed test appliance,
using a kernel built from a new tree, on different test hardware. It ended
at the same patch - "mm/filemap: support readpage splitting a page".
I'd thought I'd been able to reproduce the failure using ext4's 4K block size,
but my test logs say otherwise, as do a few thousand trial runs of generic/418.
So, at the moment I can only say definitively that I've seen the failures when
running with 1k blocks (block size < page size).
I've been able to simplify generic/418 by reducing the set of test cases it
uses to just "-w", or direct I/O writes. When I modify the test to remove
just that test case and leave the remainder, I no longer see test failures.
A beneficial side effect of the simplification is that it reduces the running
time by an order of magnitude.
When it fails, it looks like the test writes a chunk of bytes via dio, all
of which have been set to 0x1, and then reads the same chunk back via dio
with all of those bytes equal to 0x0.
I'm attempting to create a simpler reproducer. generic/418 actually has a
number of processes simultaneously writing to and reading from disjoint
regions of the same file, and I'd like to see if the failure is dependent
upon that behavior, or whether it can be isolated to a single writer/reader.
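For illustration, a minimal sketch of such a reproducer might look like the
following (hypothetical and untested; the mount point, region size, and loop
count are placeholders, and it assumes a 1k-block ext4 file system on a
4k-page machine). Two processes each perform one extending dio write of 0x1
bytes to disjoint 1024-byte regions of a fresh file and then read their own
region back through the page cache, with error handling mostly trimmed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define REGION	1024	/* >= fs block size (1k), < page size (4k) */

/* One O_DIRECT write of 0x1 bytes at "off", then a buffered read-back. */
static int worker(const char *path, off_t off)
{
	char *wbuf, rbuf[REGION];
	int dfd, bfd, i;

	if (posix_memalign((void **)&wbuf, 4096, REGION))
		return 1;
	memset(wbuf, 0x1, REGION);

	dfd = open(path, O_WRONLY | O_DIRECT);
	bfd = open(path, O_RDONLY);		/* buffered fd for the read-back */
	if (dfd < 0 || bfd < 0)
		return 1;

	if (pwrite(dfd, wbuf, REGION, off) != REGION)	/* extending dio write */
		return 1;
	if (pread(bfd, rbuf, REGION, off) != REGION)	/* buffered read-back */
		return 1;

	for (i = 0; i < REGION; i++)
		if (rbuf[i] != 0x1) {
			fprintf(stderr, "offset %lld: expected 0x1, got 0x%x\n",
				(long long)off + i, (unsigned char)rbuf[i]);
			return 1;
		}

	close(dfd);
	close(bfd);
	free(wbuf);
	return 0;
}

int main(void)
{
	const char *path = "/mnt/test/dio-vs-buffered";	/* placeholder */
	int loop, status;

	for (loop = 0; loop < 10000; loop++) {
		unlink(path);
		close(open(path, O_CREAT | O_WRONLY, 0644));	/* fresh, empty file */

		pid_t pid = fork();
		if (pid == 0)			/* child writes/reads the second region */
			exit(worker(path, REGION));
		if (worker(path, 0))		/* parent writes/reads the first region */
			return 1;
		waitpid(pid, &status, 0);
		if (WEXITSTATUS(status)) {
			fprintf(stderr, "failed at loop %d\n", loop);
			return 1;
		}
	}
	return 0;
}

Whether this stripped-down form hits the window as readily as generic/418
does remains to be seen.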
Eric
On Mon, Mar 22, 2021 at 12:37:03PM -0400, Eric Whitney wrote:
> * Matthew Wilcox <[email protected]>:
> > On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> > > * Matthew Wilcox <[email protected]>:
> > > > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > > > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > > > > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > > > > failure to a patch landed in the -rc1 merge window:
> > > > >
> > > > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> > > >
> > > > Thanks for letting me know. This failure is new to me.
> > >
> > > Sure - it's useful to know that it's new to you. Ted said he's also going
> > > to test XFS with a large number of generic/418 trials which would be a
> > > useful comparison. However, he's had no luck as yet reproducing what I've
> > > seen on his Google compute engine test setup running ext4.
> > >
> > > >
> > > > I don't understand it; this patch changes the behaviour of buffered reads
> > > > from waiting on a page with a refcount held to waiting on a page without
> > > > the refcount held, then starting the lookup from scratch once the page
> > > > is unlocked. I find it hard to believe this introduces a /new/ failure.
> > > > Either it makes an existing failure easier to hit, or there's a subtle
> > > > bug in the retry logic that I'm not seeing.
> > > >
> > >
> > > For keeping Murphy at bay I'm rerunning the bisection from scratch just
> > > to make sure I come out at the same patch. The initial bisection looked
> > > clean, but when dealing with a failure that occurs probabilistically it's
> > > easy enough to get it wrong. Is this patch revertable in -rc1 or -rc3?
> > > Ordinarily I like to do that for confirmation.
> >
> > Alas, not easily. I've built a lot on top of it since then. I could
> > probably come up with a moral reversion (and will have to if we can't
> > figure out why it's causing a problem!)
> >
> > > And there's always the chance that a latent ext4 bug is being hit.
> >
> > That would also be valuable information to find out. If this
> > patch is exposing a latent bug, I can't think what it might be.
> >
> > > I'd be very happy to run whatever debugging patches you might want, though
> > > you might want to wait until I've reproduced the bisection result. The
> > > offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> > > running a file system with 4k blocks.
> >
> > As I expected, but thank you for being willing to run debug patches.
> > I'll wait for you to confirm the bisection and then work up something
> > that'll help figure out what's going on.
>
> An update:
>
> I've been able to rerun the bisection on a freshly installed test appliance
> using a kernel built from a new tree using different test hardware. It ended
> at the same patch - "mm/filemap: support readpage splitting a page".
>
> I'd thought I'd been able to reproduce the failure using ext4's 4K block size,
> but my test logs say otherwise, as do a few thousand trial runs of generic/418.
> So, at the moment I can only say definitively that I've seen the failures when
> running with 1k blocks (block size < page size).
>
> I've been able to simplify generic/418 by reducing the set of test cases it
> uses to just "-w", or direct I/O writes. When I modify the test to remove
> just that test case and leaving the remainder, I no longer see test failures.
> A beneficial side effect of the simplification is that it reduces the running
> time by an order of magnitude.
>
> When it fails, it looks like the test is writing via dio a chunk of bytes all
> of which have been set to 0x1 and then reading back via dio the same chunk with
> all those bytes equal to 0x0.
>
> I'm attempting to create a simpler reproducer. generic/418 actually has a
> number of processes simultaneously writing to and reading from disjoint
> regions of the same file, and I'd like to see if the failure is dependent
> upon that behavior, or whether it can be isolated to a single writer/reader.
I've managed to reproduce this bug here now. There are _two_ places
in xfstests that I need to specify "-b 1024" *grumble*.
You've definitely identified the correct commit. I can go back-and-forth
applying it and removing it, and the failure definitely tracks this
commit.
I think what's going on is that we have a dio write vs a buffered read.
That is, we're only specifying -w to diotest and not -w -r. So we have
a read-vs-invalidate situation.
What used to happen:
- Look up page in page cache, happens to be there
- Check page is uptodate (it isn't; another thread started IO that hasn't
finished yet)
- Wait for page to be unlocked when the read finishes
- See if page is now uptodate (it is)
- Copy data from it
What happens now:
- Look up page in page cache, happens to be there
- Check page is uptodate (it isn't; another thread started IO that hasn't
finished yet)
- Wait for page to be unlocked when the read finishes
- Go back and look up the page again
- Allocate a new page and call readpage
- Wait for IO to complete
- Copy data from it
So if there's a subtle race in ext4 between DIO writes and ->readpage,
this might be exposing it?
I think this theory is given more credence by going to the parent commit
(480546259811) and applying this:
+++ b/mm/filemap.c
@@ -2301,6 +2301,10 @@ static struct page *filemap_update_page(struct kiocb *iocb, struct file *filp,
put_page(page);
return ERR_PTR(error);
}
+ if (!page->mapping) {
+ put_page(page);
+ return NULL;
+ }
if (PageUptodate(page))
return page;
i.e., refusing to use a truncated page, even if it's still marked uptodate.
That reproduces the g/418 failure within a minute for me.
Note that if the read had come in a few instructions later, it wouldn't have
found a page in the page cache, so this is just widening the window in
which this race can happen.
On Sun 28-03-21 03:41:06, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 12:37:03PM -0400, Eric Whitney wrote:
> > * Matthew Wilcox <[email protected]>:
> > > On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> > > > * Matthew Wilcox <[email protected]>:
> > > > > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > > > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > > > > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > > > > > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > > > > > failure to a patch landed in the -rc1 merge window:
> > > > > >
> > > > > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> > > > >
> > > > > Thanks for letting me know. This failure is new to me.
> > > >
> > > > Sure - it's useful to know that it's new to you. Ted said he's also going
> > > > to test XFS with a large number of generic/418 trials which would be a
> > > > useful comparison. However, he's had no luck as yet reproducing what I've
> > > > seen on his Google compute engine test setup running ext4.
> > > >
> > > > >
> > > > > I don't understand it; this patch changes the behaviour of buffered reads
> > > > > from waiting on a page with a refcount held to waiting on a page without
> > > > > the refcount held, then starting the lookup from scratch once the page
> > > > > is unlocked. I find it hard to believe this introduces a /new/ failure.
> > > > > Either it makes an existing failure easier to hit, or there's a subtle
> > > > > bug in the retry logic that I'm not seeing.
> > > > >
> > > >
> > > > For keeping Murphy at bay I'm rerunning the bisection from scratch just
> > > > to make sure I come out at the same patch. The initial bisection looked
> > > > clean, but when dealing with a failure that occurs probabilistically it's
> > > > easy enough to get it wrong. Is this patch revertable in -rc1 or -rc3?
> > > > Ordinarily I like to do that for confirmation.
> > >
> > > Alas, not easily. I've built a lot on top of it since then. I could
> > > probably come up with a moral reversion (and will have to if we can't
> > > figure out why it's causing a problem!)
> > >
> > > > And there's always the chance that a latent ext4 bug is being hit.
> > >
> > > That would also be valuable information to find out. If this
> > > patch is exposing a latent bug, I can't think what it might be.
> > >
> > > > I'd be very happy to run whatever debugging patches you might want, though
> > > > you might want to wait until I've reproduced the bisection result. The
> > > > offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> > > > running a file system with 4k blocks.
> > >
> > > As I expected, but thank you for being willing to run debug patches.
> > > I'll wait for you to confirm the bisection and then work up something
> > > that'll help figure out what's going on.
> >
> > An update:
> >
> > I've been able to rerun the bisection on a freshly installed test appliance
> > using a kernel built from a new tree using different test hardware. It ended
> > at the same patch - "mm/filemap: support readpage splitting a page".
> >
> > I'd thought I'd been able to reproduce the failure using ext4's 4K block size,
> > but my test logs say otherwise, as do a few thousand trial runs of generic/418.
> > So, at the moment I can only say definitively that I've seen the failures when
> > running with 1k blocks (block size < page size).
> >
> > I've been able to simplify generic/418 by reducing the set of test cases it
> > uses to just "-w", or direct I/O writes. When I modify the test to remove
> > just that test case and leaving the remainder, I no longer see test failures.
> > A beneficial side effect of the simplification is that it reduces the running
> > time by an order of magnitude.
> >
> > When it fails, it looks like the test is writing via dio a chunk of bytes all
> > of which have been set to 0x1 and then reading back via dio the same chunk with
> > all those bytes equal to 0x0.
> >
> > I'm attempting to create a simpler reproducer. generic/418 actually has a
> > number of processes simultaneously writing to and reading from disjoint
> > regions of the same file, and I'd like to see if the failure is dependent
> > upon that behavior, or whether it can be isolated to a single writer/reader.
>
> I've managed to reproduce this bug here now. There are _two_ places
> in xfstests that I need to specify "-b 1024" *grumble*.
>
> You've definitely identified the correct commit. I can go back-and-forth
> applying it and removing it, and the failure definitely tracks this
> commit.
>
> I think what's going on is that we have a dio write vs a buffered read.
> That is, we're only specifying -w to diotest and not -w -r. So we have
> a read-vs-invalidate situation.
>
> What used to happen:
>
> - Look up page in page cache, happens to be there
> - Check page is uptodate (it isn't; another thread started IO that hasn't
> finished yet)
> - Wait for page to be unlocked when the read finishes
> - See if page is now uptodate (it is)
> - Copy data from it
>
> What happens now:
>
> - Look up page in page cache, happens to be there
> - Check page is uptodate (it isn't; another thread started IO that hasn't
> finished yet)
> - Wait for page to be unlocked when the read finishes
> - Go back and look up the page again
> - Allocate a new page and call readpage
> - Wait for IO to complete
> - Copy data from it
>
> So if there's a subtle race in ext4 between DIO writes and ->readpage,
> this might be exposing it?
>
> I think this theory is given more credence by going to the parent commit
> (480546259811) and applying this:
>
> +++ b/mm/filemap.c
> @@ -2301,6 +2301,10 @@ static struct page *filemap_update_page(struct kiocb *iocb, struct file *filp,
> put_page(page);
> return ERR_PTR(error);
> }
> + if (!page->mapping) {
> + put_page(page);
> + return NULL;
> + }
> if (PageUptodate(page))
> return page;
>
> ie refusing to use a truncated page, even if it's still marked uptodate.
> That reproduces the g/418 crash within a minute for me.
>
> Note that if the read had come in a few instructions later, it wouldn't've
> found a page in the page cache, so this is just widening the window in
> which this race can happen.
I was looking into this for a while after Eric mentioned this problem on
the ext4 concall, and I think I see where the problem is. It lies in the
handling of extending direct IO writes. For an extending dio write,
ext4_dio_write_iter() ends up doing:
iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
ext4_handle_inode_extension()
-> updates i_size
Now the problem is that i_size gets updated only after iomap_dio_rw() has
invalidated the page cache following the direct IO write. So we can race
like:
CPU1 - write at offset 1k                    CPU2 - read from offset 0
iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
                                             ext4_readpage();
ext4_handle_inode_extension()
We will read the page at index 0, still see i_size == 1k, and so leave the
page contents beyond 1k zeroed out in ext4_mpage_readpages(). But the page
is marked uptodate and there's nobody to invalidate it anymore, so a
following buffered read at offset 1k will read stale zeros.
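To make the window concrete, here is a rough userspace sketch of that
interleaving (hypothetical; the path, sizes, and loop count are placeholders,
it assumes a 1k-block file system with 4k pages, and error handling is
omitted). One thread hammers page 0 with buffered reads while the main
thread does the extending dio write at offset 1k and then reads that region
back through the page cache:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 1024				/* fs block size; page size assumed 4k */

static const char *path = "/mnt/test/extend-race";	/* placeholder path */
static volatile int stop;			/* good enough for a sketch */

/* CPU2 above: keep re-instantiating page 0 through buffered reads. */
static void *reader0(void *arg)
{
	char buf[BLK];
	int fd = open(path, O_RDONLY);

	(void)arg;
	while (!stop)
		pread(fd, buf, BLK, 0);
	close(fd);
	return NULL;
}

int main(void)
{
	char *wbuf, rbuf[BLK];
	pthread_t t;
	int loop;

	if (posix_memalign((void **)&wbuf, 4096, BLK))
		return 1;
	memset(wbuf, 0x1, BLK);

	for (loop = 0; loop < 100000; loop++) {
		int bfd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
		int dfd = open(path, O_WRONLY | O_DIRECT);

		pwrite(dfd, wbuf, BLK, 0);	/* bring i_size to exactly 1k */

		stop = 0;
		pthread_create(&t, NULL, reader0, NULL);

		pwrite(dfd, wbuf, BLK, BLK);	/* CPU1 above: extending dio write, 1k -> 2k */

		stop = 1;
		pthread_join(t, NULL);

		memset(rbuf, 0, BLK);
		pread(bfd, rbuf, BLK, BLK);	/* buffered read of the region just written */
		if (rbuf[0] != 0x1) {
			fprintf(stderr, "loop %d: stale zeros at offset %d\n",
				loop, BLK);
			return 1;
		}
		close(dfd);
		close(bfd);
	}
	return 0;
}

Build with -lpthread and point it at a scratch file system made with 1k
blocks; on a kernel without the enlarged window it is expected to just loop
silently.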
The cleanest solution would be to handle inode extension in
ext4_dio_write_end_io() to avoid this race. However it is not obvious how
to propagate all the information we need there... :-|
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
* Matthew Wilcox <[email protected]>:
> On Mon, Mar 22, 2021 at 12:37:03PM -0400, Eric Whitney wrote:
> > * Matthew Wilcox <[email protected]>:
> > > On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> > > > * Matthew Wilcox <[email protected]>:
> > > > > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > > > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > > > > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > > > > > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > > > > > failure to a patch landed in the -rc1 merge window:
> > > > > >
> > > > > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> > > > >
> > > > > Thanks for letting me know. This failure is new to me.
> > > >
> > > > Sure - it's useful to know that it's new to you. Ted said he's also going
> > > > to test XFS with a large number of generic/418 trials which would be a
> > > > useful comparison. However, he's had no luck as yet reproducing what I've
> > > > seen on his Google compute engine test setup running ext4.
> > > >
> > > > >
> > > > > I don't understand it; this patch changes the behaviour of buffered reads
> > > > > from waiting on a page with a refcount held to waiting on a page without
> > > > > the refcount held, then starting the lookup from scratch once the page
> > > > > is unlocked. I find it hard to believe this introduces a /new/ failure.
> > > > > Either it makes an existing failure easier to hit, or there's a subtle
> > > > > bug in the retry logic that I'm not seeing.
> > > > >
> > > >
> > > > For keeping Murphy at bay I'm rerunning the bisection from scratch just
> > > > to make sure I come out at the same patch. The initial bisection looked
> > > > clean, but when dealing with a failure that occurs probabilistically it's
> > > > easy enough to get it wrong. Is this patch revertable in -rc1 or -rc3?
> > > > Ordinarily I like to do that for confirmation.
> > >
> > > Alas, not easily. I've built a lot on top of it since then. I could
> > > probably come up with a moral reversion (and will have to if we can't
> > > figure out why it's causing a problem!)
> > >
> > > > And there's always the chance that a latent ext4 bug is being hit.
> > >
> > > That would also be valuable information to find out. If this
> > > patch is exposing a latent bug, I can't think what it might be.
> > >
> > > > I'd be very happy to run whatever debugging patches you might want, though
> > > > you might want to wait until I've reproduced the bisection result. The
> > > > offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> > > > running a file system with 4k blocks.
> > >
> > > As I expected, but thank you for being willing to run debug patches.
> > > I'll wait for you to confirm the bisection and then work up something
> > > that'll help figure out what's going on.
> >
> > An update:
> >
> > I've been able to rerun the bisection on a freshly installed test appliance
> > using a kernel built from a new tree using different test hardware. It ended
> > at the same patch - "mm/filemap: support readpage splitting a page".
> >
> > I'd thought I'd been able to reproduce the failure using ext4's 4K block size,
> > but my test logs say otherwise, as do a few thousand trial runs of generic/418.
> > So, at the moment I can only say definitively that I've seen the failures when
> > running with 1k blocks (block size < page size).
> >
> > I've been able to simplify generic/418 by reducing the set of test cases it
> > uses to just "-w", or direct I/O writes. When I modify the test to remove
> > just that test case and leaving the remainder, I no longer see test failures.
> > A beneficial side effect of the simplification is that it reduces the running
> > time by an order of magnitude.
> >
> > When it fails, it looks like the test is writing via dio a chunk of bytes all
> > of which have been set to 0x1 and then reading back via dio the same chunk with
> > all those bytes equal to 0x0.
> >
> > I'm attempting to create a simpler reproducer. generic/418 actually has a
> > number of processes simultaneously writing to and reading from disjoint
> > regions of the same file, and I'd like to see if the failure is dependent
> > upon that behavior, or whether it can be isolated to a single writer/reader.
>
> I've managed to reproduce this bug here now. There are _two_ places
> in xfstests that I need to specify "-b 1024" *grumble*.
>
> You've definitely identified the correct commit. I can go back-and-forth
> applying it and removing it, and the failure definitely tracks this
> commit.
>
> I think what's going on is that we have a dio write vs a buffered read.
> That is, we're only specifying -w to diotest and not -w -r. So we have
> a read-vs-invalidate situation.
Sorry about this - when writing this up, I was misled by a buggy comment
in dio-invalidate-cache.c which states that the reads are direct reads.
I've only seen the failure occur with a dio write to a region followed by
a buffered read of the same region. A dio write followed by a dio read
always returns the written data (many thousands of test runs support this
claim).
In addition, if any of generic/418's preallocation options (-F, -p, -t) are
specified, the test does not fail even with the combination of a dio write
followed by a buffered read. This is the first suggestion that ext4 might
be at fault.
With some work to simplify the test, the description of the problem looks like:
1) test on file system with block size < page size (1k on x86_64, say)
2) two processes do one dio write each to two disjoint regions of the same
size in a freshly created file
3) each process does a buffered read of the region it wrote for verification
4) if the region written is less than the file system block size (512 bytes
on a 1k file system, first write at offset 0, second at offset 512), the
test does not fail
5) if the region written is equal to or greater than the system page size,
the test does not fail
6) if the region written is less than the system page size, and equal to or
greater than the file system block size (1024 bytes on 1k file system, say,
first write at offset 0, second write at offset 1024), the test can (but
does not always) fail
7) in my data, it's always the second process writing at the non-zero offset
that sees the failure
8) failure rate is around 1% on my test system
So, point 6) also suggests the problem might be due to ext4.
>
> What used to happen:
>
> - Look up page in page cache, happens to be there
> - Check page is uptodate (it isn't; another thread started IO that hasn't
> finished yet)
> - Wait for page to be unlocked when the read finishes
> - See if page is now uptodate (it is)
> - Copy data from it
>
> What happens now:
>
> - Look up page in page cache, happens to be there
> - Check page is uptodate (it isn't; another thread started IO that hasn't
> finished yet)
> - Wait for page to be unlocked when the read finishes
> - Go back and look up the page again
> - Allocate a new page and call readpage
> - Wait for IO to complete
> - Copy data from it
>
> So if there's a subtle race in ext4 between DIO writes and ->readpage,
> this might be exposing it?
>
> I think this theory is given more credence by going to the parent commit
> (480546259811) and applying this:
>
> +++ b/mm/filemap.c
> @@ -2301,6 +2301,10 @@ static struct page *filemap_update_page(struct kiocb *iocb, struct file *filp,
> put_page(page);
> return ERR_PTR(error);
> }
> + if (!page->mapping) {
> + put_page(page);
> + return NULL;
> + }
> if (PageUptodate(page))
> return page;
>
> ie refusing to use a truncated page, even if it's still marked uptodate.
> That reproduces the g/418 crash within a minute for me.
>
> Note that if the read had come in a few instructions later, it wouldn't've
> found a page in the page cache, so this is just widening the window in
> which this race can happen.
Yes, it seems likely your patch simply enlarged a race window somewhere in
ext4's dio implementation. In parallel with this posting, Jan has noted
he's got a possible explanation. So, we'll pursue that direction next.
Matthew - thanks very much for your time and effort in reproducing this
failure and considering it in the context of your patch!
Eric
On 21/03/18 02:16PM, Eric Whitney wrote:
> As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> running the 1k test case using kvm-xfstests. I was then able to bisect the
> failure to a patch landed in the -rc1 merge window:
>
> (bd8a1f3655a7) mm/filemap: support readpage splitting a page
>
> Typical test output resulting from a failure looks like:
>
> QA output created by 418
> +cmpbuf: offset 0: Expected: 0x1, got 0x0
> +[6:0] FAIL - comparison failed, offset 3072
> +diotest -w -b 512 -n 8 -i 4 failed at loop 0
> Silence is golden
> ...
>
> I've also been able to reproduce the failure on -rc3 in the 4k test case as
> well. The failure frequency there was 10 out of 100 runs. It was anywhere
> from 2 to 8 failures out of 100 runs in the 1k case.
Ok, I kept the test running overnight on PPC64 since, as you mentioned,
the reproducibility rate is low.
./check -I 1000 tests/generic/418    (this stops the run as soon as we hit a failure)
I hit the failure on the 8th iteration of the test.
Note this is a 4k block size on a 64K page size.
<logs>
======
SECTION -- ext4_4k
FSTYP -- ext4
PLATFORM -- Linux/ppc64le qemu 5.12.0-rc2-00313-gae9fda3a723 #66 SMP Sun Mar 14 23:05:24 CDT 2021
MKFS_OPTIONS -- -F -b4096 -I 256 -O 64bit /dev/loop3
MOUNT_OPTIONS -- -o block_validity /dev/loop3 /mnt1/scratch
generic/418 320s ... - output mismatch (see /home/qemu/work-tools/xfstests/results//ext4_4k/generic/418.out.bad)
--- tests/generic/418.out 2020-08-04 09:59:08.658307281 +0000
+++ /home/qemu/work-tools/xfstests/results//ext4_4k/generic/418.out.bad 2021-04-01 18:27:47.498465793 +0000
@@ -1,2 +1,5 @@
QA output created by 418
+cmpbuf: offset 0: Expected: 0x1, got 0x0
+[1:0] FAIL - comparison failed, offset 32768
+diotest -w -b 32768 -n 3 -i 1 failed at loop 6
Silence is golden
Ran: generic/418
Failures: generic/418
Failed 1 of 1 tests
-ritesh