2017-09-21 15:48:51

by Eric Whitney

[permalink] [raw]
Subject: generic/232 test failures on 4.14-rc1

I'm seeing generic/232 fail from time to time when running a 4.14-rc1 kernel
on xfstest-bld's most recent kvm-xfstests test appliance. In one set of
trials, it failed in the same manner 4 out of 10 times when running the 4k test
configuration for ext4.

The failure bisects to "quota: Do not acquire dqio_sem for dquot overwrites in
v2 format" (ab2b86360f6e). When this patch was reverted in a 4.14-rc1 kernel,
the failure did not reoccur in a series of 20 trials.

Example output from the failed test:

QA output created by 232

Testing fsstress

seed = S
Comparing user usage
218a219
> #3740 -- 4 0 0 1 0 0
245a247
> #45 -- 0 0 0 1 0 0

Note: I'm also seeing a similar failure for generic/233, but the patch
containing the root cause likely comes somewhere after ab2b86360f6e. I'll post
another bug report once I locate it.

Thanks,
Eric


2017-09-25 13:59:48

by Jan Kara

[permalink] [raw]
Subject: Re: generic/232 test failures on 4.14-rc1

On Thu 21-09-17 11:48:46, Eric Whitney wrote:
> I'm seeing generic/232 fail from time to time when running a 4.14-rc1 kernel
> on xfstest-bld's most recent kvm-xfstests test appliance. In one set of
> trials, it failed in the same manner 4 out of 10 times when running the 4k test
> configuration for ext4.
>
> The failure bisects to "quota: Do not acquire dqio_sem for dquot overwrites in
> v2 format" (ab2b86360f6e). When this patch was reverted in a 4.14-rc1 kernel,
> the failure did not reoccur in a series of 20 trials.

Thanks for debugging this! I'd just note that the commit hash of that
change is different for me - d2faa415166b2883428efa92f451774ef44373ac.

> Example output from the failed test:
>
> QA output created by 232
>
> Testing fsstress
>
> seed = S
> Comparing user usage
> 218a219
> > #3740 -- 4 0 0 1 0 0
> 245a247
> > #45 -- 0 0 0 1 0 0
>
> Note: I'm also seeing a similar failure for generic/233, but the patch
> containing the root cause likely comes somewhere after ab2b86360f6e. I'll post
> another bug report once I locate it.

I'll try to debug this further. Thanks for report!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-26 12:58:33

by Jan Kara

[permalink] [raw]
Subject: Re: generic/232 test failures on 4.14-rc1

On Mon 25-09-17 15:59:46, Jan Kara wrote:
> On Thu 21-09-17 11:48:46, Eric Whitney wrote:
> > I'm seeing generic/232 fail from time to time when running a 4.14-rc1 kernel
> > on xfstest-bld's most recent kvm-xfstests test appliance. In one set of
> > trials, it failed in the same manner 4 out of 10 times when running the 4k test
> > configuration for ext4.
> >
> > The failure bisects to "quota: Do not acquire dqio_sem for dquot overwrites in
> > v2 format" (ab2b86360f6e). When this patch was reverted in a 4.14-rc1 kernel,
> > the failure did not reoccur in a series of 20 trials.
>
> Thanks for debugging this! I'd just note that the commit hash of that
> change is different for me - d2faa415166b2883428efa92f451774ef44373ac.
>
> > Example output from the failed test:
> >
> > QA output created by 232
> >
> > Testing fsstress
> >
> > seed = S
> > Comparing user usage
> > 218a219
> > > #3740 -- 4 0 0 1 0 0
> > 245a247
> > > #45 -- 0 0 0 1 0 0
> >
> > Note: I'm also seeing a similar failure for generic/233, but the patch
> > containing the root cause likely comes somewhere after ab2b86360f6e. I'll post
> > another bug report once I locate it.
>
> I'll try to debug this further. Thanks for report!

Attached patch fixes the problem for me. I'll merge it through my tree.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (1.42 kB)
0001-quota-Fix-quota-corruption-with-generic-232-test.patch (1.83 kB)
Download all attachments

2017-09-26 21:41:56

by Eric Whitney

[permalink] [raw]
Subject: Re: generic/232 test failures on 4.14-rc1

* Jan Kara <[email protected]>:
> On Mon 25-09-17 15:59:46, Jan Kara wrote:
> > On Thu 21-09-17 11:48:46, Eric Whitney wrote:
> > > I'm seeing generic/232 fail from time to time when running a 4.14-rc1 kernel
> > > on xfstest-bld's most recent kvm-xfstests test appliance. In one set of
> > > trials, it failed in the same manner 4 out of 10 times when running the 4k test
> > > configuration for ext4.
> > >
> > > The failure bisects to "quota: Do not acquire dqio_sem for dquot overwrites in
> > > v2 format" (ab2b86360f6e). When this patch was reverted in a 4.14-rc1 kernel,
> > > the failure did not reoccur in a series of 20 trials.
> >
> > Thanks for debugging this! I'd just note that the commit hash of that
> > change is different for me - d2faa415166b2883428efa92f451774ef44373ac.
> >
> > > Example output from the failed test:
> > >
> > > QA output created by 232
> > >
> > > Testing fsstress
> > >
> > > seed = S
> > > Comparing user usage
> > > 218a219
> > > > #3740 -- 4 0 0 1 0 0
> > > 245a247
> > > > #45 -- 0 0 0 1 0 0
> > >
> > > Note: I'm also seeing a similar failure for generic/233, but the patch
> > > containing the root cause likely comes somewhere after ab2b86360f6e. I'll post
> > > another bug report once I locate it.
> >
> > I'll try to debug this further. Thanks for report!
>
> Attached patch fixes the problem for me. I'll merge it through my tree.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

> From a0ae41c2a9c204374eafd24a928e4352841bd905 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Tue, 26 Sep 2017 10:36:05 +0200
> Subject: [PATCH] quota: Fix quota corruption with generic/232 test
>
> Eric has reported that since commit d2faa415166b "quota: Do not acquire
> dqio_sem for dquot overwrites in v2 format" test generic/232
> occasionally fails due to quota information being incorrect. Indeed that
> commit was too eager to remove dqio_sem completely from the path that
> just overwrites quota structure with updated information. Although that
> is innocent on its own, another process that inserts new quota structure
> to the same block can perform read-modify-write cycle of that block thus
> effectively discarding quota information update if they race in a wrong
> way.
>
> Fix the problem by acquiring dqio_sem for reading for overwrites of
> quota structure. Note that it *is* possible to completely avoid taking
> dqio_sem in the overwrite path however that will require modifying path
> inserting / deleting quota structures to avoid RMW cycles of the full
> block and for now it is not clear whether it is worth the hassle.
>
> Fixes: d2faa415166b2883428efa92f451774ef44373ac
> Reported-by: Eric Whitney <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/quota/quota_v2.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
> index c0187cda2c1e..a73e5b34db41 100644
> --- a/fs/quota/quota_v2.c
> +++ b/fs/quota/quota_v2.c
> @@ -328,12 +328,16 @@ static int v2_write_dquot(struct dquot *dquot)
> if (!dquot->dq_off) {
> alloc = true;
> down_write(&dqopt->dqio_sem);
> + } else {
> + down_read(&dqopt->dqio_sem);
> }
> ret = qtree_write_dquot(
> sb_dqinfo(dquot->dq_sb, dquot->dq_id.type)->dqi_priv,
> dquot);
> if (alloc)
> up_write(&dqopt->dqio_sem);
> + else
> + up_read(&dqopt->dqio_sem);
> return ret;
> }
>
> --
> 2.12.3
>

Hi Honza:

That patch works for me - 100 out of 100 trials of generic/232 passed
successfully running a modified 4.14-rc1 kernel on kvm-xfstests' ext4 4k
test configuration.

Tested-by: Eric Whitney <[email protected]>

Thanks!
Eric

2017-09-27 01:19:39

by Darrick J. Wong

[permalink] [raw]
Subject: Re: generic/232 test failures on 4.14-rc1

[cc xfs list]

On Tue, Sep 26, 2017 at 02:58:31PM +0200, Jan Kara wrote:
> On Mon 25-09-17 15:59:46, Jan Kara wrote:
> > On Thu 21-09-17 11:48:46, Eric Whitney wrote:
> > > I'm seeing generic/232 fail from time to time when running a 4.14-rc1 kernel
> > > on xfstest-bld's most recent kvm-xfstests test appliance. In one set of
> > > trials, it failed in the same manner 4 out of 10 times when running the 4k test
> > > configuration for ext4.
> > >
> > > The failure bisects to "quota: Do not acquire dqio_sem for dquot overwrites in
> > > v2 format" (ab2b86360f6e). When this patch was reverted in a 4.14-rc1 kernel,
> > > the failure did not reoccur in a series of 20 trials.
> >
> > Thanks for debugging this! I'd just note that the commit hash of that
> > change is different for me - d2faa415166b2883428efa92f451774ef44373ac.
> >
> > > Example output from the failed test:
> > >
> > > QA output created by 232
> > >
> > > Testing fsstress
> > >
> > > seed = S
> > > Comparing user usage
> > > 218a219
> > > > #3740 -- 4 0 0 1 0 0
> > > 245a247
> > > > #45 -- 0 0 0 1 0 0
> > >
> > > Note: I'm also seeing a similar failure for generic/233, but the patch
> > > containing the root cause likely comes somewhere after ab2b86360f6e. I'll post
> > > another bug report once I locate it.
> >
> > I'll try to debug this further. Thanks for report!
>
> Attached patch fixes the problem for me. I'll merge it through my tree.

Hi Jan,

Ever since 4.14-rc1, I've noticed the same problem (intermittent
failures of generic/{232,233,270}) that Eric Whitney was complaining
about when running xfstests against XFS. I'll try a proper bisect
tomorrow, but given the big locking rework I wonder if that rings any
bells for you?

--D

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

> From a0ae41c2a9c204374eafd24a928e4352841bd905 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Tue, 26 Sep 2017 10:36:05 +0200
> Subject: [PATCH] quota: Fix quota corruption with generic/232 test
>
> Eric has reported that since commit d2faa415166b "quota: Do not acquire
> dqio_sem for dquot overwrites in v2 format" test generic/232
> occasionally fails due to quota information being incorrect. Indeed that
> commit was too eager to remove dqio_sem completely from the path that
> just overwrites quota structure with updated information. Although that
> is innocent on its own, another process that inserts new quota structure
> to the same block can perform read-modify-write cycle of that block thus
> effectively discarding quota information update if they race in a wrong
> way.
>
> Fix the problem by acquiring dqio_sem for reading for overwrites of
> quota structure. Note that it *is* possible to completely avoid taking
> dqio_sem in the overwrite path however that will require modifying path
> inserting / deleting quota structures to avoid RMW cycles of the full
> block and for now it is not clear whether it is worth the hassle.
>
> Fixes: d2faa415166b2883428efa92f451774ef44373ac
> Reported-by: Eric Whitney <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/quota/quota_v2.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
> index c0187cda2c1e..a73e5b34db41 100644
> --- a/fs/quota/quota_v2.c
> +++ b/fs/quota/quota_v2.c
> @@ -328,12 +328,16 @@ static int v2_write_dquot(struct dquot *dquot)
> if (!dquot->dq_off) {
> alloc = true;
> down_write(&dqopt->dqio_sem);
> + } else {
> + down_read(&dqopt->dqio_sem);
> }
> ret = qtree_write_dquot(
> sb_dqinfo(dquot->dq_sb, dquot->dq_id.type)->dqi_priv,
> dquot);
> if (alloc)
> up_write(&dqopt->dqio_sem);
> + else
> + up_read(&dqopt->dqio_sem);
> return ret;
> }
>
> --
> 2.12.3
>

2017-09-27 09:42:05

by Jan Kara

[permalink] [raw]
Subject: Re: generic/232 test failures on 4.14-rc1

On Tue 26-09-17 17:41:52, Eric Whitney wrote:
> Hi Honza:
>
> That patch works for me - 100 out of 100 trials of generic/232 passed
> successfully running a modified 4.14-rc1 kernel on kvm-xfstests' ext4 4k
> test configuration.
>
> Tested-by: Eric Whitney <[email protected]>

Thanks for testing!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-27 09:42:06

by Jan Kara

[permalink] [raw]
Subject: Re: generic/232 test failures on 4.14-rc1

Hi Darrick!

On Tue 26-09-17 18:19:29, Darrick J. Wong wrote:
> On Tue, Sep 26, 2017 at 02:58:31PM +0200, Jan Kara wrote:
> > On Mon 25-09-17 15:59:46, Jan Kara wrote:
> > > On Thu 21-09-17 11:48:46, Eric Whitney wrote:
> > > > I'm seeing generic/232 fail from time to time when running a 4.14-rc1 kernel
> > > > on xfstest-bld's most recent kvm-xfstests test appliance. In one set of
> > > > trials, it failed in the same manner 4 out of 10 times when running the 4k test
> > > > configuration for ext4.
> > > >
> > > > The failure bisects to "quota: Do not acquire dqio_sem for dquot overwrites in
> > > > v2 format" (ab2b86360f6e). When this patch was reverted in a 4.14-rc1 kernel,
> > > > the failure did not reoccur in a series of 20 trials.
> > >
> > > Thanks for debugging this! I'd just note that the commit hash of that
> > > change is different for me - d2faa415166b2883428efa92f451774ef44373ac.
> > >
> > > > Example output from the failed test:
> > > >
> > > > QA output created by 232
> > > >
> > > > Testing fsstress
> > > >
> > > > seed = S
> > > > Comparing user usage
> > > > 218a219
> > > > > #3740 -- 4 0 0 1 0 0
> > > > 245a247
> > > > > #45 -- 0 0 0 1 0 0
> > > >
> > > > Note: I'm also seeing a similar failure for generic/233, but the patch
> > > > containing the root cause likely comes somewhere after ab2b86360f6e. I'll post
> > > > another bug report once I locate it.
> > >
> > > I'll try to debug this further. Thanks for report!
> >
> > Attached patch fixes the problem for me. I'll merge it through my tree.
>
> Ever since 4.14-rc1, I've noticed the same problem (intermittent
> failures of generic/{232,233,270}) that Eric Whitney was complaining
> about when running xfstests against XFS. I'll try a proper bisect
> tomorrow, but given the big locking rework I wonder if that rings any
> bells for you?

Hum, no idea. XFS uses its own thing for quotas so my changes don't
influence it at all. I don't know much about XFS internal quota
implementation so I don't have a good guess what could have caused the
breakage you see. I'm sorry.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR