2015-08-15 19:00:00

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: ratelimit the file system mounted message

The xfstests ext4/305 will mount and unmount the same file system over
4,000 times, and each one of these will cause a system log message.
Ratelimit this message since if we are getting more than a few dozen
of these messages, they probably aren't going to be helpful.

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/super.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 70c52ea..45658c1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -60,6 +60,7 @@ static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
static int ext4_mballoc_ready;
+static struct ratelimit_state ext4_mount_msg_ratelimit;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -4277,9 +4278,10 @@ no_journal:
"the device does not support discard");
}

- ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
- "Opts: %s%s%s", descr, sbi->s_es->s_mount_opts,
- *sbi->s_es->s_mount_opts ? "; " : "", orig_data);
+ if (___ratelimit(&ext4_mount_msg_ratelimit, "EXT4-fs mount"))
+ ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
+ "Opts: %s%s%s", descr, sbi->s_es->s_mount_opts,
+ *sbi->s_es->s_mount_opts ? "; " : "", orig_data);

if (es->s_error_count)
mod_timer(&sbi->s_err_report, jiffies + 300*HZ); /* 5 minutes */
@@ -5617,6 +5619,7 @@ static int __init ext4_init_fs(void)
{
int i, err;

+ ratelimit_state_init(&ext4_mount_msg_ratelimit, 30 * HZ, 64);
ext4_li_info = NULL;
mutex_init(&ext4_li_mtx);

--
2.5.0



2015-08-17 01:13:20

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] ext4: ratelimit the file system mounted message

On Sat, Aug 15, 2015 at 02:59:57PM -0400, Theodore Ts'o wrote:
> The xfstests ext4/305 will mount and unmount the same file system over
> 4,000 times, and each one of these will cause a system log message.
> Ratelimit this message since if we are getting more than a few dozen
> of these messages, they probably aren't going to be helpful.

Perhaps you should look at fixing the test or making it a more
targetted reproducer. Tests that do "loop doing something basic
while looping doing something else basic for X minutes to try to
trip a race condition" aren't very good regression tests....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-08-17 14:50:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: ratelimit the file system mounted message

On Mon, Aug 17, 2015 at 11:12:15AM +1000, Dave Chinner wrote:
> On Sat, Aug 15, 2015 at 02:59:57PM -0400, Theodore Ts'o wrote:
> > The xfstests ext4/305 will mount and unmount the same file system over
> > 4,000 times, and each one of these will cause a system log message.
> > Ratelimit this message since if we are getting more than a few dozen
> > of these messages, they probably aren't going to be helpful.
>
> Perhaps you should look at fixing the test or making it a more
> targetted reproducer. Tests that do "loop doing something basic
> while looping doing something else basic for X minutes to try to
> trip a race condition" aren't very good regression tests....

The problem what we are specifically testing is a race where one
process is reading from a proc fs file while the file system is being
unmounted:

commit f7922730727844c6dee837bd1a64221342fef1d1
Author: Eryu Guan <[email protected]>
Date: Mon Apr 1 10:57:43 2013 +0000

xfstests ext4 305: test read /proc/fs/ext4/<dev>/mb_groups while the fs is being unmounted

Regression test for commit:
9559996 ext4: remove mb_groups before tearing down the buddy_cache

Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Rich Johnston <[email protected]>
[[email protected] renumbered test to next in group sequence]
Signed-off-by: Rich Johnston <[email protected]>

I don't see a better way of doing the test off the top of my head,
though.... and to be honest I'm not sure how much value the test
really has, since it's the sort of thing that can easily be checked by
inspection, and it seems rather unlikely we would regress here.

BTW, out of curiosity I reverted 9559996 and tried running ext4/305
many times, on a variety of different VM's ranging from 1 to 8 CPU's,
and using both a SSD and a laptop HDD.

In all cases, ext3/305 reliably reproduced the failure within 30
mount/unmount cycles, and in most cases, under a dozen cycles. (i.e.,
under two seconds, and usually in a fraction of a second). So I'm not
entirely sure why the test was written to run the loop for 3 minutes
and thousands of mount/unmount cycles.

Eryu, you wrote the test; any thoughts? At the very least I'd suggest
cutting the test down so that it runs for at most 2 seconds, if for no
other reason than to speed up regression test runs.

- Ted


2015-08-17 23:07:52

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] ext4: ratelimit the file system mounted message

[cc [email protected] as we are talking about the test rather
than the kernel behaviour. ]

On Mon, Aug 17, 2015 at 10:50:56AM -0400, Theodore Ts'o wrote:
> On Mon, Aug 17, 2015 at 11:12:15AM +1000, Dave Chinner wrote:
> > On Sat, Aug 15, 2015 at 02:59:57PM -0400, Theodore Ts'o wrote:
> > > The xfstests ext4/305 will mount and unmount the same file system over
> > > 4,000 times, and each one of these will cause a system log message.
> > > Ratelimit this message since if we are getting more than a few dozen
> > > of these messages, they probably aren't going to be helpful.
> >
> > Perhaps you should look at fixing the test or making it a more
> > targetted reproducer. Tests that do "loop doing something basic
> > while looping doing something else basic for X minutes to try to
> > trip a race condition" aren't very good regression tests....
>
> The problem what we are specifically testing is a race where one
> process is reading from a proc fs file while the file system is being
> unmounted:

*nod*. It's like trying to swat a fly with a sledge hammer: if you
get lucky you might shave the fly, but most of the time you're going
to miss.

> I don't see a better way of doing the test off the top of my head,
> though.... and to be honest I'm not sure how much value the test
> really has, since it's the sort of thing that can easily be checked by
> inspection, and it seems rather unlikely we would regress here.

Yup. A lot of regression tests get written to tick a process box
(i.e. did we fix regression X?), not because they provide on-going
value to prevent future regressions. I try to push back against
tests that won't provide us with useful protection against future
regressions....

FWIW, if we need to trigger a specific race in XFS for testing
purposes we've historically added debug code to add a delay in the
kernel code to allow the race condition to trigger. e.g.
tests/xfs/051 pokes a sysfs entry that only exists on
CONFIG_XFS_DEBUG=y builds to delay log recovery so that we can
trigger IO errors via dm-flakey while log recovery is being
performed.

> BTW, out of curiosity I reverted 9559996 and tried running ext4/305
> many times, on a variety of different VM's ranging from 1 to 8 CPU's,
> and using both a SSD and a laptop HDD.
>
> In all cases, ext3/305 reliably reproduced the failure within 30
> mount/unmount cycles, and in most cases, under a dozen cycles. (i.e.,
> under two seconds, and usually in a fraction of a second). So I'm not
> entirely sure why the test was written to run the loop for 3 minutes
> and thousands of mount/unmount cycles.

There were lots of tests being written at the time that used a 3
minute timeout. It's another of those red flags that I tend to
push back on these days, and this is an example of why - usually the
problem can be hit very quickly, or the test is extremely unreliable
and long runtime is the only way to trigger the race. Hence
running for X minutes doesn't really prove anything....

> Eryu, you wrote the test; any thoughts? At the very least I'd suggest
> cutting the test down so that it runs for at most 2 seconds, if for no
> other reason than to speed up regression test runs.

Rather than time limiting, how about bounding the number of
mount/unmount cycles?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-08-18 01:18:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: ratelimit the file system mounted message

On Tue, Aug 18, 2015 at 09:07:52AM +1000, Dave Chinner wrote:
> Yup. A lot of regression tests get written to tick a process box
> (i.e. did we fix regression X?), not because they provide on-going
> value to prevent future regressions. I try to push back against
> tests that won't provide us with useful protection against future
> regressions....

Yeah, that wasn't the case here. The bug was fixed by Salman Qazi at
Google in May 2012. The test was created by Eryu nearly a year later
in April 2013.

> Rather than time limiting, how about bounding the number of
> mount/unmount cycles?

Sure, that makes sense.

- Ted

2015-08-18 03:55:05

by Eryu Guan

[permalink] [raw]
Subject: Re: [PATCH] ext4: ratelimit the file system mounted message

On Tue, Aug 18, 2015 at 09:07:52AM +1000, Dave Chinner wrote:
> [cc [email protected] as we are talking about the test rather
> than the kernel behaviour. ]
>
[snip]
> > In all cases, ext3/305 reliably reproduced the failure within 30
> > mount/unmount cycles, and in most cases, under a dozen cycles. (i.e.,
> > under two seconds, and usually in a fraction of a second). So I'm not
> > entirely sure why the test was written to run the loop for 3 minutes
> > and thousands of mount/unmount cycles.
>
> There were lots of tests being written at the time that used a 3
> minute timeout. It's another of those red flags that I tend to
> push back on these days, and this is an example of why - usually the
> problem can be hit very quickly, or the test is extremely unreliable
> and long runtime is the only way to trigger the race. Hence
> running for X minutes doesn't really prove anything....

IIRC, 3 minutes time limit was based on my testing before I submitted
the patch, but I could be wrong, it was two years ago..

I think I have better understanding of xfstests and regression tests
now than two years ago, after years education on the list (mainly by
Dave :-))

>
> > Eryu, you wrote the test; any thoughts? At the very least I'd suggest
> > cutting the test down so that it runs for at most 2 seconds, if for no
> > other reason than to speed up regression test runs.
>
> Rather than time limiting, how about bounding the number of
> mount/unmount cycles?

Agreed, 30 cycles seem a reasonable number, I can prepare a patch if no
objection.

Thanks Ted and Dave for looking into this!

Eryu