If you try to run tests such as generic/108 in a loop
you'll eventually see a failure, but the failure can
be a false positive and the test was just unable to remove
the scsi_debug module.
We need to give some time for the refcnt to become 0. For
instance for the test generic/108 the refcnt lingers between
2 and 1. It should be 0 when we're done but a bit of time
seems to be required. The chance of us trying to run rmmod
when the refcnt is 2 or 1 is low, about 1/30 times if you
run the test in a loop on linux-next today.
Likewise, even when its 0 we just need a tiny breather before
we can remove the module (sleep 10 suffices) but this is
only required on older kernels. Otherwise removing the module
will just fail.
Some of these races are documented on the korg#212337, and
Doug Gilbert has posted at least one patch attempt to try
to help with this [1]. The patch does not resolve all the
issues though, it helps though.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
[1] https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
---
common/scsi_debug | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/common/scsi_debug b/common/scsi_debug
index e7988469..3c9cd820 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -4,11 +4,13 @@
#
# Functions useful for tests on unique block devices
+. common/module
+
_require_scsi_debug()
{
# make sure we have the module and it's not already used
modinfo scsi_debug 2>&1 > /dev/null || _notrun "scsi_debug module not found"
- lsmod | grep -wq scsi_debug && (rmmod scsi_debug || _notrun "scsi_debug module in use")
+ lsmod | grep -wq scsi_debug && (_patient_rmmod scsi_debug || _notrun "scsi_debug module in use")
# make sure it has the features we need
# logical/physical sectors plus unmap support all went in together
modinfo scsi_debug | grep -wq sector_size || _notrun "scsi_debug too old"
@@ -53,5 +55,5 @@ _put_scsi_debug_dev()
$UDEV_SETTLE_PROG
n=$((n-1))
done
- rmmod scsi_debug || _fail "Could not remove scsi_debug module"
+ _patient_rmmod scsi_debug || _fail "Could not remove scsi_debug module"
}
--
2.30.2
On Wed, Aug 11, 2021 at 08:45:12AM -0700, Luis Chamberlain wrote:
> If you try to run tests such as generic/108 in a loop
> you'll eventually see a failure, but the failure can
> be a false positive and the test was just unable to remove
> the scsi_debug module.
>
> We need to give some time for the refcnt to become 0. For
> instance for the test generic/108 the refcnt lingers between
> 2 and 1. It should be 0 when we're done but a bit of time
> seems to be required. The chance of us trying to run rmmod
> when the refcnt is 2 or 1 is low, about 1/30 times if you
> run the test in a loop on linux-next today.
>
> Likewise, even when its 0 we just need a tiny breather before
> we can remove the module (sleep 10 suffices) but this is
> only required on older kernels. Otherwise removing the module
> will just fail.
>
> Some of these races are documented on the korg#212337, and
> Doug Gilbert has posted at least one patch attempt to try
> to help with this [1]. The patch does not resolve all the
> issues though, it helps though.
>
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
> [1] https://lkml.kernel.org/r/[email protected]
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> common/scsi_debug | 6 ++++--
There're also some "modprobe -r" calls in common/module, should the be
replaced with _patient_rmmod as well?
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/common/scsi_debug b/common/scsi_debug
> index e7988469..3c9cd820 100644
> --- a/common/scsi_debug
> +++ b/common/scsi_debug
> @@ -4,11 +4,13 @@
> #
> # Functions useful for tests on unique block devices
>
> +. common/module
> +
> _require_scsi_debug()
> {
> # make sure we have the module and it's not already used
> modinfo scsi_debug 2>&1 > /dev/null || _notrun "scsi_debug module not found"
> - lsmod | grep -wq scsi_debug && (rmmod scsi_debug || _notrun "scsi_debug module in use")
> + lsmod | grep -wq scsi_debug && (_patient_rmmod scsi_debug || _notrun "scsi_debug module in use")
I don't think we should use _patient_rmmod here, if we set timeout to
forever and there's actually some other process using scsi_debug module,
we'll loop forever here. And a failure to remove scsi_debug only results
in _notrun, and won't cause false test failure.
> # make sure it has the features we need
> # logical/physical sectors plus unmap support all went in together
> modinfo scsi_debug | grep -wq sector_size || _notrun "scsi_debug too old"
> @@ -53,5 +55,5 @@ _put_scsi_debug_dev()
> $UDEV_SETTLE_PROG
> n=$((n-1))
> done
I think the above while loop could be removed as well?
Thanks,
Eryu
> - rmmod scsi_debug || _fail "Could not remove scsi_debug module"
> + _patient_rmmod scsi_debug || _fail "Could not remove scsi_debug module"
> }
> --
> 2.30.2
Sorry for some reason I missed this email. Replies are below.
On Sun, Aug 15, 2021 at 08:35:02PM +0800, Eryu Guan wrote:
> On Wed, Aug 11, 2021 at 08:45:12AM -0700, Luis Chamberlain wrote:
> > If you try to run tests such as generic/108 in a loop
> > you'll eventually see a failure, but the failure can
> > be a false positive and the test was just unable to remove
> > the scsi_debug module.
> >
> > We need to give some time for the refcnt to become 0. For
> > instance for the test generic/108 the refcnt lingers between
> > 2 and 1. It should be 0 when we're done but a bit of time
> > seems to be required. The chance of us trying to run rmmod
> > when the refcnt is 2 or 1 is low, about 1/30 times if you
> > run the test in a loop on linux-next today.
> >
> > Likewise, even when its 0 we just need a tiny breather before
> > we can remove the module (sleep 10 suffices) but this is
> > only required on older kernels. Otherwise removing the module
> > will just fail.
> >
> > Some of these races are documented on the korg#212337, and
> > Doug Gilbert has posted at least one patch attempt to try
> > to help with this [1]. The patch does not resolve all the
> > issues though, it helps though.
> >
> > [0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
> > [1] https://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Luis Chamberlain <[email protected]>
> > ---
> > common/scsi_debug | 6 ++++--
>
> There're also some "modprobe -r" calls in common/module, should the be
> replaced with _patient_rmmod as well?
Ah, yes well... indeed, that could very well be the case. But that
is another change. It can surely help fix quite a bit of other odd
false positives.
I would prefer if such changes would go in separately.
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/common/scsi_debug b/common/scsi_debug
> > index e7988469..3c9cd820 100644
> > --- a/common/scsi_debug
> > +++ b/common/scsi_debug
> > @@ -4,11 +4,13 @@
> > #
> > # Functions useful for tests on unique block devices
> >
> > +. common/module
> > +
> > _require_scsi_debug()
> > {
> > # make sure we have the module and it's not already used
> > modinfo scsi_debug 2>&1 > /dev/null || _notrun "scsi_debug module not found"
> > - lsmod | grep -wq scsi_debug && (rmmod scsi_debug || _notrun "scsi_debug module in use")
> > + lsmod | grep -wq scsi_debug && (_patient_rmmod scsi_debug || _notrun "scsi_debug module in use")
>
> I don't think we should use _patient_rmmod here, if we set timeout to
> forever and there's actually some other process using scsi_debug module,
> we'll loop forever here. And a failure to remove scsi_debug only results
> in _notrun, and won't cause false test failure.
If one sets the timeout to forever and if the module is already loaded
and we cannot remove it, I'd say its working as intended. But indeed,
fstests does not leave modules lingering around... it would be a user
error. I'll add a special case for this.
> > # make sure it has the features we need
> > # logical/physical sectors plus unmap support all went in together
> > modinfo scsi_debug | grep -wq sector_size || _notrun "scsi_debug too old"
> > @@ -53,5 +55,5 @@ _put_scsi_debug_dev()
> > $UDEV_SETTLE_PROG
> > n=$((n-1))
> > done
>
> I think the above while loop could be removed as well?
Good call, yeah that crap is not needed anymore. I'd still leave in
place the UDEV_SETTLE_PROG though, as I think that's good practice
and we can't loose anything. It can for instance salvage buggy
tests which forgot to use it.
Luis