2021-03-16 21:12:20

by Luis Chamberlain

Subject: blktests: block/009 next-20210304 failure rate average of 1/448

I've managed to reproduce blktests block/009 failures with kdevops [0]
on linux-next tag next-20210304 with a current failure rate average of
1/448 (3 counted failures so far). I've documented the failure on
korg#212305 [1], along with instructions on how to reproduce it. The
failure happens on KVM virtualized guests. The host OS is Debian
testing, and the target kernel is linux-next.
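
For reference, a rate quoted as 1/N is the total number of runs divided by the number of failures. The sketch below shows only that arithmetic. The helper name `failure_rate` is hypothetical, and the 1344 runs in the usage line are inferred from 3 failures at 1/448 rather than taken from a log.

```shell
# failure_rate RUNS FAILURES
# Prints the average failure rate as "1/N" using integer division.
# Hypothetical helper, not part of blktests or kdevops.
failure_rate() {
	local runs=$1 failures=$2
	if [ "$failures" -eq 0 ]; then
		# No failures observed yet, so no 1/N rate can be stated.
		echo "0/${runs}"
		return 0
	fi
	echo "1/$(( runs / failures ))"
}

# Assumption: 3 failures at an average of 1/448 implies 3 * 448 = 1344 runs.
failure_rate 1344 3
```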

My personal suspicion is not on the block layer but on scsi_debug
because this can fail:

modprobe scsi_debug; rmmod scsi_debug

This may be a separate, secondary issue, but I figured I'd mention
it. To fix this latter issue I've looked at ways to make
scsi_debug_init() wait until its SCSI devices are probed; however,
it's not clear how to do this correctly. If someone has an idea, let
me know. If that also fixes the block/009 failure, then we know
scsi_debug was the cause.
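
One way to put a number on the load/unload failure is to repeat the command pair and count how often it fails. This is a minimal sketch: `count_failures` and `load_unload_scsi_debug` are hypothetical names, and the wrapper uses `&&` so that a failure in either step is counted, which differs slightly from the `;` form above.

```shell
# count_failures ITERATIONS COMMAND [ARGS...]
# Runs COMMAND the given number of times and prints how many runs failed.
count_failures() {
	local iterations=$1 failed=0 i=0
	shift
	while [ "$i" -lt "$iterations" ]; do
		"$@" >/dev/null 2>&1 || failed=$(( failed + 1 ))
		i=$(( i + 1 ))
	done
	echo "$failed"
}

# Load and immediately unload the module once.
# Fails if either modprobe or rmmod fails.
load_unload_scsi_debug() {
	modprobe scsi_debug && rmmod scsi_debug
}

# Usage (as root, on a disposable test guest):
#   count_failures 100 load_unload_scsi_debug
```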

[0] https://github.com/mcgrof/kdevops
[1] https://bugzilla.kernel.org/show_bug.cgi?id=212305

Luis


2021-03-16 21:22:34

by Luis Chamberlain

Subject: Re: blktests: block/009 next-20210304 failure rate average of 1/448

On Tue, Mar 16, 2021 at 05:46:45PM +0000, Luis Chamberlain wrote:
> I've managed to reproduce blktests block/009 failures with kdevops [0]
> on linux-next tag next-20210304 with a current failure rate average of
> 1/448 (3 counted failures so far).

Confirmed on next-20210316 with current failure rate at 1/1008

Luis

2021-03-18 17:57:08

by Luis Chamberlain

Subject: Re: blktests: block/009 next-20210304 failure rate average of 1/448

Adding linux-fsdevel as folks working on fstests might be
interested.

On Tue, Mar 16, 2021 at 05:46:45PM +0000, Luis Chamberlain wrote:
> My personal suspicion is not on the block layer but on scsi_debug
> because this can fail:
>
> modprobe scsi_debug; rmmod scsi_debug
>
> This second issue may be a secondary separate issue, but I figured
> I'd mention it. To fix this latter issue I've looked at ways to
> make scsi_debug_init() wait until its scsi devices are probed,
> however it's not clear how to do this correctly. If someone has
> an idea let me know. If that fixes this issue then we know it was
> that.

OK, so this other issue with scsi_debug does deserve its own tracking,
so I filed a bug for it. I also looked into how to resolve it.

Someone who works on SCSI should review my work, as I haven't touched
SCSI before except for the recent block layer work I did for the
blktrace races. However, my own analysis is that this should not be
fixed in scsi_debug but instead in the users of scsi_debug.

The rationale for that is here:

https://bugzilla.kernel.org/show_bug.cgi?id=212337

The skinny of it is that we have no control over when userspace may muck
with the newly exposed devices as they are being initialized, and
shoe-horning a solution into scsi_debug_init() will always be prone to
a race with userspace that never lets scsi_debug_init() complete.

So the best we can do is have the tools which use scsi_debug run
something like lsof *prior* to mucking with the devices and / or
removing the module.
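
For the module removal side, a complementary check to lsof is to poll the module reference count before calling rmmod. The sketch below assumes a kernel built with module unload support, which exposes `/sys/module/<name>/refcnt`. The function names and the optional sysfs root argument are hypothetical and exist only to make the helper easy to exercise outside a real guest.

```shell
# module_refcnt MODULE [SYSFS_ROOT]
# Prints the module's reference count. Fails if the refcnt file is not
# readable, for example because the module is not loaded.
module_refcnt() {
	local file="${2:-/sys}/module/$1/refcnt"
	[ -r "$file" ] || return 1
	cat "$file"
}

# wait_module_unused MODULE [MAX_TRIES] [SYSFS_ROOT]
# Polls once per second until the reference count reaches 0.
# Returns 0 when the module is unused or not loaded,
# and 1 if it is still in use after MAX_TRIES polls.
wait_module_unused() {
	local module=$1 tries=${2:-10} root=${3:-/sys} count
	while [ "$tries" -gt 0 ]; do
		count=$(module_refcnt "$module" "$root") || return 0
		[ "$count" -eq 0 ] && return 0
		tries=$(( tries - 1 ))
		sleep 1
	done
	return 1
}

# Usage (as root):
#   wait_module_unused scsi_debug 10 && modprobe -r scsi_debug
```

A zero reference count does not guarantee that rmmod will succeed, since userspace can still open a device between the check and the removal. It only narrows the window.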

I'll follow up with respective blktests / fstests patches, which I
suspect may address a few false positives.

Luis

2021-03-18 19:35:01

by Luis Chamberlain

Subject: Re: blktests: block/009 next-20210304 failure rate average of 1/448

On Tue, Mar 16, 2021 at 06:47:39PM +0000, Luis Chamberlain wrote:
> On Tue, Mar 16, 2021 at 05:46:45PM +0000, Luis Chamberlain wrote:
> > I've managed to reproduce blktests block/009 failures with kdevops [0]
> > on linux-next tag next-20210304 with a current failure rate average of
> > 1/448 (3 counted failures so far).
>
> Confirmed on next-20210316 with current failure rate at 1/1008

Just in case this was a scsi_debug issue instead (I am covering that
prospect in a separate bug just for scsi_debug, korg#212337 [0]), I
tried a userspace solution based on what I have observed. I can still
reproduce this block/009 failure. The failure rate is much lower,
though: I now have it at 1/1705, but alas it is still failing.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=212337

The patch below demonstrates the extra settle work attempted for
scsi_debug, and with it, block/009 is still failing. So either the
settle work needs *more* effort, or this is a real issue.

diff --git a/common/scsi_debug b/common/scsi_debug
index b48cdc9..ecdbcc6 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -8,13 +8,42 @@ _have_scsi_debug() {
 	_have_modules scsi_debug
 }
 
+# As per korg#212337 [0] we must do more work in userspace to settle
+# scsi_debug devices a bit more carefully.
+
+# [0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
+_settle_scsi_debug_device() {
+	SCSI_DEBUG_MAX_WAIT=10
+	SCSI_DEBUG_COUNT_WAIT_LOOP=0
+	while true ; do
+		if [[ -b $1 ]]; then
+			SCSI_DEBUG_LSOF_COUNT=$(lsof $1 | wc -l)
+			if [[ $SCSI_DEBUG_LSOF_COUNT -ne 0 ]]; then
+				sleep 1;
+			else
+				break
+			fi
+		else
+			# Let device come up
+			sleep 1
+
+			let SCSI_DEBUG_COUNT_WAIT_LOOP=$SCSI_DEBUG_COUNT_WAIT_LOOP+1
+			if [[ $SCSI_DEBUG_COUNT_WAIT_LOOP -ge $SCSI_DEBUG_MAX_WAIT ]]; then
+				break
+			fi
+		fi
+	done
+}
+
 _init_scsi_debug() {
 	if ! modprobe -r scsi_debug || ! modprobe scsi_debug "$@"; then
 		return 1
 	fi
-
 	udevadm settle
 
+	# Allow dependencies to load
+	sleep 1
+
 	local host_sysfs host target_sysfs target
 	SCSI_DEBUG_HOSTS=()
 	SCSI_DEBUG_TARGETS=()
@@ -43,6 +72,10 @@ _init_scsi_debug() {
 		return 1
 	fi
 
+	for i in $SCSI_DEBUG_DEVICES ; do
+		_settle_scsi_debug_device /dev/$i
+	done
+
 	return 0
 }
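
In the settle loop above, only the "device not present yet" branch counts toward the maximum wait, so a device that stays open would keep the loop spinning. A possible variant that bounds both cases is sketched below. It is not the posted patch: the names `settle_device_bounded`, `_dev_present` and `_dev_busy` are hypothetical, and the busy check relies on lsof exiting with status 0 when it finds an open file for the given name. If lsof is not installed, this sketch treats the device as idle.

```shell
# Hypothetical predicates, split out so they can be swapped per environment.
_dev_present() { [ -b "$1" ]; }
# lsof exits 0 when at least one process has the file open.
_dev_busy() { lsof -- "$1" >/dev/null 2>&1; }

# settle_device_bounded DEVICE [MAX_WAIT_SECONDS]
# Returns 0 once DEVICE exists and nothing has it open.
# Returns 1 if that has not happened within MAX_WAIT_SECONDS polls.
settle_device_bounded() {
	local dev=$1 remaining=${2:-10}
	while [ "$remaining" -gt 0 ]; do
		if _dev_present "$dev" && ! _dev_busy "$dev"; then
			return 0
		fi
		# Either the device is not up yet or something still has it open.
		sleep 1
		remaining=$(( remaining - 1 ))
	done
	return 1
}

# Usage inside _init_scsi_debug(), assuming SCSI_DEBUG_DEVICES is a bash
# array of device names:
#   for dev in "${SCSI_DEBUG_DEVICES[@]}"; do
#       settle_device_bounded "/dev/$dev" || return 1
#   done
```

Iterating with `"${SCSI_DEBUG_DEVICES[@]}"` matters if the variable is an array, because a plain `$SCSI_DEBUG_DEVICES` expands to the first element only.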