2022-09-03 18:50:17

by Chuck Lever III

Subject: generic/650 makes v6.0-rc client unusable

While investigating some of the other issues that have been
reported lately, I've found that my v6.0-rc3 NFS/TCP client
goes off the rails often (but not always) during generic/650.

This is the test that runs a workload while offlining and
onlining CPUs. My test client has 12 physical cores.
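
For anyone not familiar with it, the heart of the test is roughly the
sysfs hotplug loop sketched below. This is only a sketch, not the
actual fstests script; the real test runs a filesystem workload
(fsstress, IIRC) in parallel with the hotplug churn.

  # Offline and re-online CPUs via sysfs while a workload runs elsewhere.
  # cpu0 is skipped because it is often not hot-removable.
  for c in /sys/devices/system/cpu/cpu[1-9]*; do
      [ -w "$c/online" ] || continue
      echo 0 > "$c/online"    # take the CPU offline
      sleep 1
      echo 1 > "$c/online"    # bring it back
      sleep 1
  done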

The test appears to start normally, but then after a bit
the NFS server workload drops to zero and the NFS mount
disappears. I can't run programs (sudo, for example) on
the client. Can't log in, even on the console. The console
has a constant stream of "can't rotate log: Input/Output
error" type messages.

I haven't looked further into this yet. Actually I'm not
quite sure where to start looking.

I recently switched this client from a local /home to an
NFS-mounted one, and that's where the xfstests are built
and run from, fwiw.


--
Chuck Lever




2022-09-04 09:05:36

by David Wysochanski

Subject: Re: generic/650 makes v6.0-rc client unusable

On Sat, Sep 3, 2022 at 2:44 PM Chuck Lever III <[email protected]> wrote:
>
> While investigating some of the other issues that have been
> reported lately, I've found that my v6.0-rc3 NFS/TCP client
> goes off the rails often (but not always) during generic/650.
>
> This is the test that runs a workload while offlining and
> onlining CPUs. My test client has 12 physical cores.
>
> The test appears to start normally, but then after a bit
> the NFS server workload drops to zero and the NFS mount
> disappears. I can't run programs (sudo, for example) on
> the client. Can't log in, even on the console. The console
> has a constant stream of "can't rotate log: Input/Output
> error" type messages.
>
I've seen this occasionally as well.

> I haven't looked further into this yet. Actually I'm not
> quite sure where to start looking.
>
> I recently switched this client from a local /home to an
> NFS-mounted one, and that's where the xfstests are built
> and run from, fwiw.
>
My testbeds have xfstests built/run on a local filesystem.

2022-09-04 13:00:37

by Theodore Ts'o

Subject: Re: generic/650 makes v6.0-rc client unusable

On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> While investigating some of the other issues that have been
> reported lately, I've found that my v6.0-rc3 NFS/TCP client
> goes off the rails often (but not always) during generic/650.
>
> This is the test that runs a workload while offlining and
> onlining CPUs. My test client has 12 physical cores.
>
> The test appears to start normally, but then after a bit
> the NFS server workload drops to zero and the NFS mount
> disappears. I can't run programs (sudo, for example) on
> the client. Can't log in, even on the console. The console
> has a constant stream of "can't rotate log: Input/Output
> error" type messages.

I've noticed problems with generic/650 for quite a while, but only
when running tests on GCE (not KVM). This wasn't specific to
running xfstests on ext4; IIRC, it was causing the VM to reboot
when testing any file system.

- Ted

commit 6e7867469bd3b135125a76e633e0bb50045ccb3c
Author: Theodore Ts'o <[email protected]>
Date: Fri Oct 22 23:24:31 2021 -0400

test-appliance: allow tests to be excluded based on the appliance flavor

The generic/650 test causes an instant reboot on GCE, so add
infrastructure to exclude a test based on the test appliance flavor
(i.e., android, gce, or kvm).

Signed-off-by: Theodore Ts'o <[email protected]>

2022-09-04 13:20:17

by Zorro Lang

Subject: Re: generic/650 makes v6.0-rc client unusable

On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> While investigating some of the other issues that have been
> reported lately, I've found that my v6.0-rc3 NFS/TCP client
> goes off the rails often (but not always) during generic/650.
>
> This is the test that runs a workload while offlining and
> onlining CPUs. My test client has 12 physical cores.
>
> The test appears to start normally, but then after a bit
> the NFS server workload drops to zero and the NFS mount
> disappears. I can't run programs (sudo, for example) on
> the client. Can't log in, even on the console. The console
> has a constant stream of "can't rotate log: Input/Output
> error" type messages.
>
> I haven't looked further into this yet. Actually I'm not
> quite sure where to start looking.
>
> I recently switched this client from a local /home to an
> NFS-mounted one, and that's where the xfstests are built
> and run from, fwiw.

If most users are complaining about generic/650, I'd like to exclude g/650 from
the "auto" default run group. Any more points?

Thanks,
Zorro

>
>
> --
> Chuck Lever
>
>
>

2022-09-04 16:06:14

by Chuck Lever III

Subject: Re: generic/650 makes v6.0-rc client unusable

Hi-

> On Sep 4, 2022, at 9:15 AM, Zorro Lang <[email protected]> wrote:
>
> On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
>> While investigating some of the other issues that have been
>> reported lately, I've found that my v6.0-rc3 NFS/TCP client
>> goes off the rails often (but not always) during generic/650.
>>
>> This is the test that runs a workload while offlining and
>> onlining CPUs. My test client has 12 physical cores.
>>
>> The test appears to start normally, but then after a bit
>> the NFS server workload drops to zero and the NFS mount
>> disappears. I can't run programs (sudo, for example) on
>> the client. Can't log in, even on the console. The console
>> has a constant stream of "can't rotate log: Input/Output
>> error" type messages.
>>
>> I haven't looked further into this yet. Actually I'm not
>> quite sure where to start looking.
>>
>> I recently switched this client from a local /home to an
>> NFS-mounted one, and that's where the xfstests are built
>> and run from, fwiw.
>
> If most users are complaining about generic/650, I'd like to exclude g/650 from
> the "auto" default run group. Any more points?

Well generic/650 was passing for me before v6.0-rc, and IMO
it is a tough but reasonable test, considering the ubiquitous
use of workqueues and other scheduling primitives in our
filesystems.

So I think I caught a real bug, but I need a couple more days
to work it out before deciding generic/650 is throwing false
negatives and is thus not worth running in the "auto" group.

I can't really say whether Ted's failing tests are the
result of an interaction with the GCE platform or the test
itself. Ie, his patch might be the right approach -- exclude
it based on the test platform.


--
Chuck Lever



2022-09-06 16:29:26

by Chuck Lever III

Subject: Re: generic/650 makes v6.0-rc client unusable



> On Sep 4, 2022, at 12:02 PM, Chuck Lever III <[email protected]> wrote:
>
> Hi-
>
>> On Sep 4, 2022, at 9:15 AM, Zorro Lang <[email protected]> wrote:
>>
>> On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
>>> While investigating some of the other issues that have been
>>> reported lately, I've found that my v6.0-rc3 NFS/TCP client
>>> goes off the rails often (but not always) during generic/650.
>>>
>>> This is the test that runs a workload while offlining and
>>> onlining CPUs. My test client has 12 physical cores.
>>>
>>> The test appears to start normally, but then after a bit
>>> the NFS server workload drops to zero and the NFS mount
>>> disappears. I can't run programs (sudo, for example) on
>>> the client. Can't log in, even on the console. The console
>>> has a constant stream of "can't rotate log: Input/Output
>>> error" type messages.
>>>
>>> I haven't looked further into this yet. Actually I'm not
>>> quite sure where to start looking.
>>>
>>> I recently switched this client from a local /home to an
>>> NFS-mounted one, and that's where the xfstests are built
>>> and run from, fwiw.
>>
>> If most users are complaining about generic/650, I'd like to exclude g/650 from
>> the "auto" default run group. Any more points?
>
> Well generic/650 was passing for me before v6.0-rc, and IMO
> it is a tough but reasonable test, considering the ubiquitous
> use of workqueues and other scheduling primitives in our
> filesystems.
>
> So I think I caught a real bug, but I need a couple more days
> to work it out before deciding generic/650 is throwing false
> negatives and is thus not worth running in the "auto" group.

Following up. I can't reproduce it any more. I've heard more
than one report that this failure can happen on non-NFS
configurations. I'd therefore conclude that I haven't caught
a bug in something I'm actively testing.

Carry on!


> I can't really say whether Ted's failing tests are the
> result of an interaction with the GCE platform or the test
> itself. Ie, his patch might be the right approach -- exclude
> it based on the test platform.

--
Chuck Lever



2022-11-09 04:22:42

by Shinichiro Kawasaki

Subject: Re: generic/650 makes v6.0-rc client unusable

On Sep 04, 2022 / 21:15, Zorro Lang wrote:
> On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> > While investigating some of the other issues that have been
> > reported lately, I've found that my v6.0-rc3 NFS/TCP client
> > goes off the rails often (but not always) during generic/650.
> >
> > This is the test that runs a workload while offlining and
> > onlining CPUs. My test client has 12 physical cores.
> >
> > The test appears to start normally, but then after a bit
> > the NFS server workload drops to zero and the NFS mount
> > disappears. I can't run programs (sudo, for example) on
> > the client. Can't log in, even on the console. The console
> > has a constant stream of "can't rotate log: Input/Output
> > error" type messages.

I also observed this failure when running fstests with btrfs on my HDDs.
The failure reproduces almost every time.

> >
> > I haven't looked further into this yet. Actually I'm not
> > quite sure where to start looking.
> >
> > I recently switched this client from a local /home to an
> > NFS-mounted one, and that's where the xfstests are built
> > and run from, fwiw.
>
> If most users are complaining about generic/650, I'd like to exclude g/650 from
> the "auto" default run group. Any more points?

+1. I'd like to remove it from the "auto" group. Since I cannot log in to the test
machine after the failure, I suggest putting it in the "dangerous" group.
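
(For reference, a test's group membership is declared by the
_begin_fstest line at the top of the test script itself, so this would
be a small change there. The excerpt below is only illustrative; the
real group list of generic/650 may differ.)

  # tests/generic/650 (illustrative header excerpt)
  . ./common/preamble
  _begin_fstest auto dangerous    # adjust the groups here, e.g. add "dangerous"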

--
Shin'ichiro Kawasaki

2022-11-09 10:37:39

by Filipe Manana

Subject: Re: generic/650 makes v6.0-rc client unusable

On Wed, Nov 9, 2022 at 4:22 AM Shinichiro Kawasaki
<[email protected]> wrote:
>
> On Sep 04, 2022 / 21:15, Zorro Lang wrote:
> > On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> > > While investigating some of the other issues that have been
> > > reported lately, I've found that my v6.0-rc3 NFS/TCP client
> > > goes off the rails often (but not always) during generic/650.
> > >
> > > This is the test that runs a workload while offlining and
> > > onlining CPUs. My test client has 12 physical cores.
> > >
> > > The test appears to start normally, but then after a bit
> > > the NFS server workload drops to zero and the NFS mount
> > > disappears. I can't run programs (sudo, for example) on
> > > the client. Can't log in, even on the console. The console
> > > has a constant stream of "can't rotate log: Input/Output
> > > error" type messages.
>
> I also observed this failure when running fstests with btrfs on my HDDs.
> The failure reproduces almost every time.

I'm wondering what you get in dmesg. Any traces?

I've excluded the test from my runs for over a year now, due to a crash
that I reported to the mm and cpu hotplug people here:

https://lore.kernel.org/linux-mm/CAL3q7H4AyrZ5erimDyO7mOVeppd5BeMw3CS=wGbzrMZrp56ktA@mail.gmail.com/

Unfortunately I got no reply from anyone who works on or maintains those
subsystems.

It didn't happen very often, and I haven't tested again with recent kernels.

>
> > >
> > > I haven't looked further into this yet. Actually I'm not
> > > quite sure where to start looking.
> > >
> > > I recently switched this client from a local /home to an
> > > NFS-mounted one, and that's where the xfstests are built
> > > and run from, fwiw.
> >
> > If most users are complaining about generic/650, I'd like to exclude g/650 from
> > the "auto" default run group. Any more points?
>
> +1. I'd like to remove it from the "auto" group. Since I cannot log in to the test
> machine after the failure, I suggest putting it in the "dangerous" group.
>
> --
> Shin'ichiro Kawasaki

2022-11-09 18:07:07

by Darrick J. Wong

Subject: Re: generic/650 makes v6.0-rc client unusable

On Wed, Nov 09, 2022 at 10:36:04AM +0000, Filipe Manana wrote:
> On Wed, Nov 9, 2022 at 4:22 AM Shinichiro Kawasaki
> <[email protected]> wrote:
> >
> > On Sep 04, 2022 / 21:15, Zorro Lang wrote:
> > > On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> > > > While investigating some of the other issues that have been
> > > > reported lately, I've found that my v6.0-rc3 NFS/TCP client
> > > > goes off the rails often (but not always) during generic/650.
> > > >
> > > > This is the test that runs a workload while offlining and
> > > > onlining CPUs. My test client has 12 physical cores.
> > > >
> > > > The test appears to start normally, but then after a bit
> > > > the NFS server workload drops to zero and the NFS mount
> > > > disappears. I can't run programs (sudo, for example) on
> > > > the client. Can't log in, even on the console. The console
> > > > has a constant stream of "can't rotate log: Input/Output
> > > > error" type messages.
> >
> > I also observed this failure when running fstests with btrfs on my HDDs.
> > The failure reproduces almost every time.
>
> I'm wondering what you get in dmesg. Any traces?
>
> I've excluded the test from my runs for over a year now, due to a crash
> that I reported to the mm and cpu hotplug people here:
>
> https://lore.kernel.org/linux-mm/CAL3q7H4AyrZ5erimDyO7mOVeppd5BeMw3CS=wGbzrMZrp56ktA@mail.gmail.com/
>
> Unfortunately I got no reply from anyone who works on or maintains those
> subsystems.
>
> It didn't happen very often, and I haven't tested again with recent kernels.

I've been testing with xfs/btrfs/ext4 nightly, and haven't seen any
problems with the last two. There's some very infrequent log accounting
problem that is probably a regression from Dave's recent round of log
refactorings, so once we're clear of the write race corruption problem,
I intend to inquire about that.

Granted I also don't have hundreds-of-cpus machines to test this kind of
stuff, so I don't know how well hotplug mania fares on a big iron.

I don't think it's valid to remove a test from the auto group because it
uncovers bugs. If test runner folks want to put it in their own exclude
lists for their own convenience, that's fine with me.

--D

> >
> > > >
> > > > I haven't looked further into this yet. Actually I'm not
> > > > quite sure where to start looking.
> > > >
> > > > I recently switched this client from a local /home to an
> > > > NFS-mounted one, and that's where the xfstests are built
> > > > and run from, fwiw.
> > >
> > > If most users are complaining about generic/650, I'd like to exclude g/650 from
> > > the "auto" default run group. Any more points?
> >
> > +1. I'd like to remove it from the "auto" group. Since I cannot log in to the test
> > machine after the failure, I suggest putting it in the "dangerous" group.
> >
> > --
> > Shin'ichiro Kawasaki

2022-11-10 08:54:18

by Shinichiro Kawasaki

Subject: Re: generic/650 makes v6.0-rc client unusable

On Nov 09, 2022 / 10:06, Darrick J. Wong wrote:

...

> I don't think it's valid to remove a test from the auto group because it
> uncovers bugs. If test runner folks want to put it in their own exclude
> lists for their own convenience, that's fine with me.

I see, then removing the test case from the auto group may not be a good idea.
I will add the test case to my exclude list.
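
(For other testers who want to do the same, something like the commands
below should work; the -e/-E exclude options are from my reading of
./check's usage text, so please double-check them against your fstests
version.)

  # keep a local exclude file and pass it to ./check:
  echo "generic/650" >> local.exclude
  ./check -g auto -E local.exclude

  # or exclude a single test directly on the command line:
  ./check -g auto -e generic/650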

--
Shin'ichiro Kawasaki

2022-11-10 08:54:18

by Shinichiro Kawasaki

Subject: Re: generic/650 makes v6.0-rc client unusable

On Nov 09, 2022 / 10:36, Filipe Manana wrote:
> On Wed, Nov 9, 2022 at 4:22 AM Shinichiro Kawasaki
> <[email protected]> wrote:
> >
> > On Sep 04, 2022 / 21:15, Zorro Lang wrote:
> > > On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> > > > While investigating some of the other issues that have been
> > > > reported lately, I've found that my v6.0-rc3 NFS/TCP client
> > > > goes off the rails often (but not always) during generic/650.
> > > >
> > > > This is the test that runs a workload while offlining and
> > > > onlining CPUs. My test client has 12 physical cores.
> > > >
> > > > The test appears to start normally, but then after a bit
> > > > the NFS server workload drops to zero and the NFS mount
> > > > disappears. I can't run programs (sudo, for example) on
> > > > the client. Can't log in, even on the console. The console
> > > > has a constant stream of "can't rotate log: Input/Output
> > > > error" type messages.
> >
> > I also observed this failure when running fstests with btrfs on my HDDs.
> > The failure reproduces almost every time.
>
> I'm wondering what you get in dmesg. Any traces?

The log I observed is at the end of this e-mail [1]. There is no BUG message.
The WARN "didn't collect load info for all cpus, balancing is broken" is
repeated, but I once saw the hang without this WARN.

The last message left was from xfs "ctx ticket reservation ran out. Need to up
reservation". This is for the system disk, not for the test target file system.

> I've excluded the test from my runs for over a year now, due to a crash
> that I reported to the mm and cpu hotplug people here:
>
> https://lore.kernel.org/linux-mm/CAL3q7H4AyrZ5erimDyO7mOVeppd5BeMw3CS=wGbzrMZrp56ktA@mail.gmail.com/
>
> Unfortunately I got no reply from anyone who works on or maintains those
> subsystems.
>
> It didn't happen very often, and I haven't tested again with recent kernels.

Thanks for sharing your experience. Hmm, your failure symptom is different from
mine.


[1]

Nov 09 11:50:09 redsun40 root[3480]: run xfstest generic/650
Nov 09 11:50:09 redsun40 unknown: run fstests generic/650 at 2022-11-09 11:50:09
Nov 09 11:50:09 redsun40 systemd[1]: Started fstests-generic-650.scope - /usr/bin/bash -c test -w /proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec ./tests/generic/650.
Nov 09 11:50:11 redsun40 kernel: smpboot: CPU 10 is now offline
Nov 09 11:50:11 redsun40 kernel: MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
Nov 09 11:50:11 redsun40 kernel: smpboot: CPU 14 is now offline
Nov 09 11:50:14 redsun40 kernel: smpboot: CPU 25 is now offline
Nov 09 11:50:15 redsun40 kernel: smpboot: Booting Node 0 Processor 14 APIC 0x1c
Nov 09 11:50:15 redsun40 kernel: x86/cpu: SGX disabled by BIOS.
Nov 09 11:50:15 redsun40 kernel: x86/tme: not enabled by BIOS
Nov 09 11:50:15 redsun40 kernel: CPU0: Thermal monitoring enabled (TM1)
Nov 09 11:50:15 redsun40 kernel: x86/cpu: User Mode Instruction Prevention (UMIP) activated
Nov 09 11:50:15 redsun40 kernel: smpboot: CPU 30 is now offline
Nov 09 11:50:17 redsun40 kernel: smpboot: CPU 2 is now offline
Nov 09 11:50:19 redsun40 kernel: smpboot: CPU 20 is now offline
Nov 09 11:50:22 redsun40 kernel: smpboot: CPU 31 is now offline
Nov 09 11:50:23 redsun40 kernel: smpboot: CPU 23 is now offline
Nov 09 11:50:24 redsun40 kernel: smpboot: Booting Node 0 Processor 10 APIC 0x14
Nov 09 11:50:26 redsun40 kernel: smpboot: CPU 10 is now offline
Nov 09 11:50:28 redsun40 kernel: smpboot: Booting Node 0 Processor 20 APIC 0x9
Nov 09 11:50:29 redsun40 kernel: smpboot: CPU 21 is now offline
Nov 09 11:50:30 redsun40 kernel: smpboot: CPU 16 is now offline
Nov 09 11:50:31 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 11:50:31 redsun40 kernel: smpboot: Booting Node 0 Processor 30 APIC 0x1d
Nov 09 11:50:32 redsun40 kernel: smpboot: CPU 18 is now offline
Nov 09 11:50:33 redsun40 kernel: smpboot: Booting Node 0 Processor 2 APIC 0x4
Nov 09 11:50:34 redsun40 kernel: smpboot: CPU 4 is now offline
Nov 09 11:50:35 redsun40 kernel: smpboot: CPU 19 is now offline
Nov 09 11:50:36 redsun40 kernel: smpboot: Booting Node 0 Processor 31 APIC 0x1f
Nov 09 11:50:37 redsun40 kernel: smpboot: CPU 27 is now offline
Nov 09 11:50:38 redsun40 kernel: smpboot: CPU 26 is now offline
Nov 09 11:50:39 redsun40 kernel: smpboot: CPU 11 is now offline
Nov 09 11:50:41 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken

...

Nov 09 12:28:51 redsun40 kernel: smpboot: Booting Node 0 Processor 31 APIC 0x1f
Nov 09 12:28:52 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:28:52 redsun40 kernel: smpboot: Booting Node 0 Processor 14 APIC 0x1c
Nov 09 12:28:52 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:28:53 redsun40 kernel: smpboot: CPU 24 is now offline
Nov 09 12:28:55 redsun40 kernel: smpboot: Booting Node 0 Processor 26 APIC 0x15
Nov 09 12:28:57 redsun40 kernel: smpboot: CPU 29 is now offline
Nov 09 12:28:58 redsun40 kernel: smpboot: Booting Node 0 Processor 20 APIC 0x9
Nov 09 12:28:59 redsun40 kernel: smpboot: Booting Node 0 Processor 24 APIC 0x11
Nov 09 12:29:00 redsun40 kernel: x86: Booting SMP configuration:
Nov 09 12:29:00 redsun40 kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
Nov 09 12:29:01 redsun40 kernel: smpboot: CPU 19 is now offline
Nov 09 12:29:02 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:29:04 redsun40 kernel: smpboot: Booting Node 0 Processor 7 APIC 0xe
Nov 09 12:29:04 redsun40 kernel: smpboot: CPU 1 is now offline
Nov 09 12:29:04 redsun40 kernel: XFS (nvme0n1p3): ctx ticket reservation ran out. Need to up reservation


--
Shin'ichiro Kawasaki

2022-11-10 15:26:28

by Theodore Ts'o

Subject: Re: generic/650 makes v6.0-rc client unusable

On Wed, Nov 09, 2022 at 10:06:29AM -0800, Darrick J. Wong wrote:
> I've been testing with xfs/btrfs/ext4 nightly, and haven't seen any
> problems with the last two. There's some very infrequent log accounting
> problem that is probably a regression from Dave's recent round of log
> refactorings, so once we're clear of the write race corruption problem,
> I intend to inquire about that.
>
> Granted I also don't have hundreds-of-cpus machines to test this kind of
> stuff, so I don't know how well hotplug mania fares on a big iron.
>
> I don't think it's valid to remove a test from the auto group because it
> uncovers bugs. If test runner folks want to put it in their own exclude
> lists for their own convenience, that's fine with me.

Well, for me, on a GCE VM (but not using KVM), using ***any*** file
system, the test is an automatic instant crash of the VM. It's
pretty clearly a CPU hotplug bug, not a file system bug. And given
that the purpose of running the test is to find file system bugs, and
running the test prevents the rest of the file system tests from
running, of course it's on my exclude list for gce-xfstests.

I don't care *that* much whether it's removed from the auto group or
not, or added to the dangerous group or not, but perhaps we should add
a comment that this may trigger unrelated bugs in CPU hotplug, so that
other testers don't run into this?

I'm also especially thinking about "drive-by testers", who might not
be tracking the fstests mailing list and won't know the nuances of "oh
yeah, you need to add this to the exclude list, or you may be
sorry....". On the other hand, that's why I recommend that drive-by
testers use things like my test runner infrastructure, and not
xfstests directly. :-)

- Ted