Hi Liam,
We caught a "task blocked" dmesg in maple tree test. Not sure if this
is expected for maple tree test, so we are sending this report for
your information. Thanks.
Greeting,
FYI, we noticed INFO:task_blocked_for_more_than#seconds due to commit (built with clang-14):
commit: 120b116208a0877227fc82e3f0df81e7a3ed4ab1 ("maple_tree: reorganize testing to restore module testing")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
[ 17.318428][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
[ 17.319219][ T1]
[ 17.319219][ T1] TEST STARTING
[ 17.319219][ T1]
[ 999.249871][ T23] INFO: task rcu_scale_shutd:59 blocked for more than 491 seconds.
[ 999.253363][ T23] Not tainted 6.1.0-rc4-00003-g120b116208a0 #1
[ 999.254249][ T23] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 999.255390][ T23] task:rcu_scale_shutd state:D stack:30968 pid:59 ppid:2 flags:0x00004000
[ 999.256934][ T23] Call Trace:
[ 999.257418][ T23] <TASK>
[ 999.257900][ T23] __schedule+0x169b/0x1f90
[ 999.261677][ T23] schedule+0x151/0x300
[ 999.262281][ T23] ? compute_real+0xe0/0xe0
[ 999.263364][ T23] rcu_scale_shutdown+0xdd/0x130
[ 999.264093][ T23] ? wake_bit_function+0x2c0/0x2c0
[ 999.268985][ T23] kthread+0x309/0x3a0
[ 999.269958][ T23] ? compute_real+0xe0/0xe0
[ 999.270552][ T23] ? kthread_unuse_mm+0x200/0x200
[ 999.271281][ T23] ret_from_fork+0x1f/0x30
[ 999.272385][ T23] </TASK>
[ 999.272865][ T23]
[ 999.272865][ T23] Showing all locks held in the system:
[ 999.273988][ T23] 2 locks held by swapper/0/1:
[ 999.274684][ T23] 1 lock held by khungtaskd/23:
[ 999.275400][ T23] #0: ffffffff88346e00 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x8/0x30
[ 999.277171][ T23]
[ 999.277525][ T23] =============================================
[ 999.277525][ T23]
[ 1049.050884][ T1] maple_tree: 12610686 of 12610686 tests passed
If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]
To reproduce:
# build kernel
cd linux
cp config-6.1.0-rc4-00003-g120b116208a0 .config
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
On Tue, Jan 31, 2023 at 03:18:22PM +0800, kernel test robot wrote:
> Hi Liam,
>
> We caught a "task blocked" dmesg in maple tree test. Not sure if this
> is expected for maple tree test, so we are sending this report for
> your information. Thanks.
>
> Greeting,
>
> FYI, we noticed INFO:task_blocked_for_more_than#seconds due to commit (built with clang-14):
>
> commit: 120b116208a0877227fc82e3f0df81e7a3ed4ab1 ("maple_tree: reorganize testing to restore module testing")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> [ 17.318428][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
> [ 17.319219][ T1]
> [ 17.319219][ T1] TEST STARTING
> [ 17.319219][ T1]
> [ 999.249871][ T23] INFO: task rcu_scale_shutd:59 blocked for more than 491 seconds.
> [ 999.253363][ T23] Not tainted 6.1.0-rc4-00003-g120b116208a0 #1
> [ 999.254249][ T23] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 999.255390][ T23] task:rcu_scale_shutd state:D stack:30968 pid:59 ppid:2 flags:0x00004000
> [ 999.256934][ T23] Call Trace:
> [ 999.257418][ T23] <TASK>
> [ 999.257900][ T23] __schedule+0x169b/0x1f90
> [ 999.261677][ T23] schedule+0x151/0x300
> [ 999.262281][ T23] ? compute_real+0xe0/0xe0
> [ 999.263364][ T23] rcu_scale_shutdown+0xdd/0x130
> [ 999.264093][ T23] ? wake_bit_function+0x2c0/0x2c0
> [ 999.268985][ T23] kthread+0x309/0x3a0
> [ 999.269958][ T23] ? compute_real+0xe0/0xe0
> [ 999.270552][ T23] ? kthread_unuse_mm+0x200/0x200
> [ 999.271281][ T23] ret_from_fork+0x1f/0x30
> [ 999.272385][ T23] </TASK>
> [ 999.272865][ T23]
> [ 999.272865][ T23] Showing all locks held in the system:
> [ 999.273988][ T23] 2 locks held by swapper/0/1:
> [ 999.274684][ T23] 1 lock held by khungtaskd/23:
> [ 999.275400][ T23] #0: ffffffff88346e00 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x8/0x30
> [ 999.277171][ T23]
> [ 999.277525][ T23] =============================================
> [ 999.277525][ T23]
> [ 1049.050884][ T1] maple_tree: 12610686 of 12610686 tests passed
>
>
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-lkp/[email protected]
Liam brought this to my attention on IRC, and it looks like the root
cause is that the rcuscale code does not deal gracefully with grace
periods that are in much excess of a second in duration.
Now, it might well be worth looking into why the grace periods were taking
that long, but if you were running Maple Tree stress tests concurrently
with rcuscale, this might well be expected behavior.
So, does the patch below clear this up for you?
Thanx, Paul
------------------------------------------------------------------------
commit 8e44d51e3411994091f7c7c136286d82c5757a4a
Author: Paul E. McKenney <[email protected]>
Date: Tue Jan 31 12:08:54 2023 -0800
rcuscale: Move shutdown from wait_event() to wait_event_idle()
The rcu_scale_shutdown() and kfree_scale_shutdown() kthreads/functions
use wait_event() to wait for the rcuscale test to complete. However,
each updater thread in such a test waits for at least 100 grace periods.
If each grace period takes more than 1.2 seconds, which is long, but
not insanely so, this can trigger the hung-task timeout.
This commit therefore replaces those wait_event() calls with calls to
wait_event_idle(), which do not trigger the hung-task timeout.
Reported-by: kernel test robot <[email protected]>
Reported-by: Liam Howlett <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 91fb5905a008f..4120f94030c3c 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -631,8 +631,7 @@ static int compute_real(int n)
static int
rcu_scale_shutdown(void *arg)
{
- wait_event(shutdown_wq,
- atomic_read(&n_rcu_scale_writer_finished) >= nrealwriters);
+ wait_event_idle(shutdown_wq, atomic_read(&n_rcu_scale_writer_finished) >= nrealwriters);
smp_mb(); /* Wake before output. */
rcu_scale_cleanup();
kernel_power_off();
@@ -771,8 +770,8 @@ kfree_scale_cleanup(void)
static int
kfree_scale_shutdown(void *arg)
{
- wait_event(shutdown_wq,
- atomic_read(&n_kfree_scale_thread_ended) >= kfree_nrealthreads);
+ wait_event_idle(shutdown_wq,
+ atomic_read(&n_kfree_scale_thread_ended) >= kfree_nrealthreads);
smp_mb(); /* Wake before output. */
* Paul E. McKenney <[email protected]> [230131 15:26]:
> On Tue, Jan 31, 2023 at 03:18:22PM +0800, kernel test robot wrote:
> > Hi Liam,
> >
> > We caught a "task blocked" dmesg in maple tree test. Not sure if this
> > is expected for maple tree test, so we are sending this report for
> > your information. Thanks.
> >
> > Greeting,
> >
> > FYI, we noticed INFO:task_blocked_for_more_than#seconds due to commit (built with clang-14):
> >
> > commit: 120b116208a0877227fc82e3f0df81e7a3ed4ab1 ("maple_tree: reorganize testing to restore module testing")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > in testcase: boot
> >
> > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> >
> >
> > [ 17.318428][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
> > [ 17.319219][ T1]
> > [ 17.319219][ T1] TEST STARTING
> > [ 17.319219][ T1]
> > [ 999.249871][ T23] INFO: task rcu_scale_shutd:59 blocked for more than 491 seconds.
> > [ 999.253363][ T23] Not tainted 6.1.0-rc4-00003-g120b116208a0 #1
> > [ 999.254249][ T23] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 999.255390][ T23] task:rcu_scale_shutd state:D stack:30968 pid:59 ppid:2 flags:0x00004000
> > [ 999.256934][ T23] Call Trace:
> > [ 999.257418][ T23] <TASK>
> > [ 999.257900][ T23] __schedule+0x169b/0x1f90
> > [ 999.261677][ T23] schedule+0x151/0x300
> > [ 999.262281][ T23] ? compute_real+0xe0/0xe0
> > [ 999.263364][ T23] rcu_scale_shutdown+0xdd/0x130
> > [ 999.264093][ T23] ? wake_bit_function+0x2c0/0x2c0
> > [ 999.268985][ T23] kthread+0x309/0x3a0
> > [ 999.269958][ T23] ? compute_real+0xe0/0xe0
> > [ 999.270552][ T23] ? kthread_unuse_mm+0x200/0x200
> > [ 999.271281][ T23] ret_from_fork+0x1f/0x30
> > [ 999.272385][ T23] </TASK>
> > [ 999.272865][ T23]
> > [ 999.272865][ T23] Showing all locks held in the system:
> > [ 999.273988][ T23] 2 locks held by swapper/0/1:
> > [ 999.274684][ T23] 1 lock held by khungtaskd/23:
> > [ 999.275400][ T23] #0: ffffffff88346e00 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x8/0x30
> > [ 999.277171][ T23]
> > [ 999.277525][ T23] =============================================
> > [ 999.277525][ T23]
> > [ 1049.050884][ T1] maple_tree: 12610686 of 12610686 tests passed
> >
> >
> > If you fix the issue, kindly add following tag
> > | Reported-by: kernel test robot <[email protected]>
> > | Link: https://lore.kernel.org/oe-lkp/[email protected]
>
> Liam brought this to my attention on IRC, and it looks like the root
> cause is that the rcuscale code does not deal gracefully with grace
> periods that are in much excess of a second in duration.
>
> Now, it might well be worth looking into why the grace periods were taking
> that long, but if you were running Maple Tree stress tests concurrently
> with rcuscale, this might well be expected behavior.
>
This could be simply cpu starvation causing no foward progress in your
tests with the number of concurrent running tests and "-smp 2".
It's also worth noting that building in the rcu test module makes the
machine turn off once the test is complete. This can be seen in your
console message:
[ 13.254240][ T1] rcu-scale:--- Start of test: nreaders=2 nwriters=2 verbose=1 shutdown=1
so your machine may not have finished running through the array of tests
you have specified to build in - which is a lot. I'm not sure if this
is the best approach considering the load that produces on the system
and how difficult it is (was) to figure out which test is causing a
stall, or other issue.
...
Thanks,
Liam
On Tue, Jan 31, 2023 at 03:45:20PM -0500, Liam R. Howlett wrote:
> * Paul E. McKenney <[email protected]> [230131 15:26]:
> > On Tue, Jan 31, 2023 at 03:18:22PM +0800, kernel test robot wrote:
> > > Hi Liam,
> > >
> > > We caught a "task blocked" dmesg in maple tree test. Not sure if this
> > > is expected for maple tree test, so we are sending this report for
> > > your information. Thanks.
> > >
> > > Greeting,
> > >
> > > FYI, we noticed INFO:task_blocked_for_more_than#seconds due to commit (built with clang-14):
> > >
> > > commit: 120b116208a0877227fc82e3f0df81e7a3ed4ab1 ("maple_tree: reorganize testing to restore module testing")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > in testcase: boot
> > >
> > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > >
> > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > >
> > >
> > > [ 17.318428][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
> > > [ 17.319219][ T1]
> > > [ 17.319219][ T1] TEST STARTING
> > > [ 17.319219][ T1]
> > > [ 999.249871][ T23] INFO: task rcu_scale_shutd:59 blocked for more than 491 seconds.
> > > [ 999.253363][ T23] Not tainted 6.1.0-rc4-00003-g120b116208a0 #1
> > > [ 999.254249][ T23] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 999.255390][ T23] task:rcu_scale_shutd state:D stack:30968 pid:59 ppid:2 flags:0x00004000
> > > [ 999.256934][ T23] Call Trace:
> > > [ 999.257418][ T23] <TASK>
> > > [ 999.257900][ T23] __schedule+0x169b/0x1f90
> > > [ 999.261677][ T23] schedule+0x151/0x300
> > > [ 999.262281][ T23] ? compute_real+0xe0/0xe0
> > > [ 999.263364][ T23] rcu_scale_shutdown+0xdd/0x130
> > > [ 999.264093][ T23] ? wake_bit_function+0x2c0/0x2c0
> > > [ 999.268985][ T23] kthread+0x309/0x3a0
> > > [ 999.269958][ T23] ? compute_real+0xe0/0xe0
> > > [ 999.270552][ T23] ? kthread_unuse_mm+0x200/0x200
> > > [ 999.271281][ T23] ret_from_fork+0x1f/0x30
> > > [ 999.272385][ T23] </TASK>
> > > [ 999.272865][ T23]
> > > [ 999.272865][ T23] Showing all locks held in the system:
> > > [ 999.273988][ T23] 2 locks held by swapper/0/1:
> > > [ 999.274684][ T23] 1 lock held by khungtaskd/23:
> > > [ 999.275400][ T23] #0: ffffffff88346e00 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x8/0x30
> > > [ 999.277171][ T23]
> > > [ 999.277525][ T23] =============================================
> > > [ 999.277525][ T23]
> > > [ 1049.050884][ T1] maple_tree: 12610686 of 12610686 tests passed
> > >
> > >
> > > If you fix the issue, kindly add following tag
> > > | Reported-by: kernel test robot <[email protected]>
> > > | Link: https://lore.kernel.org/oe-lkp/[email protected]
> >
> > Liam brought this to my attention on IRC, and it looks like the root
> > cause is that the rcuscale code does not deal gracefully with grace
> > periods that are in much excess of a second in duration.
> >
> > Now, it might well be worth looking into why the grace periods were taking
> > that long, but if you were running Maple Tree stress tests concurrently
> > with rcuscale, this might well be expected behavior.
> >
>
> This could be simply cpu starvation causing no foward progress in your
> tests with the number of concurrent running tests and "-smp 2".
>
> It's also worth noting that building in the rcu test module makes the
> machine turn off once the test is complete. This can be seen in your
> console message:
> [ 13.254240][ T1] rcu-scale:--- Start of test: nreaders=2 nwriters=2 verbose=1 shutdown=1
>
> so your machine may not have finished running through the array of tests
> you have specified to build in - which is a lot. I'm not sure if this
> is the best approach considering the load that produces on the system
> and how difficult it is (was) to figure out which test is causing a
> stall, or other issue.
Agreed, both rcuscale and refscale when built in turn the machine off at
the end of the test. For providing background stress for some other test
(in this case Maple Tree tests), rcutorture, locktorture, or scftorture
might be better choices.
Thanx, Paul
Hi Paul, Hi Liam,
On Tue, Jan 31, 2023 at 12:52:21PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 31, 2023 at 03:45:20PM -0500, Liam R. Howlett wrote:
> > * Paul E. McKenney <[email protected]> [230131 15:26]:
> > > On Tue, Jan 31, 2023 at 03:18:22PM +0800, kernel test robot wrote:
> > > > Hi Liam,
> > > >
> > > > We caught a "task blocked" dmesg in maple tree test. Not sure if this
> > > > is expected for maple tree test, so we are sending this report for
> > > > your information. Thanks.
> > > >
> > > > Greeting,
> > > >
> > > > FYI, we noticed INFO:task_blocked_for_more_than#seconds due to commit (built with clang-14):
> > > >
> > > > commit: 120b116208a0877227fc82e3f0df81e7a3ed4ab1 ("maple_tree: reorganize testing to restore module testing")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > in testcase: boot
> > > >
> > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > > >
> > > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > > >
> > > >
> > > > [ 17.318428][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
> > > > [ 17.319219][ T1]
> > > > [ 17.319219][ T1] TEST STARTING
> > > > [ 17.319219][ T1]
> > > > [ 999.249871][ T23] INFO: task rcu_scale_shutd:59 blocked for more than 491 seconds.
> > > > [ 999.253363][ T23] Not tainted 6.1.0-rc4-00003-g120b116208a0 #1
> > > > [ 999.254249][ T23] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 999.255390][ T23] task:rcu_scale_shutd state:D stack:30968 pid:59 ppid:2 flags:0x00004000
> > > > [ 999.256934][ T23] Call Trace:
> > > > [ 999.257418][ T23] <TASK>
> > > > [ 999.257900][ T23] __schedule+0x169b/0x1f90
> > > > [ 999.261677][ T23] schedule+0x151/0x300
> > > > [ 999.262281][ T23] ? compute_real+0xe0/0xe0
> > > > [ 999.263364][ T23] rcu_scale_shutdown+0xdd/0x130
> > > > [ 999.264093][ T23] ? wake_bit_function+0x2c0/0x2c0
> > > > [ 999.268985][ T23] kthread+0x309/0x3a0
> > > > [ 999.269958][ T23] ? compute_real+0xe0/0xe0
> > > > [ 999.270552][ T23] ? kthread_unuse_mm+0x200/0x200
> > > > [ 999.271281][ T23] ret_from_fork+0x1f/0x30
> > > > [ 999.272385][ T23] </TASK>
> > > > [ 999.272865][ T23]
> > > > [ 999.272865][ T23] Showing all locks held in the system:
> > > > [ 999.273988][ T23] 2 locks held by swapper/0/1:
> > > > [ 999.274684][ T23] 1 lock held by khungtaskd/23:
> > > > [ 999.275400][ T23] #0: ffffffff88346e00 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x8/0x30
> > > > [ 999.277171][ T23]
> > > > [ 999.277525][ T23] =============================================
> > > > [ 999.277525][ T23]
> > > > [ 1049.050884][ T1] maple_tree: 12610686 of 12610686 tests passed
> > > >
> > > >
> > > > If you fix the issue, kindly add following tag
> > > > | Reported-by: kernel test robot <[email protected]>
> > > > | Link: https://lore.kernel.org/oe-lkp/[email protected]
> > >
> > > Liam brought this to my attention on IRC, and it looks like the root
> > > cause is that the rcuscale code does not deal gracefully with grace
> > > periods that are in much excess of a second in duration.
> > >
> > > Now, it might well be worth looking into why the grace periods were taking
> > > that long, but if you were running Maple Tree stress tests concurrently
> > > with rcuscale, this might well be expected behavior.
> > >
> >
> > This could be simply cpu starvation causing no foward progress in your
> > tests with the number of concurrent running tests and "-smp 2".
> >
> > It's also worth noting that building in the rcu test module makes the
> > machine turn off once the test is complete. This can be seen in your
> > console message:
> > [ 13.254240][ T1] rcu-scale:--- Start of test: nreaders=2 nwriters=2 verbose=1 shutdown=1
> >
> > so your machine may not have finished running through the array of tests
> > you have specified to build in - which is a lot. I'm not sure if this
> > is the best approach considering the load that produces on the system
> > and how difficult it is (was) to figure out which test is causing a
> > stall, or other issue.
>
> Agreed, both rcuscale and refscale when built in turn the machine off at
> the end of the test. For providing background stress for some other test
> (in this case Maple Tree tests), rcutorture, locktorture, or scftorture
> might be better choices.
Thanks for looking into this. This is a boot test on a randconfig
kernel, and yes, it happend to select CONFIG_RCU_SCALE_TEST=y together
with CONFIG_TEST_MAPLE_TREE=y, leading to the situation in this case.
I've tested the patch on same config, it does clear up the "task
blocked" log, though it still waits a long time at this step. The test
result is as follows:
[ 18.397784][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
[ 18.398646][ T1]
[ 18.398646][ T1] TEST STARTING
[ 18.398646][ T1]
[ 1266.450656][ T1] maple_tree: 12610686 of 12610686 tests passed
[ 1266.451749][ T1] initcall maple_tree_seed+0x0/0x15d0 returned 0 after 1248053116 usecs
...
=========================================================================================
compiler/kconfig/rootfs/sleep/tbox_group/testcase:
clang-14/x86_64-randconfig-a006-20230116/yocto-x86_64-minimal-20190520.cgz/300/vm-snb/boot
commit:
120b116208a08 ("maple_tree: reorganize testing to restore module testing")
4b1aafbdb208f ("rcuscale: Move shutdown from wait_event() to wait_event_idle()")
120b116208a08 4b1aafbdb208fb4e10bf7abff1a
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
5:6 -83% :6 dmesg.INFO:task_blocked_for_more_than#seconds
Tested-by: Yujie Liu <[email protected]>
--
Best Regards,
Yujie
On Wed, Feb 01, 2023 at 04:37:00PM +0800, Yujie Liu wrote:
> Hi Paul, Hi Liam,
>
> On Tue, Jan 31, 2023 at 12:52:21PM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 31, 2023 at 03:45:20PM -0500, Liam R. Howlett wrote:
> > > * Paul E. McKenney <[email protected]> [230131 15:26]:
> > > > On Tue, Jan 31, 2023 at 03:18:22PM +0800, kernel test robot wrote:
> > > > > Hi Liam,
> > > > >
> > > > > We caught a "task blocked" dmesg in maple tree test. Not sure if this
> > > > > is expected for maple tree test, so we are sending this report for
> > > > > your information. Thanks.
> > > > >
> > > > > Greeting,
> > > > >
> > > > > FYI, we noticed INFO:task_blocked_for_more_than#seconds due to commit (built with clang-14):
> > > > >
> > > > > commit: 120b116208a0877227fc82e3f0df81e7a3ed4ab1 ("maple_tree: reorganize testing to restore module testing")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > in testcase: boot
> > > > >
> > > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > > > >
> > > > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > > > >
> > > > >
> > > > > [ 17.318428][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
> > > > > [ 17.319219][ T1]
> > > > > [ 17.319219][ T1] TEST STARTING
> > > > > [ 17.319219][ T1]
> > > > > [ 999.249871][ T23] INFO: task rcu_scale_shutd:59 blocked for more than 491 seconds.
> > > > > [ 999.253363][ T23] Not tainted 6.1.0-rc4-00003-g120b116208a0 #1
> > > > > [ 999.254249][ T23] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > [ 999.255390][ T23] task:rcu_scale_shutd state:D stack:30968 pid:59 ppid:2 flags:0x00004000
> > > > > [ 999.256934][ T23] Call Trace:
> > > > > [ 999.257418][ T23] <TASK>
> > > > > [ 999.257900][ T23] __schedule+0x169b/0x1f90
> > > > > [ 999.261677][ T23] schedule+0x151/0x300
> > > > > [ 999.262281][ T23] ? compute_real+0xe0/0xe0
> > > > > [ 999.263364][ T23] rcu_scale_shutdown+0xdd/0x130
> > > > > [ 999.264093][ T23] ? wake_bit_function+0x2c0/0x2c0
> > > > > [ 999.268985][ T23] kthread+0x309/0x3a0
> > > > > [ 999.269958][ T23] ? compute_real+0xe0/0xe0
> > > > > [ 999.270552][ T23] ? kthread_unuse_mm+0x200/0x200
> > > > > [ 999.271281][ T23] ret_from_fork+0x1f/0x30
> > > > > [ 999.272385][ T23] </TASK>
> > > > > [ 999.272865][ T23]
> > > > > [ 999.272865][ T23] Showing all locks held in the system:
> > > > > [ 999.273988][ T23] 2 locks held by swapper/0/1:
> > > > > [ 999.274684][ T23] 1 lock held by khungtaskd/23:
> > > > > [ 999.275400][ T23] #0: ffffffff88346e00 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x8/0x30
> > > > > [ 999.277171][ T23]
> > > > > [ 999.277525][ T23] =============================================
> > > > > [ 999.277525][ T23]
> > > > > [ 1049.050884][ T1] maple_tree: 12610686 of 12610686 tests passed
> > > > >
> > > > >
> > > > > If you fix the issue, kindly add following tag
> > > > > | Reported-by: kernel test robot <[email protected]>
> > > > > | Link: https://lore.kernel.org/oe-lkp/[email protected]
> > > >
> > > > Liam brought this to my attention on IRC, and it looks like the root
> > > > cause is that the rcuscale code does not deal gracefully with grace
> > > > periods that are in much excess of a second in duration.
> > > >
> > > > Now, it might well be worth looking into why the grace periods were taking
> > > > that long, but if you were running Maple Tree stress tests concurrently
> > > > with rcuscale, this might well be expected behavior.
> > > >
> > >
> > > This could be simply cpu starvation causing no foward progress in your
> > > tests with the number of concurrent running tests and "-smp 2".
> > >
> > > It's also worth noting that building in the rcu test module makes the
> > > machine turn off once the test is complete. This can be seen in your
> > > console message:
> > > [ 13.254240][ T1] rcu-scale:--- Start of test: nreaders=2 nwriters=2 verbose=1 shutdown=1
> > >
> > > so your machine may not have finished running through the array of tests
> > > you have specified to build in - which is a lot. I'm not sure if this
> > > is the best approach considering the load that produces on the system
> > > and how difficult it is (was) to figure out which test is causing a
> > > stall, or other issue.
> >
> > Agreed, both rcuscale and refscale when built in turn the machine off at
> > the end of the test. For providing background stress for some other test
> > (in this case Maple Tree tests), rcutorture, locktorture, or scftorture
> > might be better choices.
>
> Thanks for looking into this. This is a boot test on a randconfig
> kernel, and yes, it happend to select CONFIG_RCU_SCALE_TEST=y together
> with CONFIG_TEST_MAPLE_TREE=y, leading to the situation in this case.
>
> I've tested the patch on same config, it does clear up the "task
> blocked" log, though it still waits a long time at this step. The test
> result is as follows:
>
> [ 18.397784][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
> [ 18.398646][ T1]
> [ 18.398646][ T1] TEST STARTING
> [ 18.398646][ T1]
> [ 1266.450656][ T1] maple_tree: 12610686 of 12610686 tests passed
> [ 1266.451749][ T1] initcall maple_tree_seed+0x0/0x15d0 returned 0 after 1248053116 usecs
> ...
>
> =========================================================================================
> compiler/kconfig/rootfs/sleep/tbox_group/testcase:
> clang-14/x86_64-randconfig-a006-20230116/yocto-x86_64-minimal-20190520.cgz/300/vm-snb/boot
>
> commit:
> 120b116208a08 ("maple_tree: reorganize testing to restore module testing")
> 4b1aafbdb208f ("rcuscale: Move shutdown from wait_event() to wait_event_idle()")
>
> 120b116208a08 4b1aafbdb208fb4e10bf7abff1a
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> 5:6 -83% :6 dmesg.INFO:task_blocked_for_more_than#seconds
>
> Tested-by: Yujie Liu <[email protected]>
Thank you very much! I will apply this on my next rebase.
Thanx, Paul
...
> > > > > >
> > > > > > FYI, we noticed INFO:task_blocked_for_more_than#seconds due to commit (built with clang-14):
> > > > > >
> > > > > > commit: 120b116208a0877227fc82e3f0df81e7a3ed4ab1 ("maple_tree: reorganize testing to restore module testing")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > >
> > > > > > in testcase: boot
> > > > > >
> > > > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > > > > >
...
> > > > > > If you fix the issue, kindly add following tag
> > > > > > | Reported-by: kernel test robot <[email protected]>
> > > > > > | Link: https://lore.kernel.org/oe-lkp/[email protected]
> > > > >
> > > > > Liam brought this to my attention on IRC, and it looks like the root
> > > > > cause is that the rcuscale code does not deal gracefully with grace
> > > > > periods that are in much excess of a second in duration.
> > > > >
> > > > > Now, it might well be worth looking into why the grace periods were taking
> > > > > that long, but if you were running Maple Tree stress tests concurrently
> > > > > with rcuscale, this might well be expected behavior.
> > > > >
> > > >
> > > > This could be simply cpu starvation causing no foward progress in your
> > > > tests with the number of concurrent running tests and "-smp 2".
> > > >
> > > > It's also worth noting that building in the rcu test module makes the
> > > > machine turn off once the test is complete. This can be seen in your
> > > > console message:
> > > > [ 13.254240][ T1] rcu-scale:--- Start of test: nreaders=2 nwriters=2 verbose=1 shutdown=1
> > > >
> > > > so your machine may not have finished running through the array of tests
> > > > you have specified to build in - which is a lot. I'm not sure if this
> > > > is the best approach considering the load that produces on the system
> > > > and how difficult it is (was) to figure out which test is causing a
> > > > stall, or other issue.
> > >
> > > Agreed, both rcuscale and refscale when built in turn the machine off at
> > > the end of the test. For providing background stress for some other test
> > > (in this case Maple Tree tests), rcutorture, locktorture, or scftorture
> > > might be better choices.
> >
> > Thanks for looking into this. This is a boot test on a randconfig
> > kernel, and yes, it happend to select CONFIG_RCU_SCALE_TEST=y together
> > with CONFIG_TEST_MAPLE_TREE=y, leading to the situation in this case.
Ah, I see. Thanks for that information, this makes more sense now.
> >
> > I've tested the patch on same config, it does clear up the "task
> > blocked" log, though it still waits a long time at this step. The test
> > result is as follows:
> >
> > [ 18.397784][ T1] calling maple_tree_seed+0x0/0x15d0 @ 1
> > [ 18.398646][ T1]
> > [ 18.398646][ T1] TEST STARTING
> > [ 18.398646][ T1]
> > [ 1266.450656][ T1] maple_tree: 12610686 of 12610686 tests passed
> > [ 1266.451749][ T1] initcall maple_tree_seed+0x0/0x15d0 returned 0 after 1248053116 usecs
> > ...
Thanks. Yes, I have a lot of tests in there that add up to taking a
while. This is the expected output.
...
Thanks,
Liam