When doing CPU un-plug stress test, function smpboot_park_threads() would
get call to park kernel threads (which including ksoftirqd) on that
CPU core, and function wait_task_inactive() would yield for those queued
task(s) by calling schedule_hrtimerout() with mode of HRTIMER_MODE_REL.
stack trace:
...
smpboot_thread_fn
cpuhp_thread_fun
cpuhp_invoke_callback
smpboot_park_threads
smpboot_park_thread: ksoftirqd/1
kthread_park
wait_task_inactive
schedule_hrtimerout
However, when PREEMPT_RT is set, this would cause a pending issue since
schedule_hrtimerout() depend on thread ksoftirqd to complete related
work if it using HRTIMER_MODE_SOFT. So force using HRTIMER_MODE_HARD
in such case.
Suggested-by: Jiafei Pan <[email protected]>
Signed-off-by: Ran Wang <[email protected]>
---
kernel/sched/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 792da55..4cc742a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2054,10 +2054,15 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
ktime_t to = NSEC_PER_SEC / HZ;
set_current_state(TASK_UNINTERRUPTIBLE);
- schedule_hrtimeout(&to, HRTIMER_MODE_REL);
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+ !strncmp(p->comm, "ksoftirqd/", 10))
+ schedule_hrtimeout(&to,
+ HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
+ else
+ schedule_hrtimeout(&to, HRTIMER_MODE_REL);
continue;
}
-
/*
* Ahh, all good. It wasn't running, and it wasn't
* runnable, which means that it will never become
--
2.7.4
Hi,
On Thursday, January 7, 2021 5:19 PM, Ran Wang wrote:
>
> When doing CPU un-plug stress test, function smpboot_park_threads() would get call to park kernel threads (which including ksoftirqd) on
> that CPU core, and function wait_task_inactive() would yield for those queued
> task(s) by calling schedule_hrtimerout() with mode of HRTIMER_MODE_REL.
>
> stack trace:
> ...
> smpboot_thread_fn
> cpuhp_thread_fun
> cpuhp_invoke_callback
> smpboot_park_threads
> smpboot_park_thread: ksoftirqd/1
> kthread_park
> wait_task_inactive
> schedule_hrtimerout
>
> However, when PREEMPT_RT is set, this would cause a pending issue since
> schedule_hrtimerout() depend on thread ksoftirqd to complete related work if it using HRTIMER_MODE_SOFT. So force using
> HRTIMER_MODE_HARD in such case.
This issue was observed on LX2160ARDB (arm64, 16 A72 cores) when selecting PREEMPT_RT,
non-RT kernel works fine.And I could verify that fix on both linux-5.6.y-rt and linux-5.4.y-rt.
But for linux-5.9.y-rt and linux-5.10.y-rt, looks there are other issues which blocking
verification currently. Below is the steps for issue reproducing:
1. Kernel menuconfig:
CONFIG_QORIQ_CPUFREQ=y
CONFIG_HAVE_PREEMPT_LAZY=y
CONFIG_PREEMPT_LAZY=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_RT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
2. Shell commands (Issue would happen within roughly 400 rounds of below loop)
echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu12/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu13/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu14/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu15/cpufreq/scaling_governor
count=1
while [ $? -eq 0 ]
do
echo "$count th test"
sleep 3
let "count=count+1"
echo 0 > /sys/devices/system/cpu/cpu0/online
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online
echo 0 > /sys/devices/system/cpu/cpu4/online
echo 0 > /sys/devices/system/cpu/cpu5/online
echo 0 > /sys/devices/system/cpu/cpu6/online
echo 0 > /sys/devices/system/cpu/cpu7/online
echo 0 > /sys/devices/system/cpu/cpu8/online
echo 0 > /sys/devices/system/cpu/cpu9/online
echo 0 > /sys/devices/system/cpu/cpu10/online
echo 0 > /sys/devices/system/cpu/cpu11/online
echo 0 > /sys/devices/system/cpu/cpu12/online
echo 0 > /sys/devices/system/cpu/cpu13/online
echo 0 > /sys/devices/system/cpu/cpu14/online
echo 1 > /sys/devices/system/cpu/cpu0/online
echo 1 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu3/online
echo 1 > /sys/devices/system/cpu/cpu4/online
echo 1 > /sys/devices/system/cpu/cpu5/online
echo 1 > /sys/devices/system/cpu/cpu6/online
echo 1 > /sys/devices/system/cpu/cpu7/online
echo 1 > /sys/devices/system/cpu/cpu8/online
echo 1 > /sys/devices/system/cpu/cpu9/online
echo 1 > /sys/devices/system/cpu/cpu10/online
echo 1 > /sys/devices/system/cpu/cpu11/online
echo 1 > /sys/devices/system/cpu/cpu12/online
echo 1 > /sys/devices/system/cpu/cpu13/online
echo 1 > /sys/devices/system/cpu/cpu14/online
done
To be honest, I am not sure how non-RT kernel could avoid this issue. Could anybody give some input/suggestion on this?
Thank you.
Regards,
Ran
> Suggested-by: Jiafei Pan <[email protected]>
> Signed-off-by: Ran Wang <[email protected]>
> ---
> kernel/sched/core.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 792da55..4cc742a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2054,10 +2054,15 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
> ktime_t to = NSEC_PER_SEC / HZ;
>
> set_current_state(TASK_UNINTERRUPTIBLE);
> - schedule_hrtimeout(&to, HRTIMER_MODE_REL);
> +
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> + !strncmp(p->comm, "ksoftirqd/", 10))
> + schedule_hrtimeout(&to,
> + HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
> + else
> + schedule_hrtimeout(&to, HRTIMER_MODE_REL);
> continue;
> }
> -
> /*
> * Ahh, all good. It wasn't running, and it wasn't
> * runnable, which means that it will never become
> --
> 2.7.4
On Thu, Jan 07, 2021 at 05:18:41PM +0800, Ran Wang wrote:
> +
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> + !strncmp(p->comm, "ksoftirqd/", 10))
> + schedule_hrtimeout(&to,
> + HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
> + else
> + schedule_hrtimeout(&to, HRTIMER_MODE_REL);
This is horrific, why did you not self-censor and spare me the mental
anguish of having to formulate a CoC compliant response?
It also violates coding style, but given the total lack of any sense,
that seems like a minor detail.
Why can't we use HRTIMER_MODE_HARD unconditionally?
On 2021-01-07 11:45:39 [+0100], Peter Zijlstra wrote:
> On Thu, Jan 07, 2021 at 05:18:41PM +0800, Ran Wang wrote:
> > +
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> > + !strncmp(p->comm, "ksoftirqd/", 10))
> > + schedule_hrtimeout(&to,
> > + HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
> > + else
> > + schedule_hrtimeout(&to, HRTIMER_MODE_REL);
>
> This is horrific, why did you not self-censor and spare me the mental
> anguish of having to formulate a CoC compliant response?
>
> It also violates coding style, but given the total lack of any sense,
> that seems like a minor detail.
>
> Why can't we use HRTIMER_MODE_HARD unconditionally?
I had a similar patch in -RT and dropped it in v5.10-rc7-rt16.
It was added because RT couldn't boot since creating the boot-threads
didn't work before the ksoftirqd was up. This was fixed by commit
26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")
and live was good again.
tglx (also) suggested to add HRTIMER_MODE_HARD unconditionally (it
looked at SYSTEM_STATE back then) and I was only worried some abuse via
userland.
This sleep can be triggered by ptrace/strace() and with brief testing I
can trigger the sleep there but I don't get it anywhere near where I
would notice it with cyclictest.
Sebastian
Hi Sebastian, Peter
Thursday, January 7, 2021 11:29 PM, Sebastian Siewior wrote:
>
> On 2021-01-07 11:45:39 [+0100], Peter Zijlstra wrote:
> > On Thu, Jan 07, 2021 at 05:18:41PM +0800, Ran Wang wrote:
> > > +
> > > + if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> > > + !strncmp(p->comm, "ksoftirqd/", 10))
> > > + schedule_hrtimeout(&to,
> > > + HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
> > > + else
> > > + schedule_hrtimeout(&to, HRTIMER_MODE_REL);
> >
> > This is horrific, why did you not self-censor and spare me the mental
> > anguish of having to formulate a CoC compliant response?
> >
> > It also violates coding style, but given the total lack of any sense,
> > that seems like a minor detail.
> >
> > Why can't we use HRTIMER_MODE_HARD unconditionally?
>
> I had a similar patch in -RT and dropped it in v5.10-rc7-rt16.
> It was added because RT couldn't boot since creating the boot-threads didn't work before the ksoftirqd was up. This was fixed by commit
> 26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")
I tried applying above commit to linux-5.6.y-rt, it could resolve my problem on LX2160ARDB, THANKS!
> and live was good again.
> tglx (also) suggested to add HRTIMER_MODE_HARD unconditionally (it looked at SYSTEM_STATE back then) and I was only worried some
> abuse via userland.
Got it.
Regards,
Ran
> This sleep can be triggered by ptrace/strace() and with brief testing I can trigger the sleep there but I don't get it anywhere near where I
> would notice it with cyclictest.
>
> Sebastian
On 2021-01-08 08:45:14 [+0000], Ran Wang wrote:
> Hi Sebastian, Peter
Hi,
> > I had a similar patch in -RT and dropped it in v5.10-rc7-rt16.
> > It was added because RT couldn't boot since creating the boot-threads didn't work before the ksoftirqd was up. This was fixed by commit
> > 26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")
>
> I tried applying above commit to linux-5.6.y-rt, it could resolve my problem on LX2160ARDB, THANKS!
so in other words all this could have been avoided by using a supported
or maintained RT series. The v5.4 series has this patch, v5.6 isn't
maintained anymore so it is likely that there is more missing.
Sebastian
Hi Sebastian,
On Friday, January 8, 2021 5:05 PM, Sebastian Siewior wrote:
>
> On 2021-01-08 08:45:14 [+0000], Ran Wang wrote:
> > Hi Sebastian, Peter
> Hi,
>
> > > I had a similar patch in -RT and dropped it in v5.10-rc7-rt16.
> > > It was added because RT couldn't boot since creating the boot-threads didn't work before the ksoftirqd was up. This was fixed by commit
> > > 26c7295be0c5e ("kthread: Do not preempt current task if it is
> > > going to call schedule()")
> >
> > I tried applying above commit to linux-5.6.y-rt, it could resolve my problem on LX2160ARDB, THANKS!
>
> so in other words all this could have been avoided by using a supported or maintained RT series. The v5.4 series has this patch, v5.6 isn't
> maintained anymore so it is likely that there is more missing.
Thanks for let me know this.
The reason I trying linux-5.6-rt is that I have encountered other more serious issues on later RT version (even with v5.10-rc7-rt16), one of them
is CPU hot plug got stuck in irq_work_sync() which called by sugov_stop(), failure happen at 1st loop stress test every time.
I will try to collect more information and create another mail thread later.
Thanks & Regards,
Ran
> Sebastian
On Thu, Jan 07, 2021 at 04:28:43PM +0100, Sebastian Siewior wrote:
> This sleep can be triggered by ptrace/strace() and with brief testing I
> can trigger the sleep there but I don't get it anywhere near where I
> would notice it with cyclictest.
It's a single task wakeup (the caller), doing that from hardirq context
really should not be a problem, we do lots of that in RT already.
On 2021-01-08 10:32:36 [+0100], Peter Zijlstra wrote:
> It's a single task wakeup (the caller), doing that from hardirq context
> really should not be a problem, we do lots of that in RT already.
I'm not worry about that single wakeup but about an artificial case
where you manage to accumulate multiple single wake ups in a short
time frame.
Sebastian