Hello,
When using the schedutil governor together with the softlockup detector
all CPUs go to their maximum frequency on a regular basis. This seems
to be because the watchdog creates an RT thread on each CPU and this
causes regular kicks with:

cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);

The schedutil governor responds to this by immediately setting the
maximum cpu frequency, which is very undesirable.
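For reference, schedutil's reaction to these kicks looks roughly like
this (simplified from kernel/sched/cpufreq_schedutil.c around v4.15,
details trimmed):

    static void sugov_update_single(struct update_util_data *hook, u64 time,
                                    unsigned int flags)
    {
            struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu,
                                                    update_util);
            struct sugov_policy *sg_policy = sg_cpu->sg_policy;
            struct cpufreq_policy *policy = sg_policy->policy;
            unsigned long util, max;
            unsigned int next_f;

            if (!sugov_should_update_freq(sg_policy, time))
                    return;

            if (flags & SCHED_CPUFREQ_RT_DL) {
                    /* Any RT/DL kick requests the maximum frequency. */
                    next_f = policy->cpuinfo.max_freq;
            } else {
                    sugov_get_util(&util, &max, sg_cpu->cpu);
                    next_f = get_next_freq(sg_policy, util, max);
            }
            sugov_update_commit(sg_policy, time, next_f);
    }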
The issue can be fixed by this patch from android:
https://patchwork.kernel.org/patch/9301909/
The patch stalled in a long discussion about how it's difficult for
cpufreq to deal with RT and how some RT users might just disable
cpufreq. It is indeed hard but if the system experiences regular power
kicks from a common debug feature they will end up disabling schedutil
instead. No other governors behave this way; perhaps the current
behavior should be considered a bug in schedutil.
That patch now has conflicts with latest upstream. Perhaps a modified
variant should be reconsidered for inclusion, or is there some other
solution pending?
Alternatively, the watchdog threads could somehow be marked so that
they never cause cpufreq increases.
--
Regards,
Leonard
On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez <[email protected]> wrote:
> Hello,
>
> When using the schedutil governor together with the softlockup detector
> all CPUs go to their maximum frequency on a regular basis. This seems
> to be because the watchdog creates an RT thread on each CPU and this
> causes regular kicks with:
>
> cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
>
> The schedutil governor responds to this by immediately setting the
> maximum cpu frequency, which is very undesirable.
>
> The issue can be fixed by this patch from android:
> https://patchwork.kernel.org/patch/9301909/
>
> The patch stalled in a long discussion about how it's difficult for
> cpufreq to deal with RT and how some RT users might just disable
> cpufreq. It is indeed hard but if the system experiences regular power
> kicks from a common debug feature they will end up disabling schedutil
> instead.
They are basically free to use the other governors instead if they prefer them.
> No other governors behave this way,
Because they work differently overall.
> perhaps the current behavior should be considered a bug in schedutil.
>
> That patch now has conflicts with latest upstream. Perhaps a modified
> variant should be reconsidered for inclusion, or is there some other
> solution pending?
Patrick has a series of patches dealing with this problem area AFAICS,
but we are currently integrating material from Juri related to
deadline tasks.
> Alternatively, the watchdog threads could somehow be marked so that
> they never cause cpufreq increases.
Or maybe just replaced with something that is not a thread?
RT really doesn't leave much choice, because it basically means "I'm
important and I have a deadline, but I'm not telling you how important
I am and what the deadline is".
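For comparison, SCHED_DEADLINE does state those things. A minimal
userspace sketch (numbers made up for illustration; sched_setattr()
has no glibc wrapper, so it goes through syscall()):

    /* A FIFO task only states a priority: */
    struct sched_param sp = { .sched_priority = 50 };
    sched_setscheduler(0, SCHED_FIFO, &sp);

    /* A deadline task states how much CPU time it needs and by when: */
    struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  = 500ULL * 1000,        /*  500 us of budget */
            .sched_deadline = 100ULL * 1000 * 1000, /* ...within 100 ms  */
            .sched_period   = 100ULL * 1000 * 1000, /* ...every 100 ms   */
    };
    syscall(__NR_sched_setattr, 0, &attr, 0);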
Thanks,
Rafael
On 2018.01.05 12:38 Leonard Crestez wrote:
> When using the schedutil governor together with the softlockup detector
> all CPUs go to their maximum frequency on a regular basis. This seems
> to be because the watchdog creates an RT thread on each CPU and this
> causes regular kicks with:
>
> cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
>
> The schedutil governor responds to this by immediately setting the
> maximum cpu frequency, which is very undesirable.
>
> The issue can be fixed by this patch from android:
> https://patchwork.kernel.org/patch/9301909/
>
> The patch stalled in a long discussion about how it's difficult for
> cpufreq to deal with RT and how some RT users might just disable
> cpufreq. It is indeed hard but if the system experiences regular power
> kicks from a common debug feature they will end up disabling schedutil
> instead. No other governors behave this way; perhaps the current
> behavior should be considered a bug in schedutil.
>
> That patch now has conflicts with latest upstream. Perhaps a modified
> variant should be reconsidered for inclusion, or is there some other
> solution pending?
>
> Alternatively, the watchdog threads could somehow be marked so that
> they never cause cpufreq increases.
Your e-mail was very timely for me. In mid-December, while testing the
minimum sampling rate change commit, I also did a reference test using
the intel-cpufreq driver and the schedutil governor. Under a range of
conditions, 79% more package power was consumed by schedutil compared
to: ondemand with a 2 ms sampling rate; ondemand with a 20 ms sampling
rate; and the intel_pstate driver.
I did not know about the thread and patch you referred to. Thanks.
Additionally, on otherwise mostly idle CPUs, I sometimes observe that
after the max pstate is set, it is left there with no update at all for
over a hundred seconds. Examples:
CPU3: 165 seconds since change to max pstate; Load 0.07%; new pstate = minimum
CPU5: 121 seconds since change to max pstate; Load 0.47%; new pstate = mid range
Reference (for me only): trace_stuff/results/pass24 samples 59797 and 59803
... Doug
On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez <[email protected]> wrote:
> > Hello,
> >
> > When using the schedutil governor together with the softlockup detector
> > all CPUs go to their maximum frequency on a regular basis. This seems
> > to be because the watchdog creates an RT thread on each CPU and this
> > causes regular kicks with:
> >
> > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> >
> > The schedutil governor responds to this by immediately setting the
> > maximum cpu frequency, which is very undesirable.
> >
> > The issue can be fixed by this patch from android:
> > https://patchwork.kernel.org/patch/9301909/
> >
> > The patch stalled in a long discussion about how it's difficult for
> > cpufreq to deal with RT and how some RT users might just disable
> > cpufreq. It is indeed hard but if the system experiences regular power
> > kicks from a common debug feature they will end up disabling schedutil
> > instead.
>
> They are basically free to use the other governors instead if they prefer them.
>
> > No other governors behave this way,
>
> Because they work differently overall.
>
> > perhaps the current behavior should be considered a bug in schedutil.
> >
> > That patch now has conflicts with latest upstream. Perhaps a modified
> > variant should be reconsidered for inclusion, or is there some other
> > solution pending?
>
> Patrick has a series of patches dealing with this problem area AFAICS,
> but we are currently integrating material from Juri related to
> deadline tasks.
I am not sure if Patrick's patches would solve this problem at all as
we still go to max for RT and the RT task is created from the
softlockup detector somehow.
One way to fix that can be to use DL for the softlockup detector as
after Juri's patches we don't always go to max for DL.
On the other side, AFAIR, Peter was very clear during the previous LPC
that it doesn't make sense to use rt-avg as the above patch suggests.
--
viresh
On Monday, January 8, 2018 5:01:21 AM CET Viresh Kumar wrote:
> On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez <[email protected]> wrote:
> > > Hello,
> > >
> > > When using the schedutil governor together with the softlockup detector
> > > all CPUs go to their maximum frequency on a regular basis. This seems
> > > to be because the watchdog creates an RT thread on each CPU and this
> > > causes regular kicks with:
> > >
> > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> > >
> > > The schedutil governor responds to this by immediately setting the
> > > maximum cpu frequency, which is very undesirable.
> > >
> > > The issue can be fixed by this patch from android:
> > > https://patchwork.kernel.org/patch/9301909/
> > >
> > > The patch stalled in a long discussion about how it's difficult for
> > > cpufreq to deal with RT and how some RT users might just disable
> > > cpufreq. It is indeed hard but if the system experiences regular power
> > > kicks from a common debug feature they will end up disabling schedutil
> > > instead.
> >
> > They are basically free to use the other governors instead if they prefer them.
> >
> > > No other governors behave this way,
> >
> > Because they work differently overall.
> >
> > > perhaps the current behavior should be considered a bug in schedutil.
> > >
> > > That patch now has conflicts with latest upstream. Perhaps a modified
> > > variant should be reconsidered for inclusion, or is there some other
> > > solution pending?
> >
> > Patrick has a series of patches dealing with this problem area AFAICS,
> > but we are currently integrating material from Juri related to
> > deadline tasks.
>
> I am not sure if Patrick's patches would solve this problem at all as
> we still go to max for RT and the RT task is created from the
> softlockup detector somehow.
>
> One way to fix that can be to use DL for the softlockup detector as
> after Juri's patches we don't always go to max for DL.
>
> On the other side, AFAIR, Peter was very clear during the previous LPC
> that it doesn't make sense to use rt-avg as the above patch suggests.
Right.
Why does the softlockup watchdog use RT tasks in the first place?
Thanks,
Rafael
On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
> On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez <[email protected]> wrote:
> > > When using the schedutil governor together with the softlockup detector
> > > all CPUs go to their maximum frequency on a regular basis. This seems
> > > to be because the watchdog creates an RT thread on each CPU and this
> > > causes regular kicks with:
> > >
> > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> > >
> > > The schedutil governor responds to this by immediately setting the
> > > maximum cpu frequency, which is very undesirable.
> > >
> > > The issue can be fixed by this patch from android:
> > >
> > > The patch stalled in a long discussion about how it's difficult for
> > > cpufreq to deal with RT and how some RT users might just disable
> > > cpufreq. It is indeed hard but if the system experiences regular power
> > > kicks from a common debug feature they will end up disabling schedutil
> > > instead.
> > Patrick has a series of patches dealing with this problem area AFAICS,
> > but we are currently integrating material from Juri related to
> > deadline tasks.
> I am not sure if Patrick's patches would solve this problem at all as
> we still go to max for RT and the RT task is created from the
> softlockup detector somehow.
I assume you're talking about the series starting with
"[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates"
I checked and they have no effect on this particular issue (not
surprising).
--
Regards,
Leonard
On 08-Jan 15:20, Leonard Crestez wrote:
> On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
> > On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> > > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez <[email protected]> wrote:
>
> > > > When using the schedutil governor together with the softlockup detector
> > > > all CPUs go to their maximum frequency on a regular basis. This seems
> > > > to be because the watchdog creates an RT thread on each CPU and this
> > > > causes regular kicks with:
> > > >
> > > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> > > >
> > > > The schedutil governor responds to this by immediately setting the
> > > > maximum cpu frequency, which is very undesirable.
> > > >
> > > > The issue can be fixed by this patch from android:
> > > >
> > > > The patch stalled in a long discussion about how it's difficult for
> > > > cpufreq to deal with RT and how some RT users might just disable
> > > > cpufreq. It is indeed hard but if the system experiences regular power
> > > > kicks from a common debug feature they will end up disabling schedutil
> > > > instead.
>
> > > Patrick has a series of patches dealing with this problem area AFAICS,
> > > but we are currently integrating material from Juri related to
> > > deadline tasks.
>
> > I am not sure if Patrick's patches would solve this problem at all as
> > we still go to max for RT and the RT task is created from the
> > softlockup detector somehow.
>
> I assume you're talking about the series starting with
> "[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates"
>
> I checked and they have no effect on this particular issue (not
> surprising).
Yeah, that series was addressing the same issue but for one specific
RT thread: the one used by schedutil to change the frequency.
For all other RT threads the intended behavior was still to go
to max... moreover those patches have been superseded by a different
solution which has been recently proposed by Peter:
[email protected]
As Viresh and Rafael suggested, we should eventually consider a
different scheduling class and/or execution context for the watchdog.
Maybe a generalization of Juri's proposed SCHED_FLAG_SUGOV flag for
DL tasks can be useful:
[email protected]
Although that solution is already considered "gross" and thus perhaps
it does not make sense to keep adding special DL tasks.
Another possible alternative to "tag an RT task" as being special is
to use an API similar to the one proposed by the util_clamp RFC:

[email protected]

which would allow defining the maximum utilization that can
be requested by a properly configured RT task.
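For instance, the watchdog setup code could then do something like
this (set_task_util_clamp() is purely hypothetical, sketched after
the util_clamp RFC; no such API exists today):

    /* Declare that this thread never needs more than ~10% of CPU
     * capacity, so schedutil need not jump to fmax on its behalf. */
    set_task_util_clamp(current, /* util_min */ 0,
                        /* util_max */ SCHED_CAPACITY_SCALE / 10);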
--
#include <best/regards.h>
Patrick Bellasi
On Mon, 2018-01-08 at 15:14 +0000, Patrick Bellasi wrote:
> On 08-Jan 15:20, Leonard Crestez wrote:
> > On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
> > > On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> > > > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez wrote:
> > > > >
> > > > > When using the schedutil governor together with the softlockup detector
> > > > > all CPUs go to their maximum frequency on a regular basis. This seems
> > > > > to be because the watchdog creates an RT thread on each CPU and this
> > > > > causes regular kicks with:
> > > > >
> > > > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> > > > >
> > > > > The schedutil governor responds to this by immediately setting the
> > > > > maximum cpu frequency, which is very undesirable.
> > > > >
> > > > > The issue can be fixed by this patch from android:
> > > > >
> > > > > The patch stalled in a long discussion about how it's difficult for
> > > > > cpufreq to deal with RT and how some RT users might just disable
> > > > > cpufreq. It is indeed hard but if the system experiences regular power
> > > > > kicks from a common debug feature they will end up disabling schedutil
> > > > > instead.
> > > > Patrick has a series of patches dealing with this problem area AFAICS,
> > > > but we are currently integrating material from Juri related to
> > > > deadline tasks.
> > >
> > > I am not sure if Patrick's patches would solve this problem at all as
> > > we still go to max for RT and the RT task is created from the
> > > softlockup detector somehow.
> > I assume you're talking about the series starting with
> > "[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates"
> >
> > I checked and they have no effect on this particular issue (not
> > surprising).
> Yeah, that series was addressing the same issue but for one specific
> RT thread: the one used by schedutil to change the frequency.
> For all other RT threads the intended behavior was still to go
> to max... moreover those patches have been superseded by a different
> solution which has been recently proposed by Peter:
>
> [email protected]
>
> As Viresh and Rafael suggested, we should eventually consider a
> different scheduling class and/or execution context for the watchdog.
> Maybe a generalization of Juri's proposed SCHED_FLAG_SUGOV flag for
> DL tasks can be useful:
>
> [email protected]
>
> Although that solution is already considered "gross" and thus perhaps
> it does not make sense to keep adding special DL tasks.
>
> Another possible alternative to "tag an RT task" as being special is
> to use an API similar to the one proposed by the util_clamp RFC:
>
> [email protected]
>
> which would allow defining the maximum utilization that can
> be requested by a properly configured RT task.
Marking the watchdog as somehow "not important for performance" would
probably work, but I guess it will take a while to get a stable solution.
BTW, in the current version it seems the kick happens *after* the RT
task executes. It seems very likely that cpufreq will go back down
before an RT task executes again, so how does it help? Unless most of the
workload is RT. But even in that case aren't you better off with
regular scaling since schedutil will notice utilization is high anyway?
Scaling freq up first would make more sense except such operations can
have very high latencies anyway.
Viresh suggested earlier to move watchdog to DL but apparently per-cpu
threads are not supported. sched_setattr fails on this check:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c#n4167
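For reference, the check in question looks roughly like this
(__sched_setscheduler() around v4.15; quoted from memory):

    /*
     * Don't allow tasks with an affinity mask smaller than
     * the entire root_domain to become SCHED_DEADLINE. We
     * will also fail if there's no bandwidth available.
     */
    if (!cpumask_subset(span, &p->cpus_allowed) ||
        rq->rd->dl_bw.bw == 0) {
            task_rq_unlock(rq, p, &rf);
            return -EPERM;
    }

Since each watchdog thread is pinned to a single CPU, the
cpumask_subset() test always fails for it.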
--
Regards,
Leonard
On Mon, Jan 8, 2018 at 4:51 PM, Leonard Crestez <[email protected]> wrote:
> On Mon, 2018-01-08 at 15:14 +0000, Patrick Bellasi wrote:
>> On 08-Jan 15:20, Leonard Crestez wrote:
>> > On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
>> > > On 05-01-18, 23:18, Rafael J. Wysocki wrote:
>> > > > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez wrote:
>> > > > >
>> > > > > When using the schedutil governor together with the softlockup detector
>> > > > > all CPUs go to their maximum frequency on a regular basis. This seems
>> > > > > to be because the watchdog creates an RT thread on each CPU and this
>> > > > > causes regular kicks with:
>> > > > >
>> > > > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
>> > > > >
>> > > > > The schedutil governor responds to this by immediately setting the
>> > > > > maximum cpu frequency, which is very undesirable.
>> > > > >
>> > > > > The issue can be fixed by this patch from android:
>> > > > >
>> > > > > The patch stalled in a long discussion about how it's difficult for
>> > > > > cpufreq to deal with RT and how some RT users might just disable
>> > > > > cpufreq. It is indeed hard but if the system experiences regular power
>> > > > > kicks from a common debug feature they will end up disabling schedutil
>> > > > > instead.
>
>> > > > Patrick has a series of patches dealing with this problem area AFAICS,
>> > > > but we are currently integrating material from Juri related to
>> > > > deadline tasks.
>> > >
>> > > I am not sure if Patrick's patches would solve this problem at all as
>> > > we still go to max for RT and the RT task is created from the
>> > > softlockup detector somehow.
>
>> > I assume you're talking about the series starting with
>> > "[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates"
>> >
>> > I checked and they have no effect on this particular issue (not
>> > surprising).
>
>> Yeah, that series was addressing the same issue but for one specific
>> RT thread: the one used by schedutil to change the frequency.
>> For all other RT threads the intended behavior was still to go
>> to max... moreover those patches have been superseded by a different
>> solution which has been recently proposed by Peter:
>>
>> [email protected]
>>
>> As Viresh and Rafael suggested, we should eventually consider a
>> different scheduling class and/or execution context for the watchdog.
>> Maybe a generalization of Juri's proposed SCHED_FLAG_SUGOV flag for
>> DL tasks can be useful:
>>
>> [email protected]
>>
>> Although that solution is already considered "gross" and thus perhaps
>> it does not make sense to keep adding special DL tasks.
>>
>> Another possible alternative to "tag an RT task" as being special is
>> to use an API similar to the one proposed by the util_clamp RFC:
>>
>> [email protected]
>>
>> which would allow defining the maximum utilization that can
>> be requested by a properly configured RT task.
>
> Marking the watchdog as somehow "not important for performance" would
> probably work, but I guess it will take a while to get a stable solution.
>
> BTW, in the current version it seems the kick happens *after* the RT
> task executes. It seems very likely that cpufreq will go back down
> before an RT task executes again, so how does it help? Unless most of the
> workload is RT. But even in that case aren't you better off with
> regular scaling since schedutil will notice utilization is high anyway?
>
> Scaling freq up first would make more sense except such operations can
> have very high latencies anyway.
I guess what happens is that it takes time to switch the frequency and
the RT task gives the CPU away before the frequency actually changes.
That is a problem, but we don't know in advance how much time the RT
task is going to run, so it is rather hard to avoid this entirely. It
might be possible to cancel the freq switch if the RT task goes to
sleep while it is still pending, but that's rather tricky
synchronization-wise. It may be worth trying as a proof of concept,
though.
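A rough sketch of the idea (hypothetical, not actual kernel code):
before the sugov kthread commits a frequency that was raised for RT,
re-check whether any RT task is still runnable on the policy's CPUs
and skip the switch otherwise:

    static bool sugov_rt_still_runnable(struct sugov_policy *sg_policy)
    {
            int cpu;

            for_each_cpu(cpu, sg_policy->policy->cpus)
                    if (cpu_rq(cpu)->rt.rt_nr_running)
                            return true;

            return false;
    }

Doing that without racing against a new RT wakeup is the tricky part.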
> Viresh suggested earlier to move watchdog to DL but apparently per-cpu
> threads are not supported. sched_setattr fails on this check:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c#n4167
Actually, how often does the softlockup watchdog run?
On Tue, 2018-01-09 at 02:17 +0100, Rafael J. Wysocki wrote:
> On Mon, Jan 8, 2018 at 4:51 PM, Leonard Crestez wrote:
> > On Mon, 2018-01-08 at 15:14 +0000, Patrick Bellasi wrote:
> > > On 08-Jan 15:20, Leonard Crestez wrote:
> > > > On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
> > > > > On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> > > > > > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez wrote:
> > > > > > > When using the schedutil governor together with the softlockup detector
> > > > > > > all CPUs go to their maximum frequency on a regular basis. This seems
> > > > > > > > to be because the watchdog creates an RT thread on each CPU and this
> > > > > > > causes regular kicks with:
> > > > > > >
> > > > > > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> > > > > > >
> > > > > > > The schedutil governor responds to this by immediately setting the
> > > > > > > > maximum cpu frequency, which is very undesirable.
> > > > > > >
> > > > > > > The issue can be fixed by this patch from android:
> > > > > > >
> > > > > > > The patch stalled in a long discussion about how it's difficult for
> > > > > > > cpufreq to deal with RT and how some RT users might just disable
> > > > > > > cpufreq. It is indeed hard but if the system experiences regular power
> > > > > > > kicks from a common debug feature they will end up disabling schedutil
> > > > > > > instead.
> > > > > > Patrick has a series of patches dealing with this problem area AFAICS,
> > > > > > but we are currently integrating material from Juri related to
> > > > > > deadline tasks.
> > > > > I am not sure if Patrick's patches would solve this problem at all as
> > > > > we still go to max for RT and the RT task is created from the
> > > > > softlockup detector somehow.
> > > > I assume you're talking about the series starting with
> > > > "[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates"
> > > >
> > > > I checked and they have no effect on this particular issue (not
> > > > surprising).
> > >
> > > Yeah, that series was addressing the same issue but for one specific
> > > RT thread: the one used by schedutil to change the frequency.
> > > For all other RT threads the intended behavior was still to go
> > > to max... moreover those patches have been superseded by a different
> > > solution which has been recently proposed by Peter:
> > >
> > > [email protected]
> > >
> > > As Viresh and Rafael suggested, we should eventually consider a
> > > different scheduling class and/or execution context for the watchdog.
> > > Maybe a generalization of Juri's proposed SCHED_FLAG_SUGOV flag for
> > > DL tasks can be useful:
> > >
> > > [email protected]
> > >
> > > Although that solution is already considered "gross" and thus perhaps
> > > it does not make sense to keep adding special DL tasks.
> > >
> > > Another possible alternative to "tag an RT task" as being special is
> > > to use an API similar to the one proposed by the util_clamp RFC:
> > >
> > > [email protected]
> > >
> > > which would allow defining the maximum utilization that can
> > > be requested by a properly configured RT task.
> > Marking the watchdog as somehow "not important for performance" would
> > probably work, but I guess it will take a while to get a stable solution.
> >
> > BTW, in the current version it seems the kick happens *after* the RT
> > task executes. It seems very likely that cpufreq will go back down
> > before an RT task executes again, so how does it help? Unless most of the
> > workload is RT. But even in that case aren't you better off with
> > regular scaling since schedutil will notice utilization is high anyway?
> >
> > Scaling freq up first would make more sense except such operations can
> > have very high latencies anyway.
> I guess what happens is that it takes time to switch the frequency and
> the RT task gives the CPU away before the frequency actually changes.
What I am saying is that, as far as I can tell, cpufreq_update_util
is called when the task has already executed and is being switched out.
My tests are not very elaborate but based on some ftracing it seems to
me that the current behavior is for cpufreq spikes to always trail RT
activity. Like this:
<idle>-0 [002] 496.510138: sched_switch: swapper/2:0 [120] S ==> watchdog/2:20 [0]
watchdog/2-20 [002] 496.510156: bprint: watchdog: IN watchdog(2)
watchdog/2-20 [002] 496.510364: bprint: watchdog: OU watchdog(2)
watchdog/2-20 [002] 496.510377: bprint: update_curr_rt: watchdog kick RT! cpu=2 comm=watchdog/2
watchdog/2-20 [002] 496.510383: kernel_stack: <stack trace>
=> deactivate_task (c0157d94)
=> __schedule (c0b13570)
=> schedule (c0b13c8c)
=> smpboot_thread_fn (c015211c)
=> kthread (c014db3c)
=> ret_from_fork (c0108214)
watchdog/2-20 [002] 496.510410: sched_switch: watchdog/2:20 [0] D ==> swapper/2:0 [120]
<idle>-0 [001] 496.510488: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.510634: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.510817: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.510867: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.511036: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.511079: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.511243: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.511282: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.511445: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.511669: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.511859: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.511906: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.512073: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.512114: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.512269: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.512312: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.512448: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.512662: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
<idle>-0 [001] 496.513185: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
sugov:0-580 [001] 496.513239: cpu_frequency: state=996000 cpu_id=0
sugov:0-580 [001] 496.513243: cpu_frequency: state=996000 cpu_id=1
sugov:0-580 [001] 496.513245: cpu_frequency: state=996000 cpu_id=2
sugov:0-580 [001] 496.513247: cpu_frequency: state=996000 cpu_id=3
I guess it would still help if an RT task starts, blocks and then
immediately resumes?
> > Viresh suggested earlier to move watchdog to DL but apparently per-cpu
> > threads are not supported. sched_setattr fails on this check:
> >
> > kernel/sched/core.c#n4167
> Actually, how often does the softlockup watchdog run?
Every 4 seconds (really it's /proc/sys/kernel/watchdog_thresh * 2 / 5
and watchdog_thresh defaults to 10). There is a per-cpu hrtimer which
wakes the per-cpu thread in order to check that tasks can still
execute, this works very well against bugs like infinite loops in
softirq mode. The timers are synchronized initially but can get
staggered (for example by hotplug).
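For reference, the period is computed in kernel/watchdog.c roughly
like this:

    static int get_softlockup_thresh(void)
    {
            return watchdog_thresh * 2;
    }

    static void set_sample_period(void)
    {
            /*
             * Divide by 5 so the hrtimer gets several chances to
             * update the timestamp before the softlockup threshold
             * expires.
             */
            sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
    }

With the default watchdog_thresh of 10 that gives 10 * 2 / 5 = 4 seconds.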
My guess is that it's only marked RT so that it executes ahead of other
threads and the watchdog doesn't trigger simply when there are lots of
userspace tasks.
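Indeed, the watchdog thread makes itself top-priority FIFO when it is
enabled (kernel/watchdog.c, roughly, around v4.15):

    static void watchdog_set_prio(unsigned int policy, unsigned int prio)
    {
            struct sched_param param = { .sched_priority = prio };

            sched_setscheduler(current, policy, &param);
    }

    /* called from watchdog_enable(): */
    watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);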
--
Regards,
Leonard
Am Dienstag, den 09.01.2018, 16:43 +0200 schrieb Leonard Crestez:
> On Tue, 2018-01-09 at 02:17 +0100, Rafael J. Wysocki wrote:
> > On Mon, Jan 8, 2018 at 4:51 PM, Leonard Crestez wrote:
> > > On Mon, 2018-01-08 at 15:14 +0000, Patrick Bellasi wrote:
> > > > On 08-Jan 15:20, Leonard Crestez wrote:
> > > > > On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
> > > > > > On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> > > > > > > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez wrote:
> > > > > > > > When using the schedutil governor together with the softlockup detector
> > > > > > > > all CPUs go to their maximum frequency on a regular basis. This seems
> > > > > > > > > to be because the watchdog creates an RT thread on each CPU and this
> > > > > > > > causes regular kicks with:
> > > > > > > >
> > > > > > > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> > > > > > > >
> > > > > > > > The schedutil governor responds to this by immediately setting the
> > > > > > > > > maximum cpu frequency, which is very undesirable.
> > > > > > > >
> > > > > > > > The issue can be fixed by this patch from android:
> > > > > > > >
> > > > > > > > The patch stalled in a long discussion about how it's difficult for
> > > > > > > > cpufreq to deal with RT and how some RT users might just disable
> > > > > > > > cpufreq. It is indeed hard but if the system experiences regular power
> > > > > > > > kicks from a common debug feature they will end up disabling schedutil
> > > > > > > > instead.
> > > > > > > Patrick has a series of patches dealing with this problem area AFAICS,
> > > > > > > but we are currently integrating material from Juri related to
> > > > > > > deadline tasks.
> > > > > > I am not sure if Patrick's patches would solve this problem at all as
> > > > > > we still go to max for RT and the RT task is created from the
> > > > > > softlockup detector somehow.
> > > > > I assume you're talking about the series starting with
> > > > > "[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates"
> > > > >
> > > > > I checked and they have no effect on this particular issue (not
> > > > > surprising).
> > > >
> > > > Yeah, that series was addressing the same issue but for one specific
> > > > RT thread: the one used by schedutil to change the frequency.
> > > > For all other RT threads the intended behavior was still to go
> > > > to max... moreover those patches have been superseded by a different
> > > > solution which has been recently proposed by Peter:
> > > >
> > > > [email protected]
> > > >
> > > > As Viresh and Rafael suggested, we should eventually consider a
> > > > different scheduling class and/or execution context for the watchdog.
> > > > Maybe a generalization of Juri's proposed SCHED_FLAG_SUGOV flag for
> > > > DL tasks can be useful:
> > > >
> > > > [email protected]
> > > >
> > > > Although that solution is already considered "gross" and thus perhaps
> > > > it does not make sense to keep adding special DL tasks.
> > > >
> > > > Another possible alternative to "tag an RT task" as being special is
> > > > to use an API similar to the one proposed by the util_clamp RFC:
> > > >
> > > > [email protected]
> > > >
> > > > which would allow defining the maximum utilization that can
> > > > be requested by a properly configured RT task.
> > > Marking the watchdog as somehow "not important for performance" would
> > > probably work, but I guess it will take a while to get a stable solution.
> > >
> > > BTW, in the current version it seems the kick happens *after* the RT
> > > task executes. It seems very likely that cpufreq will go back down
> > > before an RT task executes again, so how does it help? Unless most of the
> > > workload is RT. But even in that case aren't you better off with
> > > regular scaling since schedutil will notice utilization is high anyway?
> > >
> > > Scaling freq up first would make more sense except such operations can
> > > have very high latencies anyway.
> > I guess what happens is that it takes time to switch the frequency and
> > the RT task gives the CPU away before the frequency actually changes.
>
> What I am saying is that, as far as I can tell, cpufreq_update_util
> is called when the task has already executed and is being switched out.
> My tests are not very elaborate but based on some ftracing it seems to
> me that the current behavior is for cpufreq spikes to always trail RT
> activity. Like this:
On i.MX switching the CPU frequency involves both a regulator and PLL
reconfiguration. Both actions have really long latencies (giving the
CPU away to other processes while waiting to finish), so the frequency
switch only happens after the sort-lived watchdog RT process has
already completed its work.
This behavior is probably less bad for regular RT tasks that actually
use a bit more CPU when running, but it's completely nonsensical for
the lightweight watchdog thread.
Regards,
Lucas
On Tue, Jan 9, 2018 at 3:43 PM, Leonard Crestez <[email protected]> wrote:
> On Tue, 2018-01-09 at 02:17 +0100, Rafael J. Wysocki wrote:
>> On Mon, Jan 8, 2018 at 4:51 PM, Leonard Crestez wrote:
>> > On Mon, 2018-01-08 at 15:14 +0000, Patrick Bellasi wrote:
>> > > On 08-Jan 15:20, Leonard Crestez wrote:
>> > > > On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
>> > > > > On 05-01-18, 23:18, Rafael J. Wysocki wrote:
>> > > > > > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez wrote:
>
>> > > > > > > When using the schedutil governor together with the softlockup detector
>> > > > > > > all CPUs go to their maximum frequency on a regular basis. This seems
>> > > > > > > to be because the watchdog creates an RT thread on each CPU and this
>> > > > > > > causes regular kicks with:
>> > > > > > >
>> > > > > > > cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
>> > > > > > >
>> > > > > > > The schedutil governor responds to this by immediately setting the
>> > > > > > > maximum cpu frequency, which is very undesirable.
>> > > > > > >
>> > > > > > > The issue can be fixed by this patch from android:
>> > > > > > >
>> > > > > > > The patch stalled in a long discussion about how it's difficult for
>> > > > > > > cpufreq to deal with RT and how some RT users might just disable
>> > > > > > > cpufreq. It is indeed hard but if the system experiences regular power
>> > > > > > > kicks from a common debug feature they will end up disabling schedutil
>> > > > > > > instead.
>
>> > > > > > Patrick has a series of patches dealing with this problem area AFAICS,
>> > > > > > but we are currently integrating material from Juri related to
>> > > > > > deadline tasks.
>
>> > > > > I am not sure if Patrick's patches would solve this problem at all as
>> > > > > we still go to max for RT and the RT task is created from the
>> > > > > softlockup detector somehow.
>
>> > > > I assume you're talking about the series starting with
>> > > > "[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates"
>> > > >
>> > > > I checked and they have no effect on this particular issue (not
>> > > > surprising).
>> > >
>> > > Yeah, that series was addressing the same issue but for one specific
>> > > RT thread: the one used by schedutil to change the frequency.
>> > > For all other RT threads the intended behavior was still to go
>> > > to max... moreover those patches have been superseded by a different
>> > > solution which has been recently proposed by Peter:
>> > >
>> > > [email protected]
>> > >
>> > > As Viresh and Rafael suggested, we should eventually consider a
>> > > different scheduling class and/or execution context for the watchdog.
>> > > Maybe a generalization of Juri's proposed SCHED_FLAG_SUGOV flag for
>> > > DL tasks can be useful:
>> > >
>> > > [email protected]
>> > >
>> > > Although that solution is already considered "gross" and thus perhaps
>> > > it does not make sense to keep adding special DL tasks.
>> > >
>> > > Another possible alternative to "tag an RT task" as being special is
>> > > to use an API similar to the one proposed by the util_clamp RFC:
>> > >
>> > > [email protected]
>> > >
>> > > which would allow defining the maximum utilization that can
>> > > be requested by a properly configured RT task.
>
>> > Marking the watchdog as somehow "not important for performance" would
>> > probably work, but I guess it will take a while to get a stable solution.
>> >
>> > BTW, in the current version it seems the kick happens *after* the RT
>> > task executes. It seems very likely that cpufreq will go back down
>> > before an RT task executes again, so how does it help? Unless most of the
>> > workload is RT. But even in that case aren't you better off with
>> > regular scaling since schedutil will notice utilization is high anyway?
>> >
>> > Scaling freq up first would make more sense except such operations can
>> > have very high latencies anyway.
>
>> I guess what happens is that it takes time to switch the frequency and
>> the RT task gives the CPU away before the frequency actually changes.
>
> What I am saying is that, as far as I can tell, cpufreq_update_util
> is called when the task has already executed and is being switched out.
That would be a bug.
> My tests are not very elaborate but based on some ftracing it seems to
> me that the current behavior is for cpufreq spikes to always trail RT
> activity. Like this:
The cpufreq spikes need not be correlated with cpufreq_update_util()
execution time except that they occur more or less after
cpufreq_update_util() has run.
>
> <idle>-0 [002] 496.510138: sched_switch: swapper/2:0 [120] S ==> watchdog/2:20 [0]
> watchdog/2-20 [002] 496.510156: bprint: watchdog: IN watchdog(2)
> watchdog/2-20 [002] 496.510364: bprint: watchdog: OU watchdog(2)
> watchdog/2-20 [002] 496.510377: bprint: update_curr_rt: watchdog kick RT! cpu=2 comm=watchdog/2
> watchdog/2-20 [002] 496.510383: kernel_stack: <stack trace>
> => deactivate_task (c0157d94)
> => __schedule (c0b13570)
> => schedule (c0b13c8c)
> => smpboot_thread_fn (c015211c)
> => kthread (c014db3c)
> => ret_from_fork (c0108214)
> watchdog/2-20 [002] 496.510410: sched_switch: watchdog/2:20 [0] D ==> swapper/2:0 [120]
> <idle>-0 [001] 496.510488: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.510634: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.510817: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.510867: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.511036: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.511079: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.511243: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.511282: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.511445: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.511669: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.511859: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.511906: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.512073: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.512114: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.512269: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.512312: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.512448: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.512662: sched_switch: sugov:0:580 [49] T ==> swapper/1:0 [120]
> <idle>-0 [001] 496.513185: sched_switch: swapper/1:0 [120] S ==> sugov:0:580 [49]
> sugov:0-580 [001] 496.513239: cpu_frequency: state=996000 cpu_id=0
> sugov:0-580 [001] 496.513243: cpu_frequency: state=996000 cpu_id=1
> sugov:0-580 [001] 496.513245: cpu_frequency: state=996000 cpu_id=2
> sugov:0-580 [001] 496.513247: cpu_frequency: state=996000 cpu_id=3
sugov is schedutil's kthread, right? It will always run with a
delay with respect to the cpufreq_update_util() invocation that
triggers it.
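Roughly, from kernel/sched/cpufreq_schedutil.c (simplified): when the
driver cannot switch frequencies in scheduler context, the update is
only queued there and the kthread applies it later:

    static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
                                    unsigned int next_freq)
    {
            struct cpufreq_policy *policy = sg_policy->policy;

            if (sg_policy->next_freq == next_freq)
                    return;

            sg_policy->next_freq = next_freq;
            sg_policy->last_freq_update_time = time;

            if (policy->fast_switch_enabled) {
                    cpufreq_driver_fast_switch(policy, next_freq);
            } else {
                    sg_policy->work_in_progress = true;
                    irq_work_queue(&sg_policy->irq_work); /* wakes sugov:X */
            }
    }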
>
> I guess it would still help if an RT task starts, blocks and then
> immediately resumes?
Yes. Or if it continues to run.
>> > Viresh suggested earlier to move watchdog to DL but apparently per-cpu
>> > threads are not supported. sched_setattr fails on this check:
>> >
>> > kernel/sched/core.c#n4167
>
>> Actually, how often does the softlockup watchdog run?
>
> Every 4 seconds (really it's /proc/sys/kernel/watchdog_thresh * 2 / 5
> and watchdog_thresh defaults to 10). There is a per-cpu hrtimer which
> wakes the per-cpu thread in order to check that tasks can still
> execute, this works very well against bugs like infinite loops in
> softirq mode. The timers are synchronized initially but can get
> staggered (for example by hotplug).
>
> My guess is that it's only marked RT so that it executes ahead of other
> threads and the watchdog doesn't trigger simply when there are lots of
> userspace tasks.
I think so too.
I see a couple of more-or-less hackish ways to avoid the issue, but
nothing particularly attractive ATM.
I wouldn't change the general behavior with respect to RT tasks
because of this, though, as we would quickly find a case in which that
would turn out to be not desirable.
Thanks,
Rafael
On 09-01-18, 16:43, Leonard Crestez wrote:
> What I am saying is that, as far as I can tell, cpufreq_update_util
> is called when the task has already executed and is being switched out.
Can you check if this patch makes it any better?
https://marc.info/?l=linux-kernel&m=151204248901636&w=2
> My tests are not very elaborate but based on some ftracing it seems to
> me that the current behavior is for cpufreq spikes to always trail RT
> activity. Like this:
>
>           <idle>-0     [002]   496.510138: sched_switch:         swapper/2:0 [120] S ==> watchdog/2:20 [0]
>       watchdog/2-20    [002]   496.510156: bprint:               watchdog: IN watchdog(2)
>       watchdog/2-20    [002]   496.510364: bprint:               watchdog: OU watchdog(2)
>       watchdog/2-20    [002]   496.510377: bprint:               update_curr_rt: watchdog kick RT! cpu=2 comm=watchdog/2
Probably update_curr_rt is getting called a bit after the task has
already run. The above patch moves the call to cpufreq_update_util()
to enqueue/dequeue paths and that should fix it.
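I.e. conceptually something like this (paraphrased sketch, not the
actual diff):

    static void
    enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
    {
            /* ... */

            /* Kick cpufreq when the RT task becomes runnable, rather
             * than from the update_curr_rt() accounting after it ran. */
            cpufreq_update_util(rq, SCHED_CPUFREQ_RT);

            /* ... */
    }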
--
viresh
On 09/01/18 16:50, Rafael J. Wysocki wrote:
> On Tue, Jan 9, 2018 at 3:43 PM, Leonard Crestez <[email protected]> wrote:
[...]
> > Every 4 seconds (really it's /proc/sys/kernel/watchdog_thresh * 2 / 5
> > and watchdog_thresh defaults to 10). There is a per-cpu hrtimer which
> > wakes the per-cpu thread in order to check that tasks can still
> > execute, this works very well against bugs like infinite loops in
> > softirq mode. The timers are synchronized initially but can get
> > staggered (for example by hotplug).
> >
> > My guess is that it's only marked RT so that it executes ahead of other
> > threads and the watchdog doesn't trigger simply when there are lots of
> > userspace tasks.
>
> I think so too.
>
> I see a couple of more-or-less hackish ways to avoid the issue, but
> nothing particularly attractive ATM.
>
> I wouldn't change the general behavior with respect to RT tasks
> because of this, though, as we would quickly find a case in which that
> would turn out to be not desirable.
I agree we cannot generalize to all RT tasks, but what Patrick proposed
(clamping utilization of certain known tasks) might help here:
lkml.kernel.org/r/[email protected]
Maybe with a per-task interface instead of using cgroups?
The other option would be to relax DL tasks' affinity constraints, so
that a case like this might be handled. Daniel and Tommaso proposed
possible approaches; this might be a driving use case. Not sure how we
would come up with a proper runtime for the watchdog, though.
Best,
- Juri
On Wed, Jan 10, 2018 at 11:54 AM, Juri Lelli <[email protected]> wrote:
> On 09/01/18 16:50, Rafael J. Wysocki wrote:
>> On Tue, Jan 9, 2018 at 3:43 PM, Leonard Crestez <[email protected]> wrote:
>
> [...]
>
>> > Every 4 seconds (really it's /proc/sys/kernel/watchdog_thresh * 2 / 5
>> > and watchdog_thresh defaults to 10). There is a per-cpu hrtimer which
>> > wakes the per-cpu thread in order to check that tasks can still
>> > execute, this works very well against bugs like infinite loops in
>> > softirq mode. The timers are synchronized initially but can get
>> > staggered (for example by hotplug).
>> >
>> > My guess is that it's only marked RT so that it executes ahead of other
>> > threads and the watchdog doesn't trigger simply when there are lots of
>> > userspace tasks.
>>
>> I think so too.
>>
>> I see a couple of more-or-less hackish ways to avoid the issue, but
>> nothing particularly attractive ATM.
>>
>> I wouldn't change the general behavior with respect to RT tasks
>> because of this, though, as we would quickly find a case in which that
>> would turn out to be not desirable.
>
> I agree we cannot generalize to all RT tasks, but what Patrick proposed
> (clamping utilization of certain known tasks) might help here:
>
> lkml.kernel.org/r/[email protected]
>
> Maybe with a per-task interface instead of using cgroups?
The problem here is that this is a kernel thing and user space should
not be expected to have to do anything about fixing this IMO.
> The other option would be to relax DL tasks' affinity constraints, so
> that a case like this might be handled. Daniel and Tommaso proposed
> possible approaches; this might be a driving use case. Not sure how we
> would come up with a proper runtime for the watchdog, though.
That is a problem.
Basically, it needs to run as soon as possible, but it will be running
for a very short time, every time. Overall, using a thread for that
seems wasteful ...
Thanks,
Rafael
On 10/01/18 13:35, Rafael J. Wysocki wrote:
> On Wed, Jan 10, 2018 at 11:54 AM, Juri Lelli <[email protected]> wrote:
> > On 09/01/18 16:50, Rafael J. Wysocki wrote:
> >> On Tue, Jan 9, 2018 at 3:43 PM, Leonard Crestez <[email protected]> wrote:
> >
> > [...]
> >
> >> > Every 4 seconds (really it's /proc/sys/kernel/watchdog_thresh * 2 / 5
> >> > and watchdog_thresh defaults to 10). There is a per-cpu hrtimer which
> >> > wakes the per-cpu thread in order to check that tasks can still
> >> > execute, this works very well against bugs like infinite loops in
> >> > softirq mode. The timers are synchronized initially but can get
> >> > staggered (for example by hotplug).
> >> >
> >> > My guess is that it's only marked RT so that it executes ahead of other
> >> > threads and the watchdog doesn't trigger simply when there are lots of
> >> > userspace tasks.
> >>
> >> I think so too.
> >>
> >> I see a couple of more-or-less hackish ways to avoid the issue, but
> >> nothing particularly attractive ATM.
> >>
> >> I wouldn't change the general behavior with respect to RT tasks
> >> because of this, though, as we would quickly find a case in which that
> >> would turn out to be not desirable.
> >
> > I agree we cannot generalize to all RT tasks, but what Patrick proposed
> > (clamping utilization of certain known tasks) might help here:
> >
> > lkml.kernel.org/r/[email protected]
> >
> > Maybe with a per-task interface instead of using cgroups?
>
> The problem here is that this is a kernel thing and user space should
> not be expected to have to do anything about fixing this IMO.
Not sure. If we had such an interface, it should be possible to
use it from both kernel and userspace. In this case the kernel might be
able to do the "right" thing. Also, RT userspace is usually already
responsible for configuring system priorities, so it might be easy to
set this as well.
> > The other option would be to relax DL tasks' affinity constraints, so
> > that a case like this might be handled. Daniel and Tommaso proposed
> > possible approaches; this might be a driving use case. Not sure how we
> > would come up with a proper runtime for the watchdog, though.
>
> That is a problem.
>
> Basically, it needs to run as soon as possible, but it will be running
> for a very short time, every time.
Does it really need to run "as soon as possible" or is it "at least
once every watchdog period"? In the latter case DL might still fit, with
a very short runtime (to be defined).
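E.g., with hypothetical numbers for the default 4 s watchdog period
(the runtime value is a pure guess):

    struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  = 100ULL * 1000,         /* 100 us budget */
            .sched_deadline = 4000ULL * 1000 * 1000, /* within 4 s    */
            .sched_period   = 4000ULL * 1000 * 1000, /* every 4 s     */
    };

That would reserve a negligible 0.0025% of each CPU.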
> Overall, using a thread for that seems wasteful ...
Not sure I'm following you here; aren't we using a thread already?
Thanks,
- Juri
On Wednesday, January 10, 2018 3:21:58 PM CET Juri Lelli wrote:
> On 10/01/18 13:35, Rafael J. Wysocki wrote:
> > On Wed, Jan 10, 2018 at 11:54 AM, Juri Lelli <[email protected]> wrote:
> > > On 09/01/18 16:50, Rafael J. Wysocki wrote:
> > >> On Tue, Jan 9, 2018 at 3:43 PM, Leonard Crestez <[email protected]> wrote:
> > >
> > > [...]
> > >
> > >> > Every 4 seconds (really it's /proc/sys/kernel/watchdog_thresh * 2 / 5
> > >> > and watchdog_thresh defaults to 10). There is a per-cpu hrtimer which
> > >> > wakes the per-cpu thread in order to check that tasks can still
> > >> > execute, this works very well against bugs like infinite loops in
> > >> > softirq mode. The timers are synchronized initially but can get
> > >> > staggered (for example by hotplug).
> > >> >
> > >> > My guess is that it's only marked RT so that it executes ahead of other
> > >> > threads and the watchdog doesn't trigger simply when there are lots of
> > >> > userspace tasks.
> > >>
> > >> I think so too.
> > >>
> > >> I see a couple of more-or-less hackish ways to avoid the issue, but
> > >> nothing particularly attractive ATM.
> > >>
> > >> I wouldn't change the general behavior with respect to RT tasks
> > >> because of this, though, as we would quickly find a case in which that
> > >> would turn out to be not desirable.
> > >
> > > I agree we cannot generalize to all RT tasks, but what Patrick proposed
> > > (clamping utilization of certain known tasks) might help here:
> > >
> > > lkml.kernel.org/r/[email protected]
> > >
> > > Maybe with a per-task interface instead of using cgroups?
> >
> > The problem here is that this is a kernel thing and user space should
> > not be expected to have to do anything about fixing this IMO.
>
> Not sure. If we had such an interface, it should be possible to
> use it from both kernel and userspace.
OK
> In this case the kernel might be able
> to do the "right" thing. Also, RT userspace is usually already responsible
> for configuring system priorities, so it might be easy to set this as well.
>
> > > The other option would be to relax DL tasks' affinity constraints, so
> > > that a case like this might be handled. Daniel and Tommaso proposed
> > > possible approaches; this might be a driving use case. Not sure how we
> > > would come up with a proper runtime for the watchdog, though.
> >
> > That is a problem.
> >
> > Basically, it needs to run as soon as possible, but it will be running
> > for a very short time, every time.
>
> Does it really need to run "as soon as possible" or is it "at least
> once every watchdog period"? In the latter case DL might still fit, with
> a very short runtime (to be defined).
I guess the latter is closer to what's needed.
> > Overall, using a thread for that seems wasteful ...
>
> Not sure I'm following you here; aren't we using a thread already?
Yes, we are, which is why I'm wondering if that is the right choice. :-)
Thanks,
Rafael