Hi,
I'm seeing a strange behavior that I _think_ is a bug. I'm hoping that
some of the scheduling experts can tell me if I'm just
misunderstanding or if this is truly a bug. To see it, I do this:
--
# Allow 1000 us more of RT at system and top cgroup
old_rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
echo $((old_rt + 1000)) > /proc/sys/kernel/sched_rt_runtime_us
old_rt=$(cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us)
echo $((old_rt + 1000)) > /sys/fs/cgroup/cpu/cpu.rt_runtime_us
# Give the 1000 us to my own group
mkdir /sys/fs/cgroup/cpu/doug
echo 1000 > /sys/fs/cgroup/cpu/doug/cpu.rt_runtime_us
# Fork off a bunch of spinny things
for i in $(seq 13); do
  python -c "while True: pass" &
done
# Make my spinny things RT and put in my group
# (assumes no other python is running!)
for pid in $(ps aux | grep python | grep -v grep | awk '{print $2}'); do
  echo $pid >> /sys/fs/cgroup/cpu/doug/tasks
  chrt -p -f 99 $pid
done
--
As expected, the spinny python tasks are pretty much throttled down to
0 in the above (they get 1 ms out of 1 second).
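(For reference, the arithmetic behind "1 ms out of 1 second" -- the 1000000 us figure below is the default cpu.rt_period_us, not something set by the script:)

```shell
# The group got runtime_us=1000 against the default RT period of
# 1000000 us, i.e. 0.1% of one CPU -- effectively throttled to nothing.
period_us=1000000   # default cpu.rt_period_us (and sched_rt_period_us)
runtime_us=1000     # what the script above wrote to cpu.rt_runtime_us
permille=$(( runtime_us * 1000 / period_us ))
echo "RT budget: ${runtime_us}us / ${period_us}us = ${permille} per mille of one CPU"
```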
However, _the bug_ is that the above basically causes all _other_ RT
things in my system to stop functioning. I'm on an ARM Chromebook
(sc7180-trogdor) and we communicate to our EC on a "realtime" thread
due to SPI timing requirements. The above commands appear to starve
the EC's communication task and (as far as I can tell) other RT tasks
in the system.
Notably:
a) My EC comms slow to a crawl (eventually one gets through).
b) "top" shows stuff like this:
  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  179 root    -51   0       0      0      0 R 100.0  0.0   0:31.79 cros_ec_spi_hig
  180 root    -51   0       0      0      0 R  95.2  0.0   0:50.19 irq/169-chromeo
  184 root    -51   0       0      0      0 R  95.2  0.0   0:13.24 spi10
  221 root     -2   0       0      0      0 R  95.2  0.0   0:50.57 ring0
c) If I give my spinny tasks just a little bit more time than 1 ms
then I get a hung task.
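(The hung task in (c) would be reported by the kernel's hung-task detector; its timeout is a sysctl. A guarded sketch for checking it -- the 120 s figure is an assumption from mainline defaults, not something measured here:)

```shell
# The hung-task report comes from khungtaskd; the timeout it uses is a
# sysctl (120 s is the usual mainline default). Guarded so this snippet
# runs even on systems where the sysctl is absent:
f=/proc/sys/kernel/hung_task_timeout_secs
if [ -r "$f" ]; then
    timeout_secs=$(cat "$f")
else
    timeout_secs=120   # assumed mainline default
fi
echo "hung_task_timeout_secs: ${timeout_secs}"
```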
When I'm testing the above, the non-RT stuff in the system continues
to work OK though. I can even go in and kill all my python tasks and
the system returns to normal.
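(In case it's useful, teardown to get back to a clean state looks roughly like this -- a sketch assuming the setup above, with every step guarded so it is harmless elsewhere:)

```shell
# Kill the spinners, remove the (now empty) cgroup, and restore the
# sysctl. Note the repro script reused $old_rt for both values, so the
# original sched_rt_runtime_us has to be restored by hand (950000 is
# the kernel default).
pkill -f 'while True: pass' 2>/dev/null || true
if [ -d /sys/fs/cgroup/cpu/doug ]; then
    rmdir /sys/fs/cgroup/cpu/doug    # only succeeds once the group is empty
fi
if [ -w /proc/sys/kernel/sched_rt_runtime_us ]; then
    echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
fi
status="cleanup attempted"
echo "$status"
```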
I tried gathering some tracing. One bit that might (?) be relevant:
cros_ec_spi_hig-179 [000] d.h5 1495.305919: sched_waking: comm=kworker/4:2 pid=5232 prio=120 target_cpu=004
cros_ec_spi_hig-179 [000] d.h6 1495.305926: sched_wakeup: comm=kworker/4:2 pid=5232 prio=120 target_cpu=004
<idle>-0 [001] d.H5 1495.309113: sched_waking: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
<idle>-0 [001] d.H6 1495.309119: sched_wakeup: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
cros_ec_spi_hig-179 [000] d.h5 1495.309336: sched_waking: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
cros_ec_spi_hig-179 [000] d.h6 1495.309341: sched_wakeup: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
<idle>-0 [001] d.H5 1495.312137: sched_waking: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
<idle>-0 [001] d.H6 1495.312142: sched_wakeup: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
cros_ec_spi_hig-179 [000] d.h5 1495.312859: sched_waking: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
cros_ec_spi_hig-179 [000] d.h6 1495.312870: sched_wakeup: comm=sugov:6 pid=2658 prio=-1 target_cpu=006
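(For anyone wanting to reproduce a capture like the above, one way is the stock ftrace event interface -- a sketch assuming tracefs at /sys/kernel/tracing, guarded so it just prints a note without root:)

```shell
# Enable the two sched events seen in the trace, record briefly, dump.
T=/sys/kernel/tracing
if [ -w "$T/tracing_on" ]; then
    echo 1 > "$T/events/sched/sched_waking/enable"
    echo 1 > "$T/events/sched/sched_wakeup/enable"
    echo 1 > "$T/tracing_on"
    sleep 2
    echo 0 > "$T/tracing_on"
    tail -n 20 "$T/trace"
    capture_status="captured"
else
    capture_status="tracefs not writable; skipped"
fi
echo "$capture_status"
```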
My best guess is that there's some bug in the scheduler where it just
loops constantly picking an unthrottled RT task but then incorrectly
decides that it's throttled and thus doesn't run it.
Most of my testing has been on the chromeos-5.4 kernel, but just in
case I tried a vanilla v5.15 kernel and I could reproduce the same
problems.
Anyway, if I'm just doing something stupid then I apologize for the
noise. If the above should work and you need me to gather more logging
/ try any experiments, I'm happy to do so.
Thanks!
-Doug
On Fri, Nov 5, 2021 at 1:44 PM Doug Anderson <[email protected]> wrote:
[..]
>
>
> I tried gathering some tracing. One bit that might (?) be relevant:
>
> [..]
>
> My best guess is that there's some bug in the scheduler where it just
> loops constantly picking an unthrottled RT task but then incorrectly
> decides that it's throttled and thus doesn't run it.
Thanks for posting this. Tricky bit indeed. I was wondering if the
issue is here:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/rt.c#L1031
Basically, if something is actively being RT-throttled on a CPU, and
something else that is RT but unthrottled becomes ready to run, the
scheduler does resched_curr(rq) anyway. This probably happens when
update_curr_rt() is called on the unthrottled RT task via
task_tick_rt(). That seems like a limitation of the current RT
scheduler, I think? Thoughts?
On Fri, Nov 5, 2021 at 7:05 PM Joel Fernandes <[email protected]> wrote:
>
> On Fri, Nov 5, 2021 at 7:04 PM Joel Fernandes <[email protected]> wrote:
> >
> > On Fri, Nov 5, 2021 at 1:44 PM Doug Anderson <[email protected]> wrote:
> > [..]
> > >
> > >
> > > I tried gathering some tracing. One bit that might (?) be relevant:
> > >
> > > [..]
> > >
> > > My best guess is that there's some bug in the scheduler where it just
> > > loops constantly picking an unthrottled RT task but then incorrectly
> > > decides that it's throttled and thus doesn't run it.
> >
> > Thanks for posting this. Tricky bit indeed. I was wondering if the
> > issue is here:
> > https://elixir.bootlin.com/linux/latest/source/kernel/sched/rt.c#L1031
>
> Errm, I meant Line 1060 in rt.c
> https://elixir.bootlin.com/linux/latest/source/kernel/sched/rt.c#L1060
AAARGH, I hate elixir. I meant this code in update_curr_rt():

    if (sched_rt_runtime_exceeded(rt_rq))
        resched_curr(rq);
On Fri, Nov 5, 2021 at 7:04 PM Joel Fernandes <[email protected]> wrote:
>
> On Fri, Nov 5, 2021 at 1:44 PM Doug Anderson <[email protected]> wrote:
> [..]
> >
> >
> > I tried gathering some tracing. One bit that might (?) be relevant:
> >
> > [..]
> >
> > My best guess is that there's some bug in the scheduler where it just
> > loops constantly picking an unthrottled RT task but then incorrectly
> > decides that it's throttled and thus doesn't run it.
>
> Thanks for posting this. Tricky bit indeed. I was wondering if the
> issue is here:
> https://elixir.bootlin.com/linux/latest/source/kernel/sched/rt.c#L1031
Errm, I meant Line 1060 in rt.c
https://elixir.bootlin.com/linux/latest/source/kernel/sched/rt.c#L1060
Hi,
On Fri, Nov 5, 2021 at 10:44 AM Doug Anderson <[email protected]> wrote:
>
> [..]
>
> Anyway, if I'm just doing something stupid then I apologize for the
> noise. If the above should work and you need me to gather more logging
> / try any experiments, I'm happy to do so.
I spent a little more time here and:
a) Managed to come up with a test case that makes it easy for anyone
to reproduce the problem themselves.
b) Managed to come up with a short patch that "fixes" this problem for me.
Since I had a potential fix, I posted this as a patch. Maybe we can
move the discussion to my post of that patch, which will hopefully
show up at:
https://lore.kernel.org/r/20211115170241.1.I94825a614577505bd1a8be9aeff208a49cb39b3d@changeid
...oddly that patch isn't showing up (?). It should be "[PATCH]
sched/rt: Don't reschedule a throttled task even if it's higher
priority". If it's not there tomorrow, I can try to repost again.
-Doug