2019-11-15 00:45:18

by Rafikov, Rustem

Subject: RT scheduler is suboptimal when an RT thread preempts another RT in terms of choosing a core to migrate

Hi,

When an RT thread preempts another RT thread, the preempted one gets migrated to another core.
The way the RT scheduler chooses that core is quite suboptimal. Let me give an example from a "production" server with 32 physical cores.
There are SCHED_NORMAL threads (each affined to a particular core) and 2+ groups of RT threads (allowed to run everywhere).
Scheduler traces showed that in most cases the RT scheduler preempts a normal-priority thread on some core to place the evicted RT thread there, rather than using one of the idle cores, of which the system had plenty according to the trace.

I reproduced the behavior on a vanilla 4.18.0 kernel with a micro test where I created 10 SCHED_NORMAL threads affined to 10 cores,
3 RT/69 threads with 0xFFFFFFFF affinity, and a few RT/79 threads kicking other RTs off their CPUs every 5 msec.
The remaining cores were idle, but the RT/69 threads never migrated to them.
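
In case it helps to reproduce, here is a minimal sketch of such a micro test. The thread bodies, burst lengths and the spawn() helper are mine and purely illustrative; only the priorities, affinities and the 5 msec period come from the description above.

/*
 * repro.c - illustrative reproducer sketch (run as root):
 *   gcc -O2 -pthread repro.c -o repro
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *busy(void *arg)		/* spin forever */
{
	(void)arg;
	for (;;)
		;
	return NULL;
}

static void *kicker(void *arg)		/* RT/79: short burst of work every 5 msec */
{
	(void)arg;
	for (;;) {
		for (volatile long i = 0; i < 200000; i++)
			;
		usleep(5000);
	}
	return NULL;
}

/* Create a thread with the given policy/priority, optionally pinned to one CPU. */
static void spawn(void *(*fn)(void *), int policy, int prio, int cpu)
{
	pthread_attr_t attr;
	struct sched_param sp = { .sched_priority = prio };
	pthread_t tid;

	pthread_attr_init(&attr);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, policy);
	if (policy != SCHED_OTHER)
		pthread_attr_setschedparam(&attr, &sp);
	if (cpu >= 0) {
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
	}
	if (pthread_create(&tid, &attr, fn, NULL))
		perror("pthread_create");
}

int main(void)
{
	int i;

	for (i = 0; i < 10; i++)	/* 10 SCHED_NORMAL spinners, one per core */
		spawn(busy, SCHED_OTHER, 0, i);
	for (i = 0; i < 3; i++)		/* 3 RT/69 spinners, free to run anywhere */
		spawn(busy, SCHED_FIFO, 69, -1);
	for (i = 0; i < 2; i++)		/* a couple of RT/79 "kickers" */
		spawn(kicker, SCHED_FIFO, 79, -1);

	pause();
	return 0;
}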

The problem seems to be in how the mapping in the cpupri structure is updated:
1) The fair scheduler neither updates nor reads it, so we don't know when a SCHED_NORMAL thread has left a CPU. Well, that may be OK.
2) The RT scheduler uses cpupri to find a core to migrate to, but it updates it incorrectly:
- RT->RT works fine [2]
- But RT->IDLE and RT->SCHED_NORMAL [1] are not right - in both cases it sets MAX_RT_PRIO (100), which is the minimum NORMAL priority!
It's totally okay to set it to MAX_RT_PRIO for all the NORMALs, but not for IDLE. BTW - IDLE means the swapper thread, which has pri=120 :) (See the convert_prio() sketch right below.)
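
For reference, the prio -> cpupri mapping in kernel/sched/cpupri.c looks roughly like this (quoted from memory of ~v4.18, so please double-check against the actual tree):

static int convert_prio(int prio)
{
	int cpupri;

	if (prio == CPUPRI_INVALID)		/* -1 */
		cpupri = CPUPRI_INVALID;
	else if (prio == MAX_PRIO)		/* 140 */
		cpupri = CPUPRI_IDLE;		/* 0 */
	else if (prio >= MAX_RT_PRIO)		/* 100..139: SCHED_NORMAL */
		cpupri = CPUPRI_NORMAL;		/* 1 */
	else					/* 0..99: RT tasks */
		cpupri = MAX_RT_PRIO - prio + 1;	/* 2..101 */

	return cpupri;
}

If I read rt.c correctly, cpupri_set() is driven from inc_rt_prio_smp()/dec_rt_prio_smp() with rt_rq->highest_prio.curr, and that field is reset to MAX_RT_PRIO (100) when the last RT task leaves the runqueue. So convert_prio() never sees MAX_PRIO, the CPUPRI_IDLE branch is never taken, and an idle CPU ends up recorded as CPUPRI_NORMAL - which matches the newp=64 / oldpri=0001 lines in trace [1] below.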

See the kprobe traces below.

[1] IDLE->RT/79->IDLE
#1. <idle>-0 [001] d.h. 14717592.107294: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0001
#2. <...>-157332 [001] d... 14717592.107313: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=64 oldpri=0051

Decoding the output of #1, cpu=1 newp=14 oldpri=0001:
- cpu=1 - it happens on core 1
- newp=14 - the priority of the thread being scheduled in is 0x14, which is RT-79 (our test thread)
- oldpri=0001 - the priority previously recorded for that CPU. "1" means NORMAL on the 0-101 cpupri scale. This is incorrect by itself, because the core was IDLE!
Let's try to figure out why it is not '0' (IDLE) by looking at the last line, cpu=1 newp=64 oldpri=0051:
- newp=64 says that the priority of the thread being scheduled in is 0x64 (100), the minimum NORMAL priority. So it is not 140, as we would expect when switching to the IDLE thread.
- oldpri=0051 - this is 0x51 (81), the priority of our RT-79 thread on the 0-101 cpupri scale.
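
To spell out the arithmetic behind these hex values (my own back-of-the-envelope numbers, assuming newp is the raw 'newpri' argument to cpupri_set() and oldpri is the previously stored cpupri value):

/*
 * RT-79 thread:  task prio = (MAX_RT_PRIO - 1) - 79 = 99 - 79 = 20 = 0x14  -> newp=14
 *                cpupri    = MAX_RT_PRIO - 20 + 1   = 81          = 0x51  -> oldpri=0051
 *
 * No RT queued:  rt_rq->highest_prio.curr = MAX_RT_PRIO = 100     = 0x64  -> newp=64
 *                convert_prio(100)        = CPUPRI_NORMAL = 1             -> oldpri=0001
 *
 * A genuine switch to idle would need prio == MAX_PRIO (140), the only value
 * convert_prio() maps to CPUPRI_IDLE (0) - and that value never arrives.
 */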


[2] RT/69->RT/79->RT/69
#1. <...>-158253 [001] d.h. 14723119.396120: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0047
#2. <...>-158254 [001] d... 14723119.396122: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=1e oldpri=0051

Line #1 - "cpu=1 newp=14 oldpri=0047" - switching to 0x14, the RT-79 thread
- the priority previously recorded for the CPU is 0x47 (71) on the 0-101 scale, i.e. RT-69
Line #2 - switching to 0x1e, which is RT-69. This is the correct value for the thread being scheduled in!
- oldpri=0051 - RT-79 on the 0-101 scale
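
Same arithmetic for this example (again my numbers):

/*
 * RT-69 thread:  task prio = 99 - 69 = 30       = 0x1e   -> newp=1e
 *                cpupri    = 100 - 30 + 1 = 71  = 0x47   -> oldpri=0047
 * RT-79 thread:  cpupri    = 100 - 20 + 1 = 81  = 0x51   -> oldpri=0051
 */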

Thanks,
Rustem



2019-11-27 15:29:40

by Dietmar Eggemann

Subject: Re: RT scheduler is suboptimal when an RT thread preempts another RT in terms of choosing a core to migrate

On 15/11/2019 01:43, Rafikov, Rustem wrote:
> Hi,
>
> When an RT thread preempts another RT thread, the preempted one gets migrated to another core.
> The way the RT scheduler chooses that core is quite suboptimal. Let me give an example from a "production" server with 32 physical cores.
> There are SCHED_NORMAL threads (each affined to a particular core) and 2+ groups of RT threads (allowed to run everywhere).
> Scheduler traces showed that in most cases the RT scheduler preempts a normal-priority thread on some core to place the evicted RT thread there, rather than using one of the idle cores, of which the system had plenty according to the trace.
>
> I reproduced the behavior on a vanilla 4.18.0 kernel with a micro test where I created 10 SCHED_NORMAL threads affined to 10 cores,
> 3 RT/69 threads with 0xFFFFFFFF affinity, and a few RT/79 threads kicking other RTs off their CPUs every 5 msec.
> The remaining cores were idle, but the RT/69 threads never migrated to them.
>
> The problem seems to be in how the mapping in the cpupri structure is updated:
> 1) The fair scheduler neither updates nor reads it, so we don't know when a SCHED_NORMAL thread has left a CPU. Well, that may be OK.
> 2) The RT scheduler uses cpupri to find a core to migrate to, but it updates it incorrectly:
> - RT->RT works fine [2]
> - But RT->IDLE and RT->SCHED_NORMAL [1] are not right - in both cases it sets MAX_RT_PRIO (100), which is the minimum NORMAL priority!
> It's totally okay to set it to MAX_RT_PRIO for all the NORMALs, but not for IDLE. BTW - IDLE means the swapper thread, which has pri=120 :)
>
> See the kprobe traces below.
>
> [1] IDLE->RT/79->IDLE
> #1. <idle>-0 [001] d.h. 14717592.107294: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0001
> #2. <...>-157332 [001] d... 14717592.107313: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=64 oldpri=0051
>
> Decoding the output of #1, cpu=1 newp=14 oldpri=0001:
> - cpu=1 - it happens on core 1
> - newp=14 - the priority of the thread being scheduled in is 0x14, which is RT-79 (our test thread)
> - oldpri=0001 - the priority previously recorded for that CPU. "1" means NORMAL on the 0-101 cpupri scale. This is incorrect by itself, because the core was IDLE!
> Let's try to figure out why it is not '0' (IDLE) by looking at the last line, cpu=1 newp=64 oldpri=0051:
> - newp=64 says that the priority of the thread being scheduled in is 0x64 (100), the minimum NORMAL priority. So it is not 140, as we would expect when switching to the IDLE thread.
> - oldpri=0051 - this is 0x51 (81), the priority of our RT-79 thread on the 0-101 cpupri scale.
>
>
> [2] RT/69->RT/79->RT/69
> #1. <...>-158253 [001] d.h. 14723119.396120: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0047
> #2. <...>-158254 [001] d... 14723119.396122: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=1e oldpri=0051
>
> Line #1 - "cpu=1 newp=14 oldpri=0047" - switching to 0x14, the RT-79 thread
> - the priority previously recorded for the CPU is 0x47 (71) on the 0-101 scale, i.e. RT-69
> Line #2 - switching to 0x1e, which is RT-69. This is the correct value for the thread being scheduled in!
> - oldpri=0051 - RT-79 on the 0-101 scale

I have seen the same thing. cp->pri_to_cpu[CPUPRI_IDLE] (CPUPRI_IDLE=0)
is never used. So cpupri_find() always skips over it.
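
For reference, a condensed sketch of the cpupri_find() search loop as I remember it from
~v4.18 (the double read of vec->count and some other details are omitted), showing where
the search falls straight through the empty CPUPRI_IDLE vector:

int cpupri_find(struct cpupri *cp, struct task_struct *p,
		struct cpumask *lowest_mask)
{
	int task_pri = convert_prio(p->prio);
	int idx;

	/* Scan priority levels from the lowest (CPUPRI_IDLE = 0) upwards. */
	for (idx = 0; idx < task_pri; idx++) {
		struct cpupri_vec *vec = &cp->pri_to_cpu[idx];

		/*
		 * idx 0 never has any CPUs in it, because cpupri_set() is
		 * never called with prio == MAX_PRIO. Idle CPUs therefore
		 * sit in the CPUPRI_NORMAL vector together with CPUs that
		 * run SCHED_NORMAL tasks.
		 */
		if (!atomic_read(&vec->count))
			continue;

		if (!cpumask_intersects(&p->cpus_allowed, vec->mask))
			continue;

		if (lowest_mask)
			cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);

		return 1;
	}

	return 0;
}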

There was
https://lore.kernel.org/r/[email protected]
in 2014 but it didn't go mainline.