2014-04-24 18:27:18

by Roman Gushchin

Subject: Real-time scheduling policies and hyper-threading

Hello!


I spent some time investigating why switching runtime* tasks to real-time scheduling policies
increases response-time dispersion, while the opposite is expected.

The main reason is hyper-threading. The rt scheduler only tries to load all logical CPUs, selecting
the topologically closest one when the current CPU is busy. If hyper-threading is enabled, this
strategy is counter-productive: tasks suffer on busy HT threads while there are plenty of idle physical cores.

Also, the rt scheduler doesn't try to balance rt load between physical CPUs. This matters because of
turbo-boost and frequency-scaling technologies: per-core performance depends on the number of
idle cores in the same physical cpu.


Are there any known solutions to this problem except disabling hyper-threading and frequency scaling entirely?

Are there any common plans to enhance the load balancing algorithm in the rt-scheduler?

Does anyone use rt-scheduler for runtime-like cpu-bound tasks?


Why not just use CFS? :-)
The rt scheduler with modified load balancing shows much better results.
I have a prototype (still incomplete and with many dirty hacks) that shows a 10-15%
performance increase in our production.


(*) A simplified model can be described as follows:
there is one process per machine, with one thread that receives requests from the network and puts them into a queue,
and n (n ~ NCPU + 1) worker threads that take requests from the queue and handle them.
The load is cpu-bound, tens of milliseconds per request. Typical CPU load is between 40% and 70%.
A typical system has two physical x86-64 cpus with 8-16 physical cores each (x2 with hyper-threading).


Thanks,
Roman


2014-04-24 18:59:03

by Peter Zijlstra

Subject: Re: Real-time scheduling policies and hyper-threading

On Thu, Apr 24, 2014 at 10:16:12PM +0400, Roman Gushchin wrote:
> Are there any known solutions of this problem except disabling
> hyper-threading and frequency scaling at all?

This is what we generally tell people to do; disable HT in the BIOS,
offline the siblings or similar approaches.

Similarly we typically tell people to keep each rt workload to a single
NUMA node, and in case the workload really is bigger and must cross into
multiple, use memory interleave where possible.

> Are there any common plans to enhance the load balancing algorithm in
> the rt-scheduler?

Nope; at best the topology driven placement is going to be a heuristic,
and the last thing you want for your realtime tasks is (non deterministic)
placement heuristics.

You, and only you, know your workload and can devise a correct placement
policy. We have cpusets and cpu affinity available to carve up your
system for this.

That said, there might be something we could do for soft(er) realtime,
but I don't think there's been much (if any) research on this.

> Does anyone use rt-scheduler for runtime-like cpu-bound tasks?

So in general cpu bound tasks in the RT classes (FIFO/RR/DEADLINE) are
bad and can make the system go funny.

For general system health it is important that various system tasks
(kthreads usually) can run. Many of these kthreads run at !rt prios, and
by having cpu bound tasks in rt prios they don't get to run.

2014-04-24 20:24:51

by Peter Zijlstra

Subject: Re: Real-time scheduling policies and hyper-threading

On Fri, Apr 25, 2014 at 12:16:52AM +0400, Kirill Tkhai wrote:
> 24.04.2014, 22:59, "Peter Zijlstra" <[email protected]>:
>
> [snip]
>
> >> Does anyone use rt-scheduler for runtime-like cpu-bound tasks?
> >
> > So in general cpu bound tasks in the RT classes (FIFO/RR/DEADLINE) are
> > bad and can make the system go funny.
> >
> > For general system health it is important that various system tasks
> > (kthreads usually) can run. Many of these kthreads run at !rt prios, and
> > by having cpu bound tasks in rt prios they don't get to run.
>
> One more word on this. I had such experience on a 2.6.33 kernel with the RT
> patch and weak hardware (sparc32).
>
> Networking was actively used and the application did not use any IO operations.
>
> The user needs to set all RT priorities himself. It's necessary to set RT
> priorities at least for softirqs and the RCU threads. RT bandwidth throttling
> must be switched off.
>
> The biggest optimization I got came from dropping NAPI for the network
> adapters and splitting the interrupt handler into hard and threaded parts.
> After that, careful binding of everything strongly improves the picture
> for a single workload.

Sure, it can be made to work, but you really need to know what you're
doing and you get to keep all pieces when it comes apart :-)

2014-04-24 20:27:12

by Kirill Tkhai

Subject: Re: Real-time scheduling policies and hyper-threading

24.04.2014, 22:59, "Peter Zijlstra" <[email protected]>:

[snip]

>> Does anyone use rt-scheduler for runtime-like cpu-bound tasks?
>
> So in general cpu bound tasks in the RT classes (FIFO/RR/DEADLINE) are
> bad and can make the system go funny.
>
> For general system health it is important that various system tasks
> (kthreads usually) can run. Many of these kthreads run at !rt prios, and
> by having cpu bound tasks in rt prios they don't get to run.

One more word on this. I had such experience on a 2.6.33 kernel with the RT
patch and weak hardware (sparc32).

Networking was actively used and the application did not use any IO operations.

The user needs to set all RT priorities himself. It's necessary to set RT
priorities at least for softirqs and the RCU threads. RT bandwidth throttling
must be switched off.

The biggest optimization I got came from dropping NAPI for the network
adapters and splitting the interrupt handler into hard and threaded parts.
After that, careful binding of everything strongly improves the picture
for a single workload.

Kirill

2014-04-25 11:04:56

by Roman Gushchin

Subject: Re: Real-time scheduling policies and hyper-threading

24.04.2014, 22:59, "Peter Zijlstra" <[email protected]>:
> On Thu, Apr 24, 2014 at 10:16:12PM +0400, Roman Gushchin wrote:
>
>> Are there any known solutions to this problem except disabling
>> hyper-threading and frequency scaling entirely?
>
> This is what we generally tell people to do; disable HT in the BIOS,
> offline the siblings or similar approaches.

It's a good but expensive approach if you have many machines.


> Similarly we typically tell people to keep each rt workload to a single
> NUMA node, and in case the workload really is bigger and must cross into
> multiple, use memory interleave where possible.
>
>> Are there any common plans to enhance the load balancing algorithm in
>> the rt-scheduler?
>
> Nope; at best the topology driven placement is going to be a heuristic,
> and the last thing you want for your realtime tasks is (non deterministic)
> placement heuristics.
>
> You, and only you, know your workload and can devise a correct placement
> policy. We have cpusets and cpu affinity available to carve up your
> system for this.

The problem here is that _all_ userspace tasks should care about cpu placement.
In reality, there is always a number of monitoring/administration/system tasks,
whose execution increases latencies. Using nice/SCHED_BATCH/... doesn't solve
the problem, because of hyper-threading.

Of course, it's possible to divide CPUs statically via cpusets/partitioning,
but that's also not cheap in terms of overall utilization.


> That said, there might be something we could do for soft(er) realtime,
> but I don't think there's been much (if any) research on this.
>
>> Does anyone use rt-scheduler for runtime-like cpu-bound tasks?
>
> So in general cpu bound tasks in the RT classes (FIFO/RR/DEADLINE) are
> bad and can make the system go funny.
>
> For general system health it is important that various system tasks
> (kthreads usually) can run. Many of these kthreads run at !rt prios, and
> by having cpu bound tasks in rt prios they don't get to run.

I had also expected a number of problems here, but in fact we caught only
a tricky race in the timers code ( https://lkml.org/lkml/2014/3/17/323 ).
We haven't noticed any other such problems on several hundred machines
working in runtime production for a few weeks.

Probably it's important that the CPU load in our case is never 100% for a period
longer than a few hundred milliseconds.


Many thanks for your comments!

Regards,
Roman

2014-04-25 11:17:31

by Roman Gushchin

Subject: Re: Real-time scheduling policies and hyper-threading

25.04.2014, 00:16, "Kirill Tkhai" <[email protected]>:
> 24.04.2014, 22:59, "Peter Zijlstra" <[email protected]>:
>
> [snip]
>
>>> Does anyone use rt-scheduler for runtime-like cpu-bound tasks?
>> So in general cpu bound tasks in the RT classes (FIFO/RR/DEADLINE) are
>> bad and can make the system go funny.
>>
>> For general system health it is important that various system tasks
>> (kthreads usually) can run. Many of these kthreads run at !rt prios, and
>> by having cpu bound tasks in rt prios they don't get to run.
>
> One more word on this. I had such experience on a 2.6.33 kernel with the RT
> patch and weak hardware (sparc32).
>
> Networking was actively used and the application did not use any IO operations.
>
> The user needs to set all RT priorities himself. It's necessary to set RT
> priorities at least for softirqs and the RCU threads. RT bandwidth throttling
> must be switched off.

I disable RT bandwidth sharing and change rt_period/rt_runtime to 1/100 ms
respectively.

Thanks,
Roman

2014-04-25 13:16:32

by Peter Zijlstra

Subject: Re: Real-time scheduling policies and hyper-threading

On Fri, Apr 25, 2014 at 03:04:49PM +0400, Roman Gushchin wrote:
> 24.04.2014, 22:59, "Peter Zijlstra" <[email protected]>:
> > On Thu, Apr 24, 2014 at 10:16:12PM +0400, Roman Gushchin wrote:
> >
> >> Are there any known solutions to this problem except disabling
> >> hyper-threading and frequency scaling entirely?
> >
> > This is what we generally tell people to do; disable HT in the BIOS,
> > offline the siblings or similar approaches.
>
> It's a good but expensive approach if you have many machines.

Expensive how? If you require the latency, there's really no option.

That said, if only a small part of your workload is RT then you don't
need to disable all SMT siblings; you need only disable the few that are
in your RT partition.

> > You, and only you, know your workload and can devise a correct placement
> > policy. We have cpusets and cpu affinity available to carve up your
> > system for this.
>
> The problem here is that _all_ userspace tasks should care about cpu placement.
> In reality, there is always a number of monitoring/administration/system tasks,
> whose execution increases latencies. Using nice/SCHED_BATCH/... doesn't solve
> the problem, because of hyper-threading.

The thing is, normal tasks don't care, and it's perfectly fine to have
heuristics that work more or less most of the time.

But realtime does care, it must absolutely always work in a predictable
fashion.

> Of course, it's possible to divide CPUs statically via cpusets/partitioning,
> but that's also not cheap in terms of overall utilization.

Well, realtime isn't about getting full utilization from your hardware
to begin with. It's about getting deterministic behaviour from your
hardware.

If your workload allows both, then yay, but it's absolutely not true in
general.

> Probably it's important that the CPU load in our case is never 100% for a period
> longer than a few hundred milliseconds.

Yeah, that's not a problem. Pegging the CPU for 10s of seconds and
upwards will get funny real quick though.

2014-04-25 15:11:20

by Peter Zijlstra

Subject: Re: Real-time scheduling policies and hyper-threading

On Fri, Apr 25, 2014 at 07:02:16PM +0400, Roman Gushchin wrote:
> Hm. What I really want (and am trying to implement) is
> "work as if HT is disabled while there are free physical cores; start using HT siblings otherwise".

At which point I have to ask, what about the rest of the topology?

Also, how is a task to know whether it's the 16th or the 17th and thus
should expect worse latency?

> It's a 32-thread processor with 16 physical cores.

No NUMA? I'm not aware of single node systems with 16 cores.

2014-04-25 15:13:08

by Roman Gushchin

Subject: Re: Real-time scheduling policies and hyper-threading

25.04.2014, 17:16, "Peter Zijlstra" <[email protected]>:
> On Fri, Apr 25, 2014 at 03:04:49PM +0400, Roman Gushchin wrote:
>
>> 24.04.2014, 22:59, "Peter Zijlstra" <[email protected]>:
>>> On Thu, Apr 24, 2014 at 10:16:12PM +0400, Roman Gushchin wrote:
>>>> Are there any known solutions to this problem except disabling
>>>> hyper-threading and frequency scaling entirely?
>>> This is what we generally tell people to do; disable HT in the BIOS,
>>> offline the siblings or similar approaches.
>> It's a good but expensive approach if you have many machines.
>
> Expensive how? If you require the latency, there's really no option.

Hm. What I really want (and am trying to implement) is
"work as if HT is disabled while there are free physical cores; start using HT siblings otherwise".

> That said, if only a small part of your workload is RT then you don't
> need to disable all SMT siblings, you can only disable the few that are
> in your RT partition.

Ok, I understand that what I have is something different from what is usually called RT.

I have requests from users, and I want to handle them without any significant delays
when possible. For instance, if a request takes 100ms, I don't want to get 165ms
just because the sibling HT thread was loaded. At the same time, it's important to have
good resource utilization, which can be less important for real RT tasks.

Of course, there is always a trade-off between latency and utilization,
but the common practice here is to keep cpu load between 40% and 70%. So there is
almost always a free CPU thread, and a good chance that there is a free core.
In this case, both latency and average performance depend on how effectively
all physical cores are utilized.

Below is a comparison of average response times (in microseconds) under CFS and under
the RT scheduler with my patches, depending on the number of simultaneous requests.
It's a 32-thread processor with 16 physical cores.

nr_req CFS RT* diff
1 9664 9237 +4.41%
2 9179 9107 +0.79%
3 9179 9310 -1.43%
4 9325 9245 +0.86%
5 9562 9387 +1.83%
6 9523 9478 +0.47%
7 10105 9727 +3.74%
8 9900 9781 +1.21%
9 10077 9871 +2.04%
10 10270 9963 +2.99%
11 10604 10231 +3.52%
12 10760 10299 +4.29%
13 11022 10316 +6.40%
14 11595 10310 +11.08%
15 11489 10200 +11.22%
16 12059 10404 +13.72%
17 12056 10691 +11.32%
18 12208 11027 +9.67%
19 12412 11337 +8.66%
20 12634 11663 +7.69%
21 12843 12165 +5.28%
22 13181 12273 +6.89%
23 13214 12560 +4.95%
24 13465 12822 +4.78%
25 13469 13106 +2.70%
26 13725 13329 +2.88%
27 13943 13581 +2.60%
28 14007 13845 +1.16%
29 14242 14040 +1.41%
30 14386 14265 +0.84%
31 14909 14478 +2.90%
32 14772 14769 +0.02%

Probably it's possible to tune CFS for this case, but I have no idea how.

In my opinion, such a load pattern is not very rare. For instance, front-end
web servers can have similar load.

Thanks,
Roman

2014-04-25 15:16:26

by Peter Zijlstra

Subject: Re: Real-time scheduling policies and hyper-threading

On Fri, Apr 25, 2014 at 07:02:16PM +0400, Roman Gushchin wrote:
> Of course, there is always a trade-off between latency and utilization,
> but the common practice here is to keep cpu load between 40% and 70%. So there is
> almost always a free CPU thread, and a good chance that there is a free core.
> In this case, both latency and average performance depend on how effectively
> all physical cores are utilized.
>
> Below is a comparison of average response times (in microseconds) under CFS and
> under the RT scheduler with my patches, depending on the number of simultaneous
> requests. It's a 32-thread processor with 16 physical cores.
>
> nr_req CFS RT* diff
> 1 9664 9237 +4.41%
> 2 9179 9107 +0.79%
> 3 9179 9310 -1.43%
> 4 9325 9245 +0.86%
> 5 9562 9387 +1.83%
> 6 9523 9478 +0.47%
> 7 10105 9727 +3.74%
> 8 9900 9781 +1.21%
> 9 10077 9871 +2.04%
> 10 10270 9963 +2.99%
> 11 10604 10231 +3.52%
> 12 10760 10299 +4.29%
> 13 11022 10316 +6.40%
> 14 11595 10310 +11.08%
> 15 11489 10200 +11.22%
> 16 12059 10404 +13.72%
> 17 12056 10691 +11.32%
> 18 12208 11027 +9.67%
> 19 12412 11337 +8.66%
> 20 12634 11663 +7.69%
> 21 12843 12165 +5.28%
> 22 13181 12273 +6.89%
> 23 13214 12560 +4.95%
> 24 13465 12822 +4.78%
> 25 13469 13106 +2.70%
> 26 13725 13329 +2.88%
> 27 13943 13581 +2.60%
> 28 14007 13845 +1.16%
> 29 14242 14040 +1.41%
> 30 14386 14265 +0.84%
> 31 14909 14478 +2.90%
> 32 14772 14769 +0.02%
>
> Probably it's possible to tune CFS for this case, but I have no idea how.

Well, like with everything, trace the heck out of it, see where it makes
the 'wrong' decision, then try and cure it.

The 'problem' seems to be in the 13-20 requests range.

2014-04-25 16:19:41

by Roman Gushchin

Subject: Re: Real-time scheduling policies and hyper-threading

25.04.2014, 19:11, "Peter Zijlstra" <[email protected]>:
> On Fri, Apr 25, 2014 at 07:02:16PM +0400, Roman Gushchin wrote:
>
>> Hm. What I really want (and am trying to implement) is
>> "work as if HT is disabled while there are free physical cores; start using HT siblings otherwise".
>
> At which point I have to ask, what about the rest of the topology?

My prototype works as follows: all physical cores on every numa node
are linked into a circular list. When I have to select a cpu, I traverse the list
searching for a free core. If there is one, I select it. Otherwise, I jump to the
other node and search there too. It's better to preserve symmetry here: when I start
with local core number 3, it's reasonable to start with remote core number 3 too.
There is also "node balancing" logic that makes the search start at the
remote node if a big imbalance is detected. It helps to behave well under
small load (<8 requests). When there are no free cores, I use similar logic
to find a free thread.

Now I'm trying to make this algorithm more general and scalable. It would be
great to use the current O(1) cpupri approach at each cpu topology level somehow,
but I have no complete solution yet.

> Also, how is a task to know if its the 16th or 17th and thus should
> expect worse latency?

No way.

>> It's a 32-thread processor with 16 physical cores.
>
> No NUMA? I'm not aware of single node systems with 16 cores.

Of course, 2 physical processors with 8 cores each :)
Sorry.

Thanks,
Roman