Hello everyone,
This is the v2 of the discussion started for introducing per-task
latency-nice attribute for providing scheduler hints.
v1: https://lkml.org/lkml/2019/9/18/555
In brief, we face two challenges with the introduction of such an attribute.
1. Name:
==============
( Should be relevant to all the possible usecases, not confuse end-user and
reflect the functionality it provides to the scheduler behaviour )
Curated list of proposed names:
1. latency-nice:
should be easier to understand, since it builds on the pre-existing nice concept
- But it is open to two interpretations:
a) -20 (least nice to latency, i.e. sacrifice latency for throughput)
+19 (most nice to latency, i.e. sacrifice throughput for latency)
b) -20 (least nice to other task in terms of sacrificing latency, i.e.
latency-sensitive)
+19 (most nice to other tasks in terms of sacrificing latency, i.e.
latency-forgoing)
2. latency-tolerant:
decouples a bit its meaning from the niceness thus giving maybe a bit
more freedom in its complete definition and perhaps avoid any
possible interpretation confusion
3. latency-nasty
4. latency-sensible
2. Value(s):
==============
( Boolean/Ternary, Range of values, profile tagging )
- Recent discussion points to the range [-20, 19] as the most agreed upon.
1. Range:
- [-20, 19]:
This is similar to the niceness concept and gives a minimal
continuous range. It can come in handy for things like scaling the
vruntime normalization [3].
2. Profile tagging:
- Can be used just like a flag attribute
e.g., Background, foreground, latency-sensible, reduce-idle-search, etc.
3. Binary:
- 0 for: Latency sensitive/sensible/in-tolerant/hungry...
- 1 for Latency insensitive/insensible/tolerant/nice-to-others/...
Ternary:
- 0: no effect
- -1: require least latency
- +1: no restrictions in terms of lower/higher latency
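To make the relationship between these value schemes concrete, here is a minimal sketch of how a [-20, 19] range could be collapsed into the binary and ternary forms. It assumes interpretation (b) above (lower values mean more latency-sensitive). The helper names and the threshold values are purely hypothetical and do not come from any posted patch.

```python
# Illustrative only: helper names and thresholds are assumptions.
LATENCY_NICE_MIN = -20
LATENCY_NICE_MAX = 19


def clamp_latency_nice(value):
    """Clamp a requested value into the proposed [-20, 19] range."""
    return max(LATENCY_NICE_MIN, min(LATENCY_NICE_MAX, value))


def to_binary(value, threshold=0):
    """0 = latency sensitive, 1 = latency tolerant (binary proposal)."""
    return 0 if clamp_latency_nice(value) < threshold else 1


def to_ternary(value, low=-10, high=10):
    """-1 = require least latency, 0 = no effect, +1 = no restrictions."""
    v = clamp_latency_nice(value)
    if v <= low:
        return -1
    if v >= high:
        return 1
    return 0
```

The point of the sketch is only that a range can always be reduced to the coarser schemes with thresholds, while the reverse is not possible.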
------------------
**Usecases**
-----------------
1> Reduce search scan time for idle Cores
( -Subhra Mazumdar )
=====================================
Currently, CFS searches across the LLC domain for an idle core, which
can be exhaustive once the core count grows beyond a certain point.
This hurts latency-sensitive tasks, because the scheduler spends much of its
time searching for an idle core on which to wake up a task. This could
potentially be solved by limiting the idle core search for tasks that require
the least latency. Having userland provide hints to the scheduler by tagging
such tasks is a solution proposed in the community and has shown positive
results [1].
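The idea of bounding the search can be modelled in a few lines. This is a userspace sketch, not the code from [1]: the linear scaling, the min_scan floor and both function names are assumptions, and lower values are taken to mean more latency-sensitive.

```python
def idle_search_limit(latency_nice, llc_size, min_scan=2):
    """How many CPUs of the LLC domain to scan for this task.

    -20 scans only min_scan CPUs, +19 scans the whole domain, and
    values in between are scaled linearly (an assumed policy).
    """
    latency_nice = max(-20, min(19, latency_nice))
    min_scan = min(min_scan, llc_size)
    frac = (latency_nice + 20) / 39
    return min_scan + round(frac * (llc_size - min_scan))


def find_idle_cpu(idle_mask, start, limit):
    """Scan at most `limit` CPUs from `start`, wrapping around.

    idle_mask[i] is True when CPU i is idle. Returns the first idle
    CPU found, or None when the bounded scan finds nothing.
    """
    n = len(idle_mask)
    for i in range(min(limit, n)):
        cpu = (start + i) % n
        if idle_mask[cpu]:
            return cpu
    return None
```

With a mask where only CPU 3 of 4 is idle, a limit of 2 gives up early and returns None, while a limit of 4 finds CPU 3. That is the trade-off being discussed: a shorter wakeup path against a possibly missed idle core.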
2> TurboSched
( -Parth Shah )
====================
TurboSched [2] tries to minimize the number of active cores in a socket by
packing unimportant, low-utilization tasks (named jitter tasks) onto an
already active core, and thus refrains from waking up a new core if
possible. This requires userspace to tag tasks, hinting which
tasks are unimportant, so that waking up a new core to minimize
latency is unnecessary for them.
As per the discussion on the posted RFC, it would be appropriate to use the
task latency property, where a task with the highest latency-nice value can
be packed.
For this specific use case, a binary value indicating which tasks are
latency-sensitive and which are not is sufficient, but a range also works
well: above some threshold the task can be
packed.
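A rough model of the threshold-based packing decision might look like the following. It is not the TurboSched implementation from [2]. The Core type, the threshold of 15 and the tie-breaking rules are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Core:
    cpu: int
    active: bool   # core is already awake and running something
    spare: float   # spare capacity as a fraction, 0.0 to 1.0


def select_core(latency_nice, util, cores, nice_threshold=15):
    """Pack tolerant tasks on an active core, otherwise wake an idle one."""
    if latency_nice >= nice_threshold:
        fits = [c for c in cores if c.active and c.spare >= util]
        if fits:
            # Pack onto the active core with the most headroom.
            return max(fits, key=lambda c: c.spare).cpu
    idle = [c for c in cores if not c.active]
    if idle:
        return idle[0].cpu
    # Nothing idle: fall back to the least loaded core.
    return max(cores, key=lambda c: c.spare).cpu
```

Setting nice_threshold to the maximum value makes this behave like the binary scheme, which is why either value format would serve this use case.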
3> Wakeup path tunings
( -Patrick Bellasi )
==========================
Some additional possible use-cases were already discussed in [3]:
- dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
depending on crossing certain pre-configured threshold of latency
niceness.
- dynamically bias the vruntime updates we do in place_entity()
depending on the actual latency niceness of a task.
This means making the tweaks we already have for:
- START_DEBIT
- GENTLE_FAIR_SLEEPERS
a bit more parametric and proportional to the latency-nice of a task.
- bias the decisions we take in check_preempt_tick(), again depending
on a relative comparison of the latency niceness values of the current
and the waking task.
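Two of the ideas above, threshold-based policy selection and a preemption decision biased by the relative latency niceness, can be sketched as a toy model. This is not kernel code. The thresholds, the 1 ms base granularity and the linear scaling are assumptions, and both values are assumed to lie within [-20, 19], with lower meaning more latency-sensitive.

```python
def policy_for(latency_nice, batch_at=10, idle_at=19):
    """Pick a policy name once latency niceness crosses a threshold."""
    if latency_nice >= idle_at:
        return "SCHED_IDLE"
    if latency_nice >= batch_at:
        return "SCHED_BATCH"
    return "SCHED_OTHER"


def should_preempt(curr_ln, wakee_ln, vruntime_delta_ns,
                   base_gran_ns=1_000_000):
    """Decide whether the waking task should preempt the current one.

    vruntime_delta_ns is how far the current task's vruntime is ahead
    of the wakee's. The granularity shrinks when the wakee is more
    latency-sensitive than the current task and grows when it is less.
    """
    diff = wakee_ln - curr_ln              # in [-39, 39]
    gran = base_gran_ns * (40 + diff) // 40
    return vruntime_delta_ns > gran
```

With equal values the decision reduces to the plain granularity check; a latency-sensitive wakee preempts a tolerant task after a much smaller vruntime lead.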
4> Load balance tuning
( -Valentin Schneider )
======================
These were already mentioned in [4]:
- Increase (reduce) nr_balance_failed threshold when trying to active
balance a latency-sensitive (non-latency-sensitive) task.
- Increase (decrease) sched_migration_cost factor in task_hot() for
latency-sensitive (non-latency-sensitive) tasks.
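The second point can be illustrated with a small model of a cache-hotness check. The 500000 ns base is the usual default of sched_migration_cost; the scaling factor and the function names are assumptions, and lower latency-nice values are taken to mean more latency-sensitive.

```python
SCHED_MIGRATION_COST_NS = 500_000  # usual kernel default (0.5 ms)


def migration_cost_ns(latency_nice, base=SCHED_MIGRATION_COST_NS):
    """Scale the migration cost: 1.5x at -20, 1x at 0, about 0.5x at +19."""
    latency_nice = max(-20, min(19, latency_nice))
    return base * (40 - latency_nice) // 40


def task_hot(now_ns, exec_start_ns, latency_nice):
    """Treat the task as cache hot if it ran within the scaled window."""
    return (now_ns - exec_start_ns) < migration_cost_ns(latency_nice)
```

A task that last ran 600 us ago is still considered hot at latency-nice -20 (750 us window) but not at 0 (500 us window), so the load balancer would be more reluctant to migrate the latency-sensitive one.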
5> Separating AVX512 tasks and latency sensitive tasks on separate cores
( -Tim Chen )
===========================================================================
Another usecase we are considering is to segregate workloads that
pull down the core cpu frequency (e.g. AVX512) from workloads that are latency
sensitive. Certain tasks need to provide a fast response
time (latency sensitive), and they are best scheduled on a cpu that has a
lighter load and has no other tasks running on the sibling cpu that could
pull down the cpu core frequency.
Some users are running machine learning batch tasks with AVX512, and have
observed that these tasks affect the tasks needing a fast response. They
have to rely on manual CPU affinity to separate these tasks. With an
appropriate latency hint on the tasks, the scheduler can be taught to separate them.
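As a rough illustration of what "teaching the scheduler to separate them" could mean, the sketch below picks a core for a task while steering latency-sensitive tasks away from cores that currently host AVX512 work. The data layout and the function name are hypothetical, and how the scheduler would learn which cores run AVX512 tasks is outside the scope of the sketch.

```python
def pick_core(latency_sensitive, core_load, avx512_cores):
    """Return the least loaded core, avoiding AVX512 cores if sensitive.

    core_load maps core id to a load fraction; avx512_cores is the set
    of core ids on which some hardware thread runs an AVX512 task.
    """
    candidates = core_load
    if latency_sensitive:
        clean = {c: load for c, load in core_load.items()
                 if c not in avx512_cores}
        if clean:
            # Only restrict the choice when a clean core exists.
            candidates = clean
    return min(candidates, key=candidates.get)
```

A latency-sensitive task therefore accepts a busier core in exchange for not sharing it with a frequency-lowering neighbor, which replaces the manual CPU affinity setup described above.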
6> EAS
( -Qais Yousef )
====================
The new knob can help the EAS path switch to spreading behavior when
latency-nice is set, instead of packing tasks on the most energy efficient CPU,
i.e., pick the most energy efficient idle CPU.
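The difference between the two behaviors can be shown with a very small model. It is not the EAS code path: the tuple layout, the energy numbers and the function name are made up for illustration.

```python
def eas_pick_cpu(cpus, prefer_idle):
    """cpus is a list of (cpu_id, energy_cost, is_idle) tuples.

    Without the hint, pick the most energy efficient CPU (packing).
    With the hint, pick the most energy efficient *idle* CPU and fall
    back to packing only when no CPU is idle.
    """
    pool = [c for c in cpus if c[2]] if prefer_idle else []
    if not pool:
        pool = cpus
    return min(pool, key=lambda c: c[1])[0]
```

With CPU 0 busy but cheapest and CPUs 1 and 2 idle, the default choice is CPU 0, while a task with the latency hint set lands on CPU 1.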
Further doubts requiring community attention
---------------------------------------------
1. Who is the intended user for setting this value? (- Qais Yousef)
- system admin or application developer ?
Thanks everyone for your valuable inputs so far; I am asking for more of
the same. (◠﹏◠)
---------------
**References**
---------------
[1]. https://lkml.org/lkml/2019/8/30/829
[2]. https://lkml.org/lkml/2019/7/25/296
[3]. Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/
[4]. https://lkml.kernel.org/r/[email protected]
From: Parth Shah
> Sent: 30 September 2019 11:44
...
> 5> Separating AVX512 tasks and latency sensitive tasks on separate cores
> ( -Tim Chen )
> ===========================================================================
> Another usecase we are considering is to segregate those workload that will
> pull down core cpu frequency (e.g. AVX512) from workload that are latency
> sensitive. There are certain tasks that need to provide a fast response
> time (latency sensitive) and they are best scheduled on cpu that has a
> lighter load and not have other tasks running on the sibling cpu that could
> pull down the cpu core frequency.
>
> Some users are running machine learning batch tasks with AVX512, and have
> observed that these tasks affect the tasks needing a fast response. They
> have to rely on manual CPU affinity to separate these tasks. With
> appropriate latency hint on task, the scheduler can be taught to separate them.
Has this been diagnosed properly?
I can't really see how the frequency drop from AVX512 significantly affects latency.
Most tasks that require low latency probably don't do a lot of work.
It is much more likely that the latency issues happen because the AVX512 tasks
are doing very few system calls so can't be pre-empted even by a high priority task.
This 'feature' is hinted by this:
> 2> TurboSched
> ( -Parth Shah )
> ====================
> TurboSched [2] tries to minimize the number of active cores in a socket by
> packing an un-important and low-utilization (named jitter) task on an
> already active core and thus refrains from waking up of a new core if
> possible.
Consider this example of a process that requires low latency (sub 1ms would be good):
- A hardware interrupt (or timer interrupt) wakes up one thread.
- When that thread wakes it wakes up other threads that are sleeping.
- All the threads 'beaver away' for a few ms (processing RTP and other audio).
- They all sleep for the rest of a 10ms period.
The affinities are set so each thread runs on a separate cpu, and all are SCHED_RR.
Now loop all the cpus in userspace (run: while :; do :; done) and see what happens to the latencies.
You really want the SCHED_RR threads to immediately pre-empt the running processes.
But I suspect nothing happens until a timer interrupt to the target cpu.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 10/2/19 9:41 PM, David Laight wrote:
> From: Parth Shah
>> Sent: 30 September 2019 11:44
> ...
>> 5> Separating AVX512 tasks and latency sensitive tasks on separate cores
>> ( -Tim Chen )
>> ===========================================================================
>> Another usecase we are considering is to segregate those workload that will
>> pull down core cpu frequency (e.g. AVX512) from workload that are latency
>> sensitive. There are certain tasks that need to provide a fast response
>> time (latency sensitive) and they are best scheduled on cpu that has a
>> lighter load and not have other tasks running on the sibling cpu that could
>> pull down the cpu core frequency.
>>
>> Some users are running machine learning batch tasks with AVX512, and have
>> observed that these tasks affect the tasks needing a fast response. They
>> have to rely on manual CPU affinity to separate these tasks. With
>> appropriate latency hint on task, the scheduler can be taught to separate them.
>
> Has this been diagnosed properly?
> I can't really see how the frequency drop from AVX512 significantly affects latency.
> Most tasks that require low latency probably don't do a lot of work.
> It is much more likely that the latency issues happen because the AVX512 tasks
> are doing very few system calls so can't be pre-empted even by a high priority task.
> This 'feature' is hinted by this:
>> 2> TurboSched
>> ( -Parth Shah )
>> ====================
>> TurboSched [2] tries to minimize the number of active cores in a socket by
>> packing an un-important and low-utilization (named jitter) task on an
>> already active core and thus refrains from waking up of a new core if
>> possible.
>
You are correct that the two approaches contradict each other in some sense.
But what TurboSched tries to achieve is task packing only for the
tasks classified by the user as *latency in-sensitive*. Whereas, IIUC, what Tim
proposes here is to not pack *latency sensitive* tasks, and I guess that
aligns with the TurboSched approach as well, doesn't it?
Probably @Tim can shed some light on this for better clarification?
> Consider this example of a process that requires low latency (sub 1ms would be good):
> - A hardware interrupt (or timer interrupt) wakes up on thread.
> - When that thread wakes it wakes up other threads that are sleeping.
> - All the threads 'beaver away' for a few ms (processing RTP and other audio).
> - They all sleep for the rest of a 10ms period.
>
> The affinities are set so each thread runs on a separate cpu, and all are SCHED_RR.
> Now loop all the cpus in userspace (run: while :; do :; done) and see what happens to the latencies.
> You really want the SCHED_RR threads to immediately pre-empt the running processes.
> But I suspect nothing happens until a timer interrupt to the target cpu.
>
This is a good corner case where the scheduler can be optimized further, and
a per-task attribute like latency-nice can be of some help. Maybe we
can reduce the vslice of a task that has no latency constraints
while any RR/RT tasks are present.
> David
>
>
On 10/2/19 9:11 AM, David Laight wrote:
> From: Parth Shah
>> Sent: 30 September 2019 11:44
> ...
>> 5> Separating AVX512 tasks and latency sensitive tasks on separate cores
>> ( -Tim Chen )
>> ===========================================================================
>> Another usecase we are considering is to segregate those workload that will
>> pull down core cpu frequency (e.g. AVX512) from workload that are latency
>> sensitive. There are certain tasks that need to provide a fast response
>> time (latency sensitive) and they are best scheduled on cpu that has a
>> lighter load and not have other tasks running on the sibling cpu that could
>> pull down the cpu core frequency.
>>
>> Some users are running machine learning batch tasks with AVX512, and have
>> observed that these tasks affect the tasks needing a fast response. They
>> have to rely on manual CPU affinity to separate these tasks. With
>> appropriate latency hint on task, the scheduler can be taught to separate them.
>
> Has this been diagnosed properly?
> I can't really see how the frequency drop from AVX512 significantly affects latency.
> Most tasks that require low latency probably don't do a lot of work.
> It is much more likely that the latency issues happen because the AVX512 tasks
> are doing very few system calls so can't be pre-empted even by a high priority task.
This problem was conveyed to us by several customers. The issue is not
that you are slow to preempt an AVX512 task on the same logical cpu thread, but
that the AVX512 task on the sibling CPU thread drops the CPU frequency, lowering
performance and response. Let's say that you make the latency sensitive task a
real time task with high priority so it will immediately run on a cpu after
being woken. It will still be slower if there's an AVX512 task running on the
sibling than if other kinds of tasks are running on the sibling.
This is the noisy neighbor effect. So it is better to isolate the latency
sensitive tasks on cores that AVX512 tasks don't run on.
Tim
> This 'feature' is hinted by this:
>> 2> TurboSched
>> ( -Parth Shah )
>> ====================
>> TurboSched [2] tries to minimize the number of active cores in a socket by
>> packing an un-important and low-utilization (named jitter) task on an
>> already active core and thus refrains from waking up of a new core if
>> possible.
>
> Consider this example of a process that requires low latency (sub 1ms would be good):
> - A hardware interrupt (or timer interrupt) wakes up on thread.
> - When that thread wakes it wakes up other threads that are sleeping.
> - All the threads 'beaver away' for a few ms (processing RTP and other audio).
> - They all sleep for the rest of a 10ms period.
>
> The affinities are set so each thread runs on a separate cpu, and all are SCHED_RR.
> Now loop all the cpus in userspace (run: while :; do :; done) and see what happens to the latencies.
> You really want the SCHED_RR threads to immediately pre-empt the running processes.
> But I suspect nothing happens until a timer interrupt to the target cpu.
>
> David
>
>
On 9/30/19 4:13 PM, Parth Shah wrote:
> Hello everyone,
>
> This is the v2 of the discussion started for introducing per-task
> latency-nice attribute for providing scheduler hints.
>
> v1: https://lkml.org/lkml/2019/9/18/555
>
> In brief, we face two challenges with the introduction of such attr.
>
> 1. Name:
> ==============
> ( Should be relevant to all the possible usecases, not confuse end-user and
> reflect the functionality it provides to the scheduler behaviour )
>
> Curated list of proposed names:
>
> 1. latency-nice:
> should have a better understanding based on pre-existing concepts
>
> - But poses two interpretation ambiguity
> a) -20 (least nice to latency, i.e. sacrifice latency for throughput)
> +19 (most nice to latency, i.e. sacrifice throughput for latency)
> b) -20 (least nice to other task in terms of sacrificing latency, i.e.
> latency-sensitive)
> +19 (most nice to other tasks in terms of sacrificing latency, i.e.
> latency-forgoing)
>
> 2. latency-tolerant:
> decouples a bit its meaning from the niceness thus giving maybe a bit
> more freedom in its complete definition and perhaps avoid any
> possible interpretation confusion
>
> 3. latency-nasty
>
> 4. latency-sensible
+ 5. temper
-20 (short temper, angry tasks, i.e., requires least latency)
+19 (calm tasks, i.e., sacrifice latency for throughput)
>
>
>
> 2. Value(s):
> ==============
> ( Boolean/Ternary, Range of values, profile tagging )
>
> - Recent discussion plots the range of [-20, 19] to be the most agreed upon.
>
> 1. Range:
> - [-20, 19]:
> Which has similarities with the niceness concept and gives a minimal
> continuous range. This can be on hand for things like scaling the
> vruntime normalization [3]
>
> 2. Profile tagging:
> - Can be used just like a flag attribute
> e.g., Background, foreground, latency-sensible, reduce-idle-search, etc.
>
> 3. Binary:
> - 0 for: Latency sensitive/sensible/in-tolerant/hungry...
> - 1 for Latency insensitive/insensible/tolerant/nice-to-others/...
>
> Ternary:
> - 0: no effect
> - -1: require least latency
> - +1: no restrictions in terms of lower/higher latency
>
> [...]
I think the latency-tolerant name is the most relevant, and the range
[-20,19] will suit all the discussed usecases.
( ( ( tomatoes target here ) ) )
If this seems alright then I am thinking of writing some patches to
introduce p->latency-tolerant, set through the "sched_setattr" syscall.
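To show what carrying such a value through sched_setattr could involve, here is a userspace sketch that only builds the attribute buffer and makes no syscall. The first ten fields follow struct sched_attr as of Linux 5.3 (56 bytes); the trailing signed 32-bit field is HYPOTHETICAL, exists in no released kernel, and is named here only for illustration.

```python
import struct

# size, policy, flags, nice, priority, runtime, deadline, period,
# util_min, util_max: the 56-byte struct sched_attr of Linux 5.3.
SCHED_ATTR_FMT = "=IIQiIQQQII"
# Assumed extension: one extra s32 carrying the latency hint.
HYPOTHETICAL_FMT = SCHED_ATTR_FMT + "i"


def pack_attr(latency_tolerance, policy=0, nice=0):
    """Build a sched_attr-like buffer with the hypothetical extra field."""
    if not -20 <= latency_tolerance <= 19:
        raise ValueError("latency tolerance must be within [-20, 19]")
    size = struct.calcsize(HYPOTHETICAL_FMT)
    return struct.pack(HYPOTHETICAL_FMT, size, policy, 0, nice,
                       0, 0, 0, 0, 0, 0, latency_tolerance)
```

Because sched_attr carries its own size in the first field, appending a field is the usual way this interface is extended; whether a new field or a flag is the right shape here is still open.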
Thanks,
Parth