This is the 3rd revision of the patch set to introduce
latency_{nice/tolerance} as a per-task attribute.
The previous version can be found at:
v1: https://lkml.org/lkml/2019/11/25/151
v2: https://lkml.org/lkml/2019/12/8/10
Changes in this revision are:
v2 -> v3:
- Renamed the attribute from the longer "latency_tolerance" to "latency_nice",
as per the comment from Dietmar Eggemann https://lkml.org/lkml/2019/12/5/394
v1 -> v2:
- Addressed comments from Qais Yousef
- As per suggestion from Dietmar, moved content from newly created
include/linux/sched/latency_tolerance.h to kernel/sched/sched.h
- Extend sched_setattr() to support latency_tolerance in tools headers UAPI
Introduction:
==============
This patch series introduces a new per-task attribute latency_nice to
provide the scheduler with hints about the latency requirements of the task [1].
Latency_nice is a ranged attribute of a task, with values in [-20, 19],
both inclusive, which aligns it with the task nice value.
The value gives the scheduler a hint about the relative latency
requirements of tasks: a task with "latency_nice = -20" should have lower
latency requirements than tasks with higher values, while a task with
"latency_nice = 19" can tolerate higher latency and hence may not care
much about latency at all.
The default value is 0. The use cases discussed below can use this
[-20, 19] range of latency_nice for their specific purposes. This
series does not implement any use case of the attribute, so that any
change in naming or range does not affect much the other (future)
patches building on it. The actual use of latency_nice during task wakeup
and load balancing is yet to be coded for each of those use cases.
In my view, the attribute can be used in the following ways for some of
the use cases:
1 Reduce search scan time for select_idle_cpu():
- Reduce search scans for finding an idle CPU for a waking task with a
lower latency_nice value (see the sketch after this list).
2 TurboSched:
- Classify tasks with higher latency_nice values as small background
tasks when their historic utilization is very low, for which the
scheduler can search across more cores to do task packing. A task with
latency_nice >= some_threshold (e.g., == 19) and util <= 12.5% can be
treated as a background task.
3 Optimize AVX512 based workloads:
- Bias the scheduler against placing a task with latency_nice == -20 on a
core running an AVX512 based workload.
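As an illustration of use case 1 (not part of this series), below is a
minimal sketch of how latency_nice could bound the number of CPUs scanned
by select_idle_cpu(); the helper name and the linear scaling policy here
are only assumptions of mine:

/*
 * Illustrative sketch only: map latency_nice in [-20, 19] to how many
 * CPUs the idle search may scan. latency_nice == -20 scans very few
 * CPUs to keep wakeup latency low; latency_nice == 19 scans all of them.
 */
static inline int latency_nice_scan_limit(int latency_nice, int nr_cpus)
{
	int width = 19 - (-20) + 1;			/* LATENCY_NICE_WIDTH */
	int nr = nr_cpus * (latency_nice + 21) / width;	/* shift [-20, 19] to [1, 40] */

	return nr < 1 ? 1 : nr;
}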
Series Organization:
====================
- Patch 1: Add new attribute latency_nice to task_struct.
- Patch 2: Clone parent task's attribute to the child task on fork
- Patch 3: Add support for sched_{set,get}attr syscall to modify
latency_nice of the task
The patch series can be applied on tip/sched/core at the
commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
References:
============
[1]. Usecases for the per-task latency-nice attribute,
https://lkml.org/lkml/2019/9/30/215
[2]. Task Latency-nice, "Subhra Mazumdar",
https://lkml.org/lkml/2019/8/30/829
[3]. Introduce per-task latency_tolerance for scheduler hints,
https://lkml.org/lkml/2019/12/8/10
Parth Shah (3):
sched: Introduce latency-nice as a per-task attribute
sched/core: Propagate parent task's latency requirements to the child
task
sched: Allow sched_{get,set}attr to change latency_nice of the task
include/linux/sched.h | 1 +
include/uapi/linux/sched.h | 4 +++-
include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
kernel/sched/core.c | 21 +++++++++++++++++++++
kernel/sched/sched.h | 18 ++++++++++++++++++
tools/include/uapi/linux/sched.h | 4 +++-
6 files changed, 65 insertions(+), 2 deletions(-)
--
2.17.2
Latency-nice indicates the latency requirements of a task with respect
to the other tasks in the system. The value of the attribute can be within
the range [-20, 19], both inclusive, to be in line with the task nice
values.
latency_nice = -20 indicates that the task requires the least latency as
compared to tasks having latency_nice = +19.
The latency_nice value may affect only the CFS sched class, conveying the
latency requirements from userspace.
Signed-off-by: Parth Shah <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/sched.h | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 467d26046416..0668948fddcd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -676,6 +676,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
+ int latency_nice;
const struct sched_class *sched_class;
struct sched_entity se;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8ad11b..edae9277e48d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -101,6 +101,24 @@ extern long calc_load_fold_active(struct rq *this_rq, long adjust);
*/
#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE 19
+#define MIN_LATENCY_NICE -20
+
+#define LATENCY_NICE_WIDTH \
+ (MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE 0
+
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
--
2.17.2
Clone the parent task's latency_nice attribute to the forked child task.
Reset the latency_nice value to the default when the child task is
set to sched_reset_on_fork.
Signed-off-by: Parth Shah <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7b08d52db93..a7359836bf17 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2854,6 +2854,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
*/
p->prio = current->normal_prio;
+ /* Propagate the parent's latency requirements to the child as well */
+ p->latency_nice = current->latency_nice;
+
uclamp_fork(p);
/*
@@ -2870,6 +2873,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);
+ p->latency_nice = DEFAULT_LATENCY_NICE;
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
--
2.17.2
Introduce the latency_nice attribute in sched_attr and provide a
mechanism to change the value through the sched_setattr/sched_getattr
syscalls.
Also add a new flag, "SCHED_FLAG_LATENCY_NICE", to indicate a change to
the latency_nice of the task on a sched_setattr syscall.
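For reference, a minimal userspace sketch (illustration only) of setting
the calling task's latency_nice through the extended syscall; the struct
definition below mirrors the UAPI layout after this patch, and the raw
syscall is used since glibc does not provide a sched_setattr() wrapper:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SCHED_ATTR_SIZE_VER2	60	/* add: latency_nice */
#define SCHED_FLAG_LATENCY_NICE	0x80

/* Mirrors include/uapi/linux/sched/types.h after this patch */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = SCHED_ATTR_SIZE_VER2;
	/* note: without KEEP_* flags this also (re)sets SCHED_NORMAL, nice 0 */
	attr.sched_policy = 0;				/* SCHED_NORMAL */
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = 19;			/* latency tolerant */

	/* pid == 0 applies to the calling task */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}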
Signed-off-by: Parth Shah <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
---
include/uapi/linux/sched.h | 4 +++-
include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
kernel/sched/core.c | 17 +++++++++++++++++
tools/include/uapi/linux/sched.h | 4 +++-
4 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 4a0217832464..88ce1e8d7553 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -121,6 +121,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_LATENCY_NICE 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
@@ -132,6 +133,7 @@ struct clone_args {
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
SCHED_FLAG_KEEP_ALL | \
- SCHED_FLAG_UTIL_CLAMP)
+ SCHED_FLAG_UTIL_CLAMP | \
+ SCHED_FLAG_LATENCY_NICE)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index c852153ddb0d..626ab61c1145 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2 60 /* add: latency_nice */
/*
* Extended scheduling parameters data structure.
@@ -96,6 +97,22 @@ struct sched_param {
* on a CPU with a capacity big enough to fit the specified value.
* A task with a max utilization value smaller than 1024 is more likely
* scheduled on a CPU with no more capacity than the specified value.
+ *
+ * Latency Nice Attributes
+ * =======================
+ *
+ * A subset of sched_attr attributes allows specifying the relative latency
+ * requirements of a task with respect to the other tasks running/queued in the
+ * system.
+ *
+ * @ sched_latency_nice task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in the range of
+ * [MIN_LATENCY_NICE..MAX_LATENCY_NICE].
+ *
+ * A task with a latency_nice value of MIN_LATENCY_NICE can be taken as
+ * having lower latency requirements than a task with a higher
+ * latency_nice value.
*/
struct sched_attr {
__u32 size;
@@ -118,6 +135,8 @@ struct sched_attr {
__u32 sched_util_min;
__u32 sched_util_max;
+ /* latency requirement hints */
+ __s32 sched_latency_nice;
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7359836bf17..a9e5d157b1a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4713,6 +4713,8 @@ static void __setscheduler_params(struct task_struct *p,
p->rt_priority = attr->sched_priority;
p->normal_prio = normal_prio(p);
set_load_weight(p, true);
+
+ p->latency_nice = attr->sched_latency_nice;
}
/* Actually do priority change: must hold pi & rq lock. */
@@ -4870,6 +4872,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+ if (attr->sched_latency_nice > MAX_LATENCY_NICE)
+ return -EINVAL;
+ if (attr->sched_latency_nice < MIN_LATENCY_NICE)
+ return -EINVAL;
+ }
+
if (pi)
cpuset_read_lock();
@@ -4904,6 +4913,9 @@ static int __sched_setscheduler(struct task_struct *p,
goto change;
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
goto change;
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+ attr->sched_latency_nice != p->latency_nice)
+ goto change;
p->sched_reset_on_fork = reset_on_fork;
retval = 0;
@@ -5152,6 +5164,9 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
+ if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+ size < SCHED_ATTR_SIZE_VER2)
+ return -EINVAL;
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
@@ -5381,6 +5396,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
kattr.sched_nice = task_nice(p);
+ kattr.sched_latency_nice = p->latency_nice;
+
#ifdef CONFIG_UCLAMP_TASK
kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
diff --git a/tools/include/uapi/linux/sched.h b/tools/include/uapi/linux/sched.h
index 4a0217832464..fa67e2527169 100644
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -121,6 +121,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_LATENCY_NICE 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
@@ -132,6 +133,7 @@ struct clone_args {
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
SCHED_FLAG_KEEP_ALL | \
- SCHED_FLAG_UTIL_CLAMP)
+ SCHED_FLAG_UTIL_CLAMP | \
+ SCHED_FLAG_LATENCY_NICE)
#endif /* _UAPI_LINUX_SCHED_H */
--
2.17.2
On 1/16/20 5:32 PM, Parth Shah wrote:
> [...]
It has been a long time and a few revisions since the beginning of the
discussion around latency-nice. Hence I thought of asking if there is any
further work that needs to be done for adding the latency-nice attribute,
or am I missing any piece here?
Thanks,
Parth
On 2/17/20 3:57 AM, Parth Shah wrote:
>
>
> On 1/16/20 5:32 PM, Parth Shah wrote:
>> [...]
>
> Its been a long time and few revisions since the beginning of the
> discussion around the latency-nice. Hence thought of asking if there is/are
> any further work that needs to be done for adding latency-nice attribute or
> am I missing any piece in here?
All, I was asked to take a look at the original latency_nice patchset. First, to clarify objectives, Oracle is not
interested in trading throughput for latency. What we found is that the DB has specific tasks which do very little but
need to do this as absolutely quickly as possible, i.e. extreme latency sensitivity. Second, the key to latency
reduction in the task wakeup path seems to be limiting variations of the "idle cpu" search. The latter particularly
interests me as an example of "platform size based latency", which I believe to be important given all the varying-size
VMs and containers.
Parth, I've been using your v3 patchset as the basis of an investigation into the measurable effects of short-circuiting
this search. I'm not quite ready to put anything out, but the patchset is working well. The only feedback I have is that
currently a non-root user can set the value negative, which is inconsistent with 'nice' and, I would think, a security hole.
-chrish
Hi Chris,
On 2/19/20 4:30 AM, chris hyser wrote:
> On 2/17/20 3:57 AM, Parth Shah wrote:
>> [...]
>
> All, I was asked to take a look at the original latency_nice patchset.
> First, to clarify objectives, Oracle is not interested in trading
> throughput for latency. What we found is that the DB has specific tasks
> which do very little but need to do this as absolutely quickly as possible,
> ie extreme latency sensitivity. Second, the key to latency reduction in the
> task wakeup path seems to be limiting variations of "idle cpu" search. The
> latter particularly interests me as an example of "platform size based
> latency" which I believe to be important given all the varying size VMs and
> containers.
>
> Parth, I've been using your v3 patchset as the basis of an investigation
> into the measurable effects of short-circuiting this search. I'm not quite
> ready to put anything out, but the patchset is working well. The only
That's good news, as you are able to get a use case out of this patch-set.
> feedback I have is that currently non-root can set the value negative which
> is inconsistent with 'nice' and I would think a security hole.
>
I would assume you mean 'latency_nice' here.
From my testing, I was not able to set the latency_nice value of any
root-owned task as a non-root user. Also, my patch-set just piggybacks
on the already existing sched_setattr syscall and hence it should not allow
a non-root user to make any modifications. Can you confirm this by changing
the nice (renice) value of a root task from a non-root user?
I have done a sanity check in the code and am thinking about where it could
possibly have gone wrong. So, can you please specify what values you were
able to set outside the [-20, 19] range?
Thanks,
Parth
From: chris hyser
> Sent: 18 February 2020 23:00
...
> All, I was asked to take a look at the original latency_nice patchset.
> First, to clarify objectives, Oracle is not
> interested in trading throughput for latency.
> What we found is that the DB has specific tasks which do very little but
> need to do this as absolutely quickly as possible, ie extreme latency
> sensitivity. Second, the key to latency reduction
> in the task wakeup path seems to be limiting variations of "idle cpu" search.
> The latter particularly interests me as an example of "platform size
> based latency" which I believe to be important given all the varying size
> VMs and containers.
From my experiments there are a few things that seem to affect latency
of waking up real time (sched fifo) tasks on a normal kernel:
1) The time taken for the (Intel x86) cpu to wake up from monitor/mwait.
If the cpu is allowed to enter deeper sleep states this can take 900us.
Any changes to this are system-wide not process specific.
2) If the cpu an RT process last ran on (i.e. the one it is woken on) is
running in the kernel, the process switch won't happen until cond_resched()
is called.
On my system the code to flush the display frame buffer takes 3.3ms.
Compiling a kernel with CONFIG_PREEMPT=y will reduce this.
3) If a hardware interrupt happens just after the process is woken
then you have to wait until it finishes and any 'softint' work
that is scheduled on the same cpu finishes.
The ethernet driver transmit completions and receive ring filling
can easily take 1ms.
Booting with 'threadirqs' might help this.
4) If you need to acquire a lock/futex then you need to allow for the
process that holds it being delayed by a hardware interrupt (etc).
So even if the lock is only held for a few instructions it can take
a long time to acquire.
(I need to change some linked lists to arrays indexed by an atomically
incremented global index.)
FWIW I can't imagine how a database can have anything that is that
latency sensitive.
We are doing lots of channels of audio processing and have a lot of work
to do within 10ms to avoid audible errors.
David
On 2/19/20 5:09 AM, Parth Shah wrote:
> Hi Chris,
>
> On 2/19/20 4:30 AM, chris hyser wrote:
>> [...]
>
> That's a good news as you are able to get a usecase of this patch-set.
>
>> feedback I have is that currently non-root can set the value negative which
>> is inconsistent with 'nice' and I would think a security hole.
>>
>
> I would assume you mean 'latency_nice' here.
>
> From my testing, I was not able to set values for any root owned task's
> latency_nice value by the non-root user. Also, my patch-set just piggybacks
> on the already existing sched_setattr syscall and hence it should not allow
> non-root user to do any modifications. Can you confirm this by changing
> nice (renice) value of a root task from non-root user.
>
> I have done the sanity check in the code and thinking where it could
> possibly have gone wrong. So, can you please specify what values were you
> able to set outside the [-20, 19] range?
The checks prevent going outside that range, but negative numbers (-20 to -1) did not need root. Let me dig some more. I
verified this explicitly before sending the email, so something is up.
-chrish
On 2/19/20 6:18 AM, David Laight wrote:
> From: chris hyser
>> Sent: 18 February 2020 23:00
> ...
>> [...]
>
> From my experiments there are a few things that seem to affect latency
> of waking up real time (sched fifo) tasks on a normal kernel:
Sorry. I was only ever talking about sched_other, as per the original patchset. I realize the term extreme latency
sensitivity may have caused confusion. What that means to DB people is no doubt different from what it means to audio
people. :-)
>
> [...]
> FWIW I can't imagine how a database can have anything that is that
> latency sensitive.
> We are doing lots of channels of audio processing and have a lot of work
> to do within 10ms to avoid audible errors.
There are existing internal numbers, which I will ultimately have to reproduce, showing that simply short-cutting these
idle cpu searches has a significant benefit on DB performance on large hardware. However, that was for a different
patchset involving things I don't like, so I'm still exploring how to achieve similar results within the latency_nice
framework.
-chrish
On 2/19/20 9:15 AM, chris hyser wrote:
>> [...]
>
> The checks prevent being outside that range. But negative numbers -20 to -1 did not need root. Let me dig some more. I
> verified this explicitly before sending the email so something is up.
I went digging. This is absolutely repeatable. I checked that I do not unknowingly have CAP_SYS_NICE as a user. So
first, are we tying latency_nice to CAP_SYS_NICE? That seems like a reasonable thing, but I'm not sure I saw it stated
anywhere. Second, the only capability checked in __sched_setscheduler() in the patch I have is CAP_SYS_NICE, and those
checks will not return -EPERM for a negative latency_tolerance (in the code, aka latency_nice). Do I have the correct
version of the code? Am I missing something?
-chrish
On 2/19/20 11:53 PM, chris hyser wrote:
>
>
>> [...]
>
> I went digging. This is absolutely repeatable. I checked that I do not
> unknowingly have CAP_SYS_NICE as a user. So first, are we tying
> latency_nice to CAP_SYS_NICE? Seems like a reasonable thing, but not sure I
> saw this stated anywhere. Second, the only capability checked in
> __sched_setscheduler() in the patch I have is CAP_SYS_NICE and those checks
> will not return a -EPERM for a negative latency_tolerance (in the code, aka
> latency_nice). Do I have the correct version of the code? Am I missing
> something?
You are right. I have not added permission checks for setting the
latency_nice value. For task_nice, a non-root user has no permission to
set the value lower than the current value, which is not the case with
latency_nice.
In order to align with the permission checks of task_nice, I will add a
similar check (see the sketch below) and send out v4 of the series soon.
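Roughly something like the below (an untested sketch only; the exact
check and its placement inside __sched_setscheduler() may differ in v4):

	if (user && !capable(CAP_SYS_NICE)) {
		/*
		 * Sketch: a non-root user must not request a lower
		 * (i.e. stricter) latency_nice than the task already has.
		 */
		if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
		    attr->sched_latency_nice < p->latency_nice)
			return -EPERM;
	}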
Thanks for pointing out.
- Parth
On 2/20/20 2:04 PM, Parth Shah wrote:
>
>
> On 2/19/20 11:53 PM, chris hyser wrote:
>>
>>
>> On 2/19/20 9:15 AM, chris hyser wrote:
>>>
>>>
>>> On 2/19/20 5:09 AM, Parth Shah wrote:
>>>> Hi Chris,
>>>>
>>>> On 2/19/20 4:30 AM, chris hyser wrote:
>>>>> On 2/17/20 3:57 AM, Parth Shah wrote:
>>>>>>
>>>>>>
>>>>>> On 1/16/20 5:32 PM, Parth Shah wrote:
>>>>>>> ...
>>>>>>
>>>>>> Its been a long time and few revisions since the beginning of the
>>>>>> discussion around the latency-nice. Hence thought of asking if there
>>>>>> is/are
>>>>>> any further work that needs to be done for adding latency-nice
>>>>>> attribute or
>>>>>> am I missing any piece in here?
>>>>>
>>>>> All, I was asked to take a look at the original latency_nice patchset.
>>>>> First, to clarify objectives, Oracle is not interested in trading
>>>>> throughput for latency. What we found is that the DB has specific tasks
>>>>> which do very little but need to do this as absolutely quickly as
>>>>> possible,
>>>>> ie extreme latency sensitivity. Second, the key to latency reduction in
>>>>> the
>>>>> task wakeup path seems to be limiting variations of "idle cpu" search. The
>>>>> latter particularly interests me as an example of "platform size based
>>>>> latency" which I believe to be important given all the varying size VMs
>>>>> and
>>>>> containers.
>>>>>
>>>>> Parth, I've been using your v3 patchset as the basis of an investigation
>>>>> into the measurable effects of short-circuiting this search. I'm not quite
>>>>> ready to put anything out, but the patchset is working well. The only
>>>>
>>>> That's a good news as you are able to get a usecase of this patch-set.
>>>>
>>>>> feedback I have is that currently non-root can set the value negative
>>>>> which
>>>>> is inconsistent with 'nice' and I would think a security hole.
>>>>>
>>>>
>>>> I would assume you mean 'latency_nice' here.
>>>>
>>>> From my testing, I was not able to set values for any root owned task's
>>>> latency_nice value by the non-root user. Also, my patch-set just piggybacks
>>>> on the already existing sched_setattr syscall and hence it should not allow
>>>> non-root user to do any modifications. Can you confirm this by changing
>>>> nice (renice) value of a root task from non-root user.
>>>>
>>>> I have done the sanity check in the code and thinking where it could
>>>> possibly have gone wrong. So, can you please specify what values were you
>>>> able to set outside the [-20, 19] range?
>>>
>>> The checks prevent being outside that range. But negative numbers -20 to
>>> -1 did not need root. Let me dig some more. I verified this explicitly
>>> before sending the email so something is up.
>>
>> I went digging. This is absolutely repeatable. I checked that I do not
>> unknowingly have CAP_SYS_NICE as a user. So first, are we tying
>> latency_nice to CAP_SYS_NICE? Seems like a reasonable thing, but not sure I
>> saw this stated anywhere. Second, the only capability checked in
>> __sched_setscheduler() in the patch I have is CAP_SYS_NICE and those checks
>> will not return a -EPERM for a negative latency_tolerance (in the code, aka
>> latency_nice). Do I have the correct version of the code? Am I missing
>> something?
>
> You are right. I have not added permission checks for setting the
> latency_nice value. For the task_nice, non-root user has no permission to
> set the value lower than the current value which is not the case with the
> latency_nice.
>
> In order to align with the permission checks like task_nice, I will add the
> check similar to task_nice and send out the v4 of the series soon.
>
>
> Thanks for pointing out.
> - Parth
>
The below diff works well enough to align the permission checks with NICE.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bfcff5623f9..ef4a397c9170 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p,
 			return -EINVAL;
 		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
 			return -EINVAL;
+		/* Use the same security checks as NICE */
+		if (attr->sched_latency_nice < p->latency_nice &&
+		    !can_nice(p, attr->sched_latency_nice))
+			return -EPERM;
 	}
 
 	if (pi)
With the above in effect,
a non-root user can only increase the value up to +19, and once increased
it cannot be decreased; e.g., once a user sets latency_nice = 19, the same
user cannot set latency_nice = 18. This is the same effect as with NICE.
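To make the effect concrete, below is a minimal userspace sketch (not part
of the series) that exercises the check through sched_setattr(). The struct
layout and the SCHED_FLAG_LATENCY_NICE value are assumptions based on this
series' uapi changes, so please check the patched include/uapi/linux/sched.h
and include/uapi/linux/sched/types.h before relying on them.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

/* Local copy of the extended sched_attr; field layout assumed from the series. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* added by this series */
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed value, see the patched uapi header */

static int set_latency_nice(pid_t pid, int value)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));	/* policy 0 == SCHED_OTHER */
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = value;

	return syscall(SYS_sched_setattr, pid, &attr, 0);
}

int main(void)
{
	/* As non-root: raising the value of the current task succeeds... */
	if (set_latency_nice(0, 19))
		perror("latency_nice = 19");

	/* ...but lowering it again afterwards is expected to fail with EPERM. */
	if (set_latency_nice(0, 18))
		perror("latency_nice = 18");

	return 0;
}

Run as a normal user, the second call is the one the new check is meant to
reject.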
Are such permission checks required?
Unlike NICE, we are going to use latency_nice for scheduler hints only, so
wouldn't it make more sense to allow a user to increase or decrease the
value of the tasks they own?
- Parth
On 2/20/20 3:50 AM, Parth Shah wrote:
>
> ...
>
> The below diff works out well enough in-order to align permission checks
> with NICE.
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2bfcff5623f9..ef4a397c9170 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p,
> return -EINVAL;
> if (attr->sched_latency_nice < MIN_LATENCY_NICE)
> return -EINVAL;
> + /* Use the same security checks as NICE */
> + if (attr->sched_latency_nice < p->latency_nice &&
> + !can_nice(p, attr->sched_latency_nice))
> + return -EPERM;
> }
>
> if (pi)
>
> With the above in effect,
> A non-root user can only increase the value upto +19, and once increased
> cannot be decreased. e.g., a user once sets the value latency_nice = 19,
> the same user cannot set the value latency_nice = 18. This is the same
> effect as with NICE.
>
> Is such permission checks required?
>
> Unlike NICE, we are going to use latency_nice for scheduler hints only, and
> so won't it make more sense to allow a user to increase/decrease the values
> of their owned tasks?
Whether called a hint or not, it is a trade-off to reduce the latency of select tasks at the expense of the throughput of
the other tasks in the system. If any of the other tasks belong to other users, you would presumably require permission.
-chrish
From: chris hyser <[email protected]>
> Sent: 19 February 2020 17:17
>
> On 2/19/20 6:18 AM, David Laight wrote:
> > From: chris hyser
> >> Sent: 18 February 2020 23:00
> > ...
> >> All, I was asked to take a look at the original latency_nice patchset.
> >> First, to clarify objectives, Oracle is not
> >> interested in trading throughput for latency.
> >> What we found is that the DB has specific tasks which do very little but
> >> need to do this as absolutely quickly as possible, ie extreme latency
> >> sensitivity. Second, the key to latency reduction
> >> in the task wakeup path seems to be limiting variations of "idle cpu" search.
> >> The latter particularly interests me as an example of "platform size
> >> based latency" which I believe to be important given all the varying size
> >> VMs and containers.
> >
> > From my experiments there are a few things that seem to affect latency
> > of waking up real time (sched fifo) tasks on a normal kernel:
>
> Sorry. I was only ever talking about sched_other as per the original patchset. I realize the term
> extreme latency
> sensitivity may have caused confusion. What that means to DB people is no doubt different than audio
> people. :-)
Shorter lines.....
ISTM you are making some already complicated code even more complex.
Better to make it simpler instead.
If you need a thread to run as soon as possible after it is woken,
why not use the RT scheduler (e.g. SCHED_FIFO)? That is what it is for.
If there are delays finding an idle cpu to migrate a process to
(especially on systems with large numbers of cpu) then that is a
general problem that can be addressed without extra knobs.
David
On 02/20/20 09:30, chris hyser wrote:
> > The below diff works out well enough in-order to align permission checks
> > with NICE.
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2bfcff5623f9..ef4a397c9170 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p,
> > return -EINVAL;
> > if (attr->sched_latency_nice < MIN_LATENCY_NICE)
> > return -EINVAL;
> > + /* Use the same security checks as NICE */
> > + if (attr->sched_latency_nice < p->latency_nice &&
> > + !can_nice(p, attr->sched_latency_nice))
> > + return -EPERM;
> > }
> >
> > if (pi)
> >
> > With the above in effect,
> > A non-root user can only increase the value upto +19, and once increased
> > cannot be decreased. e.g., a user once sets the value latency_nice = 19,
> > the same user cannot set the value latency_nice = 18. This is the same
> > effect as with NICE.
> >
> > Is such permission checks required?
> >
> > Unlike NICE, we are going to use latency_nice for scheduler hints only, and
> > so won't it make more sense to allow a user to increase/decrease the values
> > of their owned tasks?
>
> Whether called a hint or not, it is a trade-off to reduce latency of select
> tasks at the expense of the throughput of the other tasks in the the system.
Does it actually affect the throughput of the other tasks? I thought this would
allow the scheduler to reduce latencies, for instance, when selecting which cpu
a task should land on. I can't see how this could hurt other tasks.
Can you expand on the scenario you have in mind, please?
> If any of the other tasks belong to other users, you would presumably
> require permission.
AFAIU security_task_setscheduler() will only allow a change if the task is
changing its own attribute value or has CAP_SYS_NICE.
If you're able to change the attribute of another task, then it's not only
latency_nice that's broken here.
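For reference, the relevant existing checks in __sched_setscheduler() look
roughly like this (abridged from the unpatched code, not verbatim):

	if (user && !capable(CAP_SYS_NICE)) {
		if (fair_policy(policy)) {
			/* nice: lowering below the current value needs
			 * RLIMIT_NICE headroom, see can_nice() */
			if (attr->sched_nice < task_nice(p) &&
			    !can_nice(p, attr->sched_nice))
				return -EPERM;
		}

		/* can't touch another user's tasks */
		if (!check_same_owner(p))
			return -EPERM;
	}

	if (user) {
		retval = security_task_setscheduler(p);	/* LSM hook */
		if (retval)
			return retval;
	}

So without CAP_SYS_NICE you can only touch your own tasks, and the LSM hook
gets the final say.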
Thanks
--
Qais Yousef
On 2/20/20 9:39 AM, David Laight wrote:
> From: chris hyser <[email protected]>
>> Sent: 19 February 2020 17:17
>>
>> On 2/19/20 6:18 AM, David Laight wrote:
>>> From: chris hyser
>>>> Sent: 18 February 2020 23:00
>>> ...
>>>> All, I was asked to take a look at the original latency_nice patchset.
>>>> First, to clarify objectives, Oracle is not
>>>> interested in trading throughput for latency.
>>>> What we found is that the DB has specific tasks which do very little but
>>>> need to do this as absolutely quickly as possible, ie extreme latency
>>>> sensitivity. Second, the key to latency reduction
>>>> in the task wakeup path seems to be limiting variations of "idle cpu" search.
>>>> The latter particularly interests me as an example of "platform size
>>>> based latency" which I believe to be important given all the varying size
>>>> VMs and containers.
>>>
>>> From my experiments there are a few things that seem to affect latency
>>> of waking up real time (sched fifo) tasks on a normal kernel:
>>
>> Sorry. I was only ever talking about sched_other as per the original patchset. I realize the term
>> extreme latency
>> sensitivity may have caused confusion. What that means to DB people is no doubt different than audio
>> people. :-)
>
> Shorter lines.....
>
> ISTM you are making some already complicated code even more complex.
> Better to make it simpler instead.
The code already exists to set a limit to bail out of what is sometimes a needlessly excessive search. Setting that
based on an integer doesn't seem particularly complex. Now whether that is actually useful is what I'm currently looking at.
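As an illustration of the sort of thing I mean (made up for this mail, not
from any posted patch), the hint could simply bound how much of the LLC gets
scanned; the exact mapping is the part that still needs data:

#include <stdio.h>

/* Illustrative policy only: latency_nice < 0 trims the idle-CPU scan,
 * anything >= 0 keeps today's full search. */
int idle_scan_limit(int latency_nice, int nr_cpus_in_llc)
{
	int limit;

	if (latency_nice >= 0)
		return nr_cpus_in_llc;		/* no trimming */

	/* scale the scan down linearly with the hint, but always look at one CPU */
	limit = nr_cpus_in_llc * (20 + latency_nice) / 20;

	return limit > 0 ? limit : 1;
}

int main(void)
{
	printf("llc=32, latency_nice=-20 -> scan %d\n", idle_scan_limit(-20, 32));
	printf("llc=32, latency_nice=-10 -> scan %d\n", idle_scan_limit(-10, 32));
	printf("llc=32, latency_nice=0   -> scan %d\n", idle_scan_limit(0, 32));
	return 0;
}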
>
> If you need a thread to run as soon as possible after it is woken
> why not use the RT scheduler (eg SCHED_FIFO) that is what it is for.
Overkill, and it doesn't play well with the cpu cgroup controller.
>
> If there are delays finding an idle cpu to migrate a process to
> (especially on systems with large numbers of cpu) then that is a
> general problem that can be addressed without extra knobs.
There is no if. It is a brute-force search. There are delays proportional to the search domain size. You can optimize
the hell out of the brute force, or you use obtained knowledge to bail out early. Getting that knowledge from the user
is a time-honored tradition. :-)
-chrish
On 2/20/20 10:03 AM, Qais Yousef wrote:
> On 02/20/20 09:30, chris hyser wrote:
>>> The below diff works out well enough in-order to align permission checks
>>> with NICE.
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 2bfcff5623f9..ef4a397c9170 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p,
>>> return -EINVAL;
>>> if (attr->sched_latency_nice < MIN_LATENCY_NICE)
>>> return -EINVAL;
>>> + /* Use the same security checks as NICE */
>>> + if (attr->sched_latency_nice < p->latency_nice &&
>>> + !can_nice(p, attr->sched_latency_nice))
>>> + return -EPERM;
>>> }
>>>
>>> if (pi)
>>>
>>> With the above in effect,
>>> A non-root user can only increase the value upto +19, and once increased
>>> cannot be decreased. e.g., a user once sets the value latency_nice = 19,
>>> the same user cannot set the value latency_nice = 18. This is the same
>>> effect as with NICE.
>>>
>>> Is such permission checks required?
>>>
>>> Unlike NICE, we are going to use latency_nice for scheduler hints only, and
>>> so won't it make more sense to allow a user to increase/decrease the values
>>> of their owned tasks?
>>
>> Whether called a hint or not, it is a trade-off to reduce latency of select
>> tasks at the expense of the throughput of the other tasks in the the system.
>
> Does it actually affect the throughput of the other tasks? I thought this will
> allow the scheduler to reduce latencies, for instance, when selecting which cpu
> it should land on. I can't see how this could hurt other tasks.
This is why it is hard to argue about pure abstractions. The primary idea mentioned so far for how these latencies are
reduced is by short-circuiting the brute-force search for something idle. If you don't find an idle cpu because you didn't
spend the time to look, then you preempted a task, possibly one with a large, warm cache footprint that was cranking
away on throughput. It is ultimately going to be the usual latency vs throughput trade-off. If latency reduction were
"free" we wouldn't need a per-task attribute. We would just do the reduction for all tasks, everywhere, all the time.
>
> Can you expand on the scenario you have in mind please?
Hopefully, the above helps. It was my original plan to introduce this with a data-laden RFC on the topic, but I felt the
need to respond to Parth immediately. I'm not currently pushing any particular change.
-chrish
On 02/20/20 11:34, chris hyser wrote:
> > > Whether called a hint or not, it is a trade-off to reduce latency of select
> > > tasks at the expense of the throughput of the other tasks in the the system.
> >
> > Does it actually affect the throughput of the other tasks? I thought this will
> > allow the scheduler to reduce latencies, for instance, when selecting which cpu
> > it should land on. I can't see how this could hurt other tasks.
>
> This is why it is hard to argue about pure abstractions. The primary idea
> mentioned so far for how these latencies are reduced is by short cutting the
> brute-force search for something idle. If you don't find an idle cpu because
> you didn't spend the time to look, then you pre-empted a task, possibly with
> a large nice warm cache footprint that was cranking away on throughput. It
> is ultimately going to be the usual latency vs throughput trade off. If
> latency reduction were "free" we wouldn't need a per task attribute. We
> would just do the reduction for all tasks, everywhere, all the time.
This could still happen without the latency nice bias. I'm not sure if this
falls under DoS; maybe if you end up spawning a lot of tasks with a high
latency_nice value, then you might end up cramming a lot of tasks onto a small
subset of CPUs. But then, shouldn't the logic that uses latency_nice try to
handle this case anyway, since it could be legit?
Not sure if this can be used by someone to trigger timing-based attacks on
another process.
I can't fully see the whole security implications, but regardless, I do agree
it is prudent not to allow tasks to set their own latency_nice, mainly because
the meaning of this flag will be system dependent and I think admins are the
better ones to decide how to use this flag for the system they're running on.
I don't think application writers should be able to tweak their tasks'
latency_nice value, not unless they can get the right privilege at least.
>
> >
> > Can you expand on the scenario you have in mind please?
>
> Hopefully, the above helps. It was my original plan to introduce this with a
> data laden RFC on the topic, but I felt the need to respond to Parth
> immediately. I'm not currently pushing any particular change.
Thanks!
--
Qais Yousef
On 2/21/20 2:59 PM, Qais Yousef wrote:
> On 02/20/20 11:34, chris hyser wrote:
>>>> Whether called a hint or not, it is a trade-off to reduce latency of select
>>>> tasks at the expense of the throughput of the other tasks in the the system.
>>>
>>> Does it actually affect the throughput of the other tasks? I thought this will
>>> allow the scheduler to reduce latencies, for instance, when selecting which cpu
>>> it should land on. I can't see how this could hurt other tasks.
>>
>> This is why it is hard to argue about pure abstractions. The primary idea
>> mentioned so far for how these latencies are reduced is by short cutting the
>> brute-force search for something idle. If you don't find an idle cpu because
>> you didn't spend the time to look, then you pre-empted a task, possibly with
>> a large nice warm cache footprint that was cranking away on throughput. It
>> is ultimately going to be the usual latency vs throughput trade off. If
>> latency reduction were "free" we wouldn't need a per task attribute. We
>> would just do the reduction for all tasks, everywhere, all the time.
>
> This could still happen without the latency nice bias. I'm not sure if this
> falls under DoS; maybe if you end up spawning a lot of task with high latency
> nice value, then you might end up cramming a lot of tasks on a small subset of
> CPUs. But then, shouldn't the logic that uses latency_nice try to handle this
> case anyway since it could be legit?
>
> Not sure if this can be used by someone to trigger timing based attacks on
> another process.
>
> I can't fully see the whole security implications, but regardless. I do agree
> it is prudent to not allow tasks to set their own latency_nice. Mainly because
> the meaning of this flag will be system dependent and I think Admins are the
> better ones to decide how to use this flag for the system they're running on.
> I don't think application writers should be able to tweak their tasks
> latency_nice value. Not if they can't get the right privilege at least.
>
AFAICT, if latency_nice is really used only for scheduler hints about
latency requirements, then the worst that can happen is that a user
requests the least latency for all the tasks they create (which might not
even be achievable). Hence, unless we bias priority or vruntime, I don't see
a DoS possibility with latency_nice.
Being conservative for the future, and considering that only an admin may
make the right call here, I do think we should not allow a non-root user to
set negative values for latency_nice.
NICE does not allow reducing the task_nice value even if it was increased by
the same user in the past. From my understanding, this does not make much
sense for latency_nice, and the user should be free to choose any value in
the range [0, 19].
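In code terms, something like the below sketch in place of the NICE-style
check from the earlier diff is what I have in mind:

	/* Only negative (latency-critical) values need privilege; any
	 * movement inside [0, 19] stays unprivileged. */
	if (attr->sched_latency_nice < 0 && !capable(CAP_SYS_NICE))
		return -EPERM;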
- Parth
>>
>>>
>>> Can you expand on the scenario you have in mind please?
>>
>> Hopefully, the above helps. It was my original plan to introduce this with a
>> data laden RFC on the topic, but I felt the need to respond to Parth
>> immediately. I'm not currently pushing any particular change.
>
> Thanks!
>
> --
> Qais Yousef
>
On 2/21/20 5:01 AM, Parth Shah wrote:
>
>
> On 2/21/20 2:59 PM, Qais Yousef wrote:
>> On 02/20/20 11:34, chris hyser wrote:
>>>>> Whether called a hint or not, it is a trade-off to reduce latency of select
>>>>> tasks at the expense of the throughput of the other tasks in the the system.
>>>>
>>>> Does it actually affect the throughput of the other tasks? I thought this will
>>>> allow the scheduler to reduce latencies, for instance, when selecting which cpu
>>>> it should land on. I can't see how this could hurt other tasks.
>>>
>>> This is why it is hard to argue about pure abstractions. The primary idea
>>> mentioned so far for how these latencies are reduced is by short cutting the
>>> brute-force search for something idle. If you don't find an idle cpu because
>>> you didn't spend the time to look, then you pre-empted a task, possibly with
>>> a large nice warm cache footprint that was cranking away on throughput. It
>>> is ultimately going to be the usual latency vs throughput trade off. If
>>> latency reduction were "free" we wouldn't need a per task attribute. We
>>> would just do the reduction for all tasks, everywhere, all the time.
>>
>> This could still happen without the latency nice bias. I'm not sure if this
>> falls under DoS; maybe if you end up spawning a lot of task with high latency
>> nice value, then you might end up cramming a lot of tasks on a small subset of
>> CPUs. But then, shouldn't the logic that uses latency_nice try to handle this
>> case anyway since it could be legit?
>>
>> Not sure if this can be used by someone to trigger timing based attacks on
>> another process.
>>
>> I can't fully see the whole security implications, but regardless. I do agree
>> it is prudent to not allow tasks to set their own latency_nice. Mainly because
>> the meaning of this flag will be system dependent and I think Admins are the
>> better ones to decide how to use this flag for the system they're running on.
>> I don't think application writers should be able to tweak their tasks
>> latency_nice value. Not if they can't get the right privilege at least.
>>
>
> AFAICT if latency_nice is really used for scheduler hints to provide
> latency requirements, then the worst that can happen is the user can
> request to have all the created tasks get least latency (which might not
> even be possible). Hence, unless we bias priority or vruntime, I don't see
> the DoS possibility with latency_nice.
A latency nice that does not allow reducing the latency of select tasks at the expense of the throughput of other tasks
(i.e. the classic trade-off) doesn't seem to be all that useful and would fail to meet the requirements. Latency nice in
this view is a directive on how to make that trade-off, much like nice is a directive on how to trade off what should get
to run next, and we don't accept the system optionally treating +19 and -20 tasks the same. Even when we don't get
exactly what we want, we expect "best effort", and that is not the same as optional.
-chrish
On 2/21/20 4:29 AM, Qais Yousef wrote:
> On 02/20/20 11:34, chris hyser wrote:
>>>> Whether called a hint or not, it is a trade-off to reduce latency of select
>>>> tasks at the expense of the throughput of the other tasks in the the system.
>>>
>>> Does it actually affect the throughput of the other tasks? I thought this will
>>> allow the scheduler to reduce latencies, for instance, when selecting which cpu
>>> it should land on. I can't see how this could hurt other tasks.
>>
>> This is why it is hard to argue about pure abstractions. The primary idea
>> mentioned so far for how these latencies are reduced is by short cutting the
>> brute-force search for something idle. If you don't find an idle cpu because
>> you didn't spend the time to look, then you pre-empted a task, possibly with
>> a large nice warm cache footprint that was cranking away on throughput. It
>> is ultimately going to be the usual latency vs throughput trade off. If
>> latency reduction were "free" we wouldn't need a per task attribute. We
>> would just do the reduction for all tasks, everywhere, all the time.
>
> This could still happen without the latency nice bias. I'm not sure if this
True, but without a bias towards a user at the user's request, it's just normal scheduler policy. :-)
> falls under DoS; maybe if you end up spawning a lot of task with high latency
> nice value, then you might end up cramming a lot of tasks on a small subset of
> CPUs. But then, shouldn't the logic that uses latency_nice try to handle this
> case anyway since it could be legit?
One of the experiments I'm planning is basically that: how badly can this be abused by root? Like pretty much everything
else, if root wants to destroy the system there is not much you can do, but I find it useful to look at the pathological
cases as well.
>
> Not sure if this can be used by someone to trigger timing based attacks on
> another process.
>
> I can't fully see the whole security implications, but regardless. I do agree
> it is prudent to not allow tasks to set their own latency_nice. Mainly because
> the meaning of this flag will be system dependent and I think Admins are the
> better ones to decide how to use this flag for the system they're running on.
> I don't think application writers should be able to tweak their tasks
> latency_nice value. Not if they can't get the right privilege at least.
Agreed.
>
>>
>>>
>>> Can you expand on the scenario you have in mind please?
>>
>> Hopefully, the above helps. It was my original plan to introduce this with a
>> data laden RFC on the topic, but I felt the need to respond to Parth
>> immediately. I'm not currently pushing any particular change.
>
> Thanks!
No problem.
-chrish
On 2/19/20 5:09 AM, Parth Shah wrote:
> Hi Chris,
>> Parth, I've been using your v3 patchset as the basis of an investigation
>> into the measurable effects of short-circuiting this search. I'm not quite
>> ready to put anything out, but the patchset is working well. The only
>
> That's a good news as you are able to get a usecase of this patch-set.
Parth, I wanted to make sure to thank you for all this effort. It has made it really easy to start experimenting with
these ideas and concepts and that's been an enormous help. Thanks again.
-chrish