2024-02-20 06:17:07

by zhaoyang.huang

Subject: [PATCH 1/2] sched: introduce helper function to calculate distribution over sched class

From: Zhaoyang Huang <[email protected]>

As time spent in RT, DL and IRQ contexts can be deemed time lost by CFS tasks,
some timing measurements want to know approximately how these classes
share a measured interval, using the utilization accounting values
(nivcsw alone is not always enough). This commit introduces a helper
function to achieve this goal.

e.g.
Effective part of A = Total_time * cpu_util_cfs / cpu_util

Timing value A
(should come from a process lasting several ticks, or from statistics
over a repeated process)

Timing start
|
|
preempted by RT, DL or IRQ
|\
| This period is involuntary CPU give-up; we need to know how long it lasts
|/
sched in again
|
|
|
Timing end
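
For illustration, a caller could use the helper roughly as follows (a
hypothetical sketch, not part of the patch; do_some_work() is a
placeholder, not an existing function):

	unsigned long start, total, effective;

	start = jiffies;
	do_some_work();		/* may be preempted by RT, DL or IRQ */
	total = jiffies - start;

	/* scale the raw interval by the CFS share of CPU utilization */
	effective = cfs_prop_by_util(current, total);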

Signed-off-by: Zhaoyang Huang <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..99cf09c47f72 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2318,6 +2318,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)

/* Returns effective CPU energy utilization, as seen by the scheduler */
unsigned long sched_cpu_util(int cpu);
+unsigned long cfs_prop_by_util(struct task_struct *tsk, unsigned long val);
#endif /* CONFIG_SMP */

#ifdef CONFIG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 802551e0009b..217e2220fdc1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7494,6 +7494,26 @@ unsigned long sched_cpu_util(int cpu)
{
return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
}
+
+/*
+ * Calculate the approximate proportion of a timing value consumed in CFS.
+ * Be aware that this is based on util_avg, which is tracked as a geometric
+ * series decaying by y^32 = 0.5 (unit: 1ms). Hence only periods lasting
+ * at least several ticks, or statistics over repeated timings, are
+ * suitable inputs for this helper.
+ */
+unsigned long cfs_prop_by_util(struct task_struct *tsk, unsigned long val)
+{
+ unsigned int cpu = task_cpu(tsk);
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long util;
+
+ if (tsk->sched_class != &fair_sched_class)
+ return val;
+ util = cpu_util_rt(rq) + cpu_util_cfs(cpu) + cpu_util_irq(rq) + cpu_util_dl(rq);
+ return min(val, cpu_util_cfs(cpu) * val / util);
+}
+
#endif /* CONFIG_SMP */

/**
--
2.25.1
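
To make the decay rate concrete: y^32 = 0.5 means the PELT signal halves
every 32ms. A tiny standalone illustration (userspace C, not kernel
code):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		/* per-1ms decay factor: y = 0.5^(1/32) ~= 0.97857 */
		double y = pow(0.5, 1.0 / 32.0);

		for (int ms = 0; ms <= 128; ms += 32)
			printf("after %3d ms: %.4f of the signal remains\n",
			       ms, pow(y, ms));
		return 0;
	}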



2024-02-20 06:17:19

by zhaoyang.huang

Subject: [PATCH 2/2] block: adjust CFS request expire time

From: Zhaoyang Huang <[email protected]>

Under the current policy, CFS tasks may suffer involuntary IO latency
when preempted by RT/DL tasks or IRQs, since those contexts take
priority in both the CPU scheduler and the IO scheduler. This commit
introduces an approximate, lightweight method to reduce that effect by
scaling the request expire time with the CFS proportion of total CPU
active time.
The average utilization of a CPU's run queue reflects the historical
proportion of activity per task type, which makes it valid for this
goal from the following three perspectives:

1. The load (utilization) of every sched class is tracked and computed
in the same way (using the geometric series known as PELT).
2. The legacy policy is kept: the rq's position in the fifo_list is NOT
adjusted; only the expire time changes.
3. The fixed expire time (hundreds of ms) is in the same range as the
CPU avg_load accounting series (utilization decays to 0.5 in 32ms).

TaskA
sched in
|
|
|
submit_bio
|
|
|
fifo_time = jiffies + expire
(insert_request)

TaskB
sched in
|
|
vfs_xxx
|
|preempted by RT,DL,IRQ
|\
| This period is unfair to TaskB's IO request and should be adjusted for
|/
|
submit_bio
|
|
|
fifo_time = jiffies + expire * CFS_PROPORTION(rq)
(insert_request)
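
To make the arithmetic concrete (with illustrative numbers taken from
the trace output later in this thread, which imply HZ=250): for a write
request with dd->fifo_expire = 1250 jiffies (5s) on a CPU whose CFS
share of utilization is about 92%, the request gets
fifo_time = jiffies + 1250 * util_cfs / util_total, roughly
jiffies + 1149, whereas a request from a CPU fully occupied by CFS
keeps the original 1250-jiffy expire time.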

Signed-off-by: Zhaoyang Huang <[email protected]>
---
block/mq-deadline.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..1e538cb2783b 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -839,8 +839,15 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,

/*
* set expire time and add to fifo list
+ * The expire time is adjusted by the proportion of CFS activity in
+ * total CPU time over roughly the last few dozen milliseconds.
+ * Note that this does NOT affect the rq's position in the fifo_list;
+ * it only takes effect when the rq is checked for its expire time
+ * while at the head of the list.
*/
- rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
+ rq->fifo_time = jiffies +
+ cfs_prop_by_util(current, dd->fifo_expire[data_dir]);
+
insert_before = &per_prio->fifo_list[data_dir];
#ifdef CONFIG_BLK_DEV_ZONED
/*
--
2.25.1


2024-02-20 10:38:23

by Zhaoyang Huang

Subject: Re: [PATCH 2/2] block: adjust CFS request expire time

On Tue, Feb 20, 2024 at 5:42 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Feb 20, 2024 at 02:15:42PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > Under the current policy, CFS tasks may suffer involuntary IO latency
> > when preempted by RT/DL tasks or IRQs, since those contexts take
> > priority in both the CPU scheduler and the IO scheduler.
>
> What is 'current policy', what is CFS, what is RT/DL? What privilege
> is possessed?
CFS and RT/DL are sched classes, among which CFS has the least
privilege in getting the CPU.
IMO, 'current policy' refers to two perspectives:
1. An RT task on the same core as a CFS task gets priority over the
CFS task in both the CPU scheduler and the IO scheduler (with
mq-deadline on duty). Could we make the CFS requests' expire_time
earlier than it is now?
2. In terms of when the request is inserted, preempted CFS tasks
involuntarily lose fairness compared with non-preempted CFS tasks.
Could we decrease this impact in some way?
>
> > 1. The load (utilization) of every sched class is tracked and computed
> > in the same way (using the geometric series known as PELT).
> > 2. The legacy policy is kept: the rq's position in the fifo_list is NOT
> > adjusted; only the expire time changes.
> > 3. The fixed expire time (hundreds of ms) is in the same range as the
> > CPU avg_load accounting series (utilization decays to 0.5 in 32ms).
>
> What problem does this fix, i.e. what performance number are improved
> or what other effects does it have?
I have verified this commit with benchmark tools such as fio and
Androbench; neither regression nor improvement was found. Analysing
the log below [2], I find that CFS occupies most of the CPU most of
the time. Perhaps it would make more sense in the form of [1], where
the expire time is adjusted only when CFS is over-preempted beyond a
threshold.

[1]
-	rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
+	/* adjust expire time only when cfs is over-preempted beyond 50% */
+	fifo_expire = cfs_prop_by_util(current, 100) < 50 ?
+		cfs_prop_by_util(current, dd->fifo_expire[data_dir]) :
+		dd->fifo_expire[data_dir];
+	rq->fifo_time = jiffies + fifo_expire;

[2]
// prop is the proportion of CFS's util, which is mostly above 90 (90%)
// during common benchmark tests
kworker/u16:3-73 [000] ...1. 321.140143: dd_insert_request: dir 1, cfs 513, prop 91, orig_expire 1250, expire 1149
kworker/u16:3-73 [000] ...1. 321.140414: dd_insert_request: dir 1, cfs 513, prop 91, orig_expire 1250, expire 1149
kworker/u16:3-73 [000] ...1. 321.140505: dd_insert_request: dir 1, cfs 513, prop 91, orig_expire 1250, expire 1149
kworker/u16:3-73 [000] ...1. 321.140574: dd_insert_request: dir 1, cfs 513, prop 91, orig_expire 1250, expire 1149
kworker/u16:3-73 [000] ...1. 321.140630: dd_insert_request: dir 1, cfs 513, prop 91, orig_expire 1250, expire 1149
kworker/u16:3-73 [000] ...1. 321.140682: dd_insert_request: dir 1, cfs 513, prop 91, orig_expire 1250, expire 1149
kworker/u16:3-73 [000] ...1. 321.140736: dd_insert_request: dir 1, cfs 513, prop 91, orig_expire 1250, expire 1149
dd-7296 [006] ...1. 321.143139: dd_insert_request: dir 0, cfs 610, prop 92, orig_expire 125, expire 115
dd-7296 [006] ...1. 321.143287: dd_insert_request: dir 0, cfs 610, prop 92, orig_expire 125, expire 115
dd-7296 [004] ...1. 321.156074: dd_insert_request: dir 0, cfs 691, prop 97, orig_expire 125, expire 122
dd-7296 [004] ...1. 321.156202: dd_insert_request: dir 0, cfs 691, prop 97, orig_expire 125, expire 122
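
The prints come from debug instrumentation roughly like the following
(a simplified sketch for readability; it is not part of the posted
patch):

	trace_printk("dir %d,cfs %lu, prop %lu, orig_expire %d, expire %lu\n",
		     (int)data_dir,
		     cpu_util_cfs(task_cpu(current)),
		     cfs_prop_by_util(current, 100),
		     dd->fifo_expire[data_dir],
		     cfs_prop_by_util(current, dd->fifo_expire[data_dir]));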

>
> > + * The expire time is adjusted via calculating the proportion of
> > + * CFS's activation among whole cpu time during last several
> > + * dazen's ms.Whearas, this would NOT affect the rq's position in
> > + * fifo_list but only take effect when this rq is checked for its
> > + * expire time when at head.
> > */
>
> Please spell check the comment and fix the formatting to have white
> spaces after sentences and never exceed 80 characters in block comments.
ok.
>

2024-02-20 12:20:25

by Christoph Hellwig

Subject: Re: [PATCH 2/2] block: adjust CFS request expire time

On Tue, Feb 20, 2024 at 02:15:42PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> Under the current policy, CFS tasks may suffer involuntary IO latency
> when preempted by RT/DL tasks or IRQs, since those contexts take
> priority in both the CPU scheduler and the IO scheduler.

What is 'current policy', what is CFS, what is RT/DL? What privilege
is possessed?

> 1. The load (utilization) of every sched class is tracked and computed
> in the same way (using the geometric series known as PELT).
> 2. The legacy policy is kept: the rq's position in the fifo_list is NOT
> adjusted; only the expire time changes.
> 3. The fixed expire time (hundreds of ms) is in the same range as the
> CPU avg_load accounting series (utilization decays to 0.5 in 32ms).

What problem does this fix, i.e. what performance number are improved
or what other effects does it have?

> + * The expire time is adjusted via calculating the proportion of
> + * CFS's activation among whole cpu time during last several
> + * dazen's ms.Whearas, this would NOT affect the rq's position in
> + * fifo_list but only take effect when this rq is checked for its
> + * expire time when at head.
> */

Please spell check the comment and fix the formatting to have white
spaces after sentences and never exceed 80 characters in block comments.


2024-02-21 17:51:44

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched: introduce helper function to calculate distribution over sched class

On Tue, 20 Feb 2024 at 07:16, zhaoyang.huang <[email protected]> wrote:
>
> From: Zhaoyang Huang <[email protected]>
>
> As time spent in RT, DL and IRQ contexts can be deemed time lost by CFS tasks,

It's lost only if cfs has been actually preempted

> some timing measurements want to know approximately how these classes
> share a measured interval, using the utilization accounting values
> (nivcsw alone is not always enough). This commit introduces a helper
> function to achieve this goal.
>
> e.g.
> Effective part of A = Total_time * cpu_util_cfs / cpu_util
>
> Timing value A
> (should come from a process lasting several ticks, or from statistics
> over a repeated process)
>
> Timing start
> |
> |
> preempted by RT, DL or IRQ
> |\
> | This period is involuntary CPU give-up; we need to know how long it lasts
> |/

Preempted means that a cfs task stops running on the cpu and lets an
rt/dl task or an irq run on the cpu instead. We can't know that. We
know the average ratio of time spent in rt/dl and irq contexts, but
not whether the cpu was idle or running a cfs task.

> sched in again
> |
> |
> |
> Timing end
>
> Signed-off-by: Zhaoyang Huang <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 20 ++++++++++++++++++++
> 2 files changed, 21 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 77f01ac385f7..99cf09c47f72 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2318,6 +2318,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>
> /* Returns effective CPU energy utilization, as seen by the scheduler */
> unsigned long sched_cpu_util(int cpu);
> +unsigned long cfs_prop_by_util(struct task_struct *tsk, unsigned long val);
> #endif /* CONFIG_SMP */
>
> #ifdef CONFIG_RSEQ
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 802551e0009b..217e2220fdc1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7494,6 +7494,26 @@ unsigned long sched_cpu_util(int cpu)
> {
> return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
> }
> +
> +/*
> + * Calculate the approximate proportion of a timing value consumed in CFS.
> + * Be aware that this is based on util_avg, which is tracked as a geometric
> + * series decaying by y^32 = 0.5 (unit: 1ms). Hence only periods lasting
> + * at least several ticks, or statistics over repeated timings, are
> + * suitable inputs for this helper.
> + */
> +unsigned long cfs_prop_by_util(struct task_struct *tsk, unsigned long val)
> +{
> + unsigned int cpu = task_cpu(tsk);
> + struct rq *rq = cpu_rq(cpu);
> + unsigned long util;
> +
> + if (tsk->sched_class != &fair_sched_class)
> + return val;
> + util = cpu_util_rt(rq) + cpu_util_cfs(cpu) + cpu_util_irq(rq) + cpu_util_dl(rq);

This is not correct as irq is not on the same clock domain: look at
effective_cpu_util()

You don't care about idle time ?
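
For reference, effective_cpu_util() handles irq roughly as below
(simplified sketch of the kernel code; max is arch_scale_cpu_capacity(cpu)):

	irq = cpu_util_irq(rq);
	if (unlikely(irq >= max))
		return max;

	/* irq time is invisible to the rq clock, so scale the rest down */
	util = scale_irq_capacity(util, irq, max); /* util * (max - irq) / max */
	util += irq;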

> + return min(val, cpu_util_cfs(cpu) * val / util);
> +}
> +
> #endif /* CONFIG_SMP */
>
> /**
> --
> 2.25.1
>

2024-02-22 02:58:47

by Zhaoyang Huang

Subject: Re: [PATCH 1/2] sched: introduce helper function to calculate distribution over sched class

On Thu, Feb 22, 2024 at 1:51 AM Vincent Guittot
<[email protected]> wrote:
>
> On Tue, 20 Feb 2024 at 07:16, zhaoyang.huang <[email protected]> wrote:
> >
> > From: Zhaoyang Huang <[email protected]>
> >
> > As time spent in RT, DL and IRQ contexts can be deemed time lost by CFS tasks,
>
> It's lost only if cfs has been actually preempted
Yes. Actually, I just want an approximate proportion of how much the
CFS tasks (the whole runqueue) are preempted. Preemption among CFS
tasks is not considered.
>
> > some timing measurements want to know approximately how these classes
> > share a measured interval, using the utilization accounting values
> > (nivcsw alone is not always enough). This commit introduces a helper
> > function to achieve this goal.
> >
> > e.g.
> > Effective part of A = Total_time * cpu_util_cfs / cpu_util
> >
> > Timing value A
> > (should come from a process lasting several ticks, or from statistics
> > over a repeated process)
> >
> > Timing start
> > |
> > |
> > preempted by RT, DL or IRQ
> > |\
> > | This period is involuntary CPU give-up; we need to know how long it lasts
> > |/
>
> Preempted means that a cfs task stops running on the cpu and lets an
> rt/dl task or an irq run on the cpu instead. We can't know that. We
> know the average ratio of time spent in rt/dl and irq contexts, but
> not whether the cpu was idle or running a cfs task.
OK, I will take idle time into consideration; as explained above,
preemption among cfs tasks is deliberately not considered.
>
> > sched in again
> > |
> > |
> > |
> > Timing end
> >
> > Signed-off-by: Zhaoyang Huang <[email protected]>
> > ---
> > include/linux/sched.h | 1 +
> > kernel/sched/core.c | 20 ++++++++++++++++++++
> > 2 files changed, 21 insertions(+)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 77f01ac385f7..99cf09c47f72 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2318,6 +2318,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
> >
> > /* Returns effective CPU energy utilization, as seen by the scheduler */
> > unsigned long sched_cpu_util(int cpu);
> > +unsigned long cfs_prop_by_util(struct task_struct *tsk, unsigned long val);
> > #endif /* CONFIG_SMP */
> >
> > #ifdef CONFIG_RSEQ
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 802551e0009b..217e2220fdc1 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7494,6 +7494,26 @@ unsigned long sched_cpu_util(int cpu)
> > {
> > return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
> > }
> > +
> > +/*
> > + * Calculate the approximate proportion of a timing value consumed in CFS.
> > + * Be aware that this is based on util_avg, which is tracked as a geometric
> > + * series decaying by y^32 = 0.5 (unit: 1ms). Hence only periods lasting
> > + * at least several ticks, or statistics over repeated timings, are
> > + * suitable inputs for this helper.
> > + */
> > +unsigned long cfs_prop_by_util(struct task_struct *tsk, unsigned long val)
> > +{
> > + unsigned int cpu = task_cpu(tsk);
> > + struct rq *rq = cpu_rq(cpu);
> > + unsigned long util;
> > +
> > + if (tsk->sched_class != &fair_sched_class)
> > + return val;
> > + util = cpu_util_rt(rq) + cpu_util_cfs(cpu) + cpu_util_irq(rq) + cpu_util_dl(rq);
>
> This is not correct as irq is not on the same clock domain: look at
> effective_cpu_util()
>
> You don't care about idle time ?
ok, will check. thanks
>
> > + return min(val, cpu_util_cfs(cpu) * val / util);
> > +}
> > +
> > #endif /* CONFIG_SMP */
> >
> > /**
> > --
> > 2.25.1
> >