LinuxLists.cc - [PATCH] sched/fair: Reschedule the cfs

2024-05-24 13:41:35

Subject: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

I found that some tasks have been running for a long enough time and
have become illegal, but they are still not releasing the CPU. This
will increase the scheduling delay of other processes. Therefore, I
tried checking the current process in wakeup_preempt and entity_tick,
and if it is illegal, reschedule that cfs queue.

The modification can reduce the scheduling delay by about 30% when
RUN_TO_PARITY is enabled.
So far, it has been running well in my test environment, and I have
pasted some test results below.

I isolated four cores for testing. I ran Hackbench in the background
and observed the test results of cyclictest.

hackbench -g 4 -l 100000000 &
cyclictest --mlockall -D 5m -q

EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY

# Min Latencies: 00006 00006 00006 00006
LNICE(-19) # Avg Latencies: 00191 00122 00089 00066
# Max Latencies: 15442 07648 14133 07713

# Min Latencies: 00006 00010 00006 00006
LNICE(0) # Avg Latencies: 00466 00277 00289 00257
# Max Latencies: 38917 32391 32665 17710

# Min Latencies: 00019 00053 00010 00013
LNICE(19) # Avg Latencies: 37151 31045 18293 23035
# Max Latencies: 2688299 7031295 426196 425708

I'm actually a bit hesitant about placing this modification under the
NO_PARITY feature. This is because the modification conflicts with the
semantics of RUN_TO_PARITY. So, I captured and compared the number of
resched occurrences in wakeup_preempt to see if it introduced any
additional overhead.

Similarly, hackbench is used to stress the utilization of four cores to
100%, and the method for capturing the number of PREEMPT occurrences is
referenced from [1].

schedstats EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY CFS(6.5)
stats.check_preempt_count 5053054 5057286 5003806 5018589 5031908
stats.patch_cause_preempt_count ------- 858044 ------- 765726 -------
stats.need_preempt_count 570520 858684 3380513 3426977 1140821

From the above test results, there is a slight increase in the number of
resched occurrences in wakeup_preempt. However, the results vary with each
test, and sometimes the difference is not that significant. But overall,
the count of reschedules remains lower than that of CFS and is much less
than that of NO_PARITY.

[1]: https://lore.kernel.org/all/[email protected]/T/#m52057282ceb6203318be1ce9f835363de3bef5cb

Signed-off-by: Chunxin Zang <[email protected]>
Reviewed-by: Chen Yang <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..a0005d240db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
return;
#endif
+
+ if (!entity_eligible(cfs_rq, curr))
+ resched_curr(rq_of(cfs_rq));
}

@@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;

+ if (!entity_eligible(cfs_rq, se))
+ goto preempt;
+
find_matching_se(&se, &pse);
WARN_ON_ONCE(!pse);

--
2.34.1

2024-05-24 15:35:15

by Chen Yu

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
> I found that some tasks have been running for a long enough time and
> have become illegal, but they are still not releasing the CPU. This
> will increase the scheduling delay of other processes. Therefore, I
> tried checking the current process in wakeup_preempt and entity_tick,
> and if it is illegal, reschedule that cfs queue.
>
> The modification can reduce the scheduling delay by about 30% when
> RUN_TO_PARITY is enabled.
> So far, it has been running well in my test environment, and I have
> pasted some test results below.
>

Interesting, besides hackbench, I assume that you have workload in
real production environment that is sensitive to wakeup latency?

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 03be0d1330a6..a0005d240db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
> return;
> #endif
> +
> + if (!entity_eligible(cfs_rq, curr))
> + resched_curr(rq_of(cfs_rq));
> }
>

entity_tick() -> update_curr() -> update_deadline():
se->vruntime >= se->deadline ? resched_curr()
only current has expired its slice will it be scheduled out.

So here you want to schedule current out if its lag becomes 0.

In lastest sched/eevdf branch, it is controlled by two sched features:
RESPECT_SLICE: Inhibit preemption until the current task has exhausted it's slice.
RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c

Maybe something like this can achieve your goal
if (sched_feat(RUN_TOPARITY) && !entity_eligible(cfs_rq, curr))
resched_curr

>
> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> return;
>
> + if (!entity_eligible(cfs_rq, se))
> + goto preempt;
> +

Not sure if this is applicable, later in this function, pick_eevdf() checks
if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
be evicted. And this change does not consider the cgroup hierarchy.

Besides, the check of current eligiblity can get false negative result,
if the enqueued entity has a positive lag. Prateek proposed to
remove the check of current's eligibility in pick_eevdf():
https://lore.kernel.org/lkml/[email protected]/

If I understand your requirement correctly, you want to reduce the wakeup
latency. There are some codes under developed by Peter, which could
customized task's wakeup latency via setting its slice:
https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu

2024-05-25 06:42:13

by Mike Galbraith

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On Fri, 2024-05-24 at 21:40 +0800, Chunxin Zang wrote:
> I found that some tasks have been running for a long enough time and
> have become illegal, but they are still not releasing the CPU. This
> will increase the scheduling delay of other processes. Therefore, I
> tried checking the current process in wakeup_preempt and entity_tick,
> and if it is illegal, reschedule that cfs queue.

My box gave making the XXX below reality a two thumbs up when fiddling
with the original unfettered and a bit harsh RUN_TO_PARITY.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..922834f172b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8413,12 +8413,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
update_curr(cfs_rq);

/*
- * XXX pick_eevdf(cfs_rq) != se ?
+ * Run @curr until it is no longer our best option. Basing the preempt
+ * decision on @curr reselection puts any previous decisions back on the
+ * table in context "now", including granularity preservation decisions
+ * by RUN_TO_PARITY.
*/
- if (pick_eevdf(cfs_rq) == pse)
- goto preempt;
-
- return;
+ if (pick_eevdf(cfs_rq) == se)
+ return;

preempt:
resched_curr(rq);

2024-05-25 11:58:11

by Chen Yu

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On 2024-05-25 at 08:41:28 +0200, Mike Galbraith wrote:
> On Fri, 2024-05-24 at 21:40 +0800, Chunxin Zang wrote:
> > I found that some tasks have been running for a long enough time and
> > have become illegal, but they are still not releasing the CPU. This
> > will increase the scheduling delay of other processes. Therefore, I
> > tried checking the current process in wakeup_preempt and entity_tick,
> > and if it is illegal, reschedule that cfs queue.
>
> My box gave making the XXX below reality a two thumbs up when fiddling
> with the original unfettered and a bit harsh RUN_TO_PARITY.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..922834f172b0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8413,12 +8413,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> update_curr(cfs_rq);
>
> /*
> - * XXX pick_eevdf(cfs_rq) != se ?
> + * Run @curr until it is no longer our best option. Basing the preempt
> + * decision on @curr reselection puts any previous decisions back on the
> + * table in context "now", including granularity preservation decisions
> + * by RUN_TO_PARITY.
> */
> - if (pick_eevdf(cfs_rq) == pse)
> - goto preempt;
> -
> - return;
> + if (pick_eevdf(cfs_rq) == se)
> + return;
>

I suppose this change benefits the overloaded scenario:
neither current nor the wakee is the best one.

before: current continues to run.
after: best se in the tree preempts current.

hackbench -g 12 -l 1000000000 & (480 tasks, 2x of the CPUs)

cyclictest --mlockall -D 1m -q
before:
T: 0 (15983) P: 0 I:1000 C: 43054 Min: 11 Act: 144 Avg: 627 Max: 11446

after:
T: 0 (16473) P: 0 I:1000 C: 49822 Min: 7 Act: 160 Avg: 388 Max: 10190

Min, Avg, Max latency all decreased.

thanks,
Chenyu

2024-05-25 12:40:37

by Honglei Wang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On 2024/5/24 21:40, Chunxin Zang wrote:
> I found that some tasks have been running for a long enough time and
> have become illegal, but they are still not releasing the CPU. This
> will increase the scheduling delay of other processes. Therefore, I
> tried checking the current process in wakeup_preempt and entity_tick,
> and if it is illegal, reschedule that cfs queue.
>
> The modification can reduce the scheduling delay by about 30% when
> RUN_TO_PARITY is enabled.
> So far, it has been running well in my test environment, and I have
> pasted some test results below.
>
> I isolated four cores for testing. I ran Hackbench in the background
> and observed the test results of cyclictest.
>
> hackbench -g 4 -l 100000000 &
> cyclictest --mlockall -D 5m -q
>
> EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY
>
> # Min Latencies: 00006 00006 00006 00006
> LNICE(-19) # Avg Latencies: 00191 00122 00089 00066
> # Max Latencies: 15442 07648 14133 07713
>
> # Min Latencies: 00006 00010 00006 00006
> LNICE(0) # Avg Latencies: 00466 00277 00289 00257
> # Max Latencies: 38917 32391 32665 17710
>
> # Min Latencies: 00019 00053 00010 00013
> LNICE(19) # Avg Latencies: 37151 31045 18293 23035
> # Max Latencies: 2688299 7031295 426196 425708
>
> I'm actually a bit hesitant about placing this modification under the
> NO_PARITY feature. This is because the modification conflicts with the
> semantics of RUN_TO_PARITY. So, I captured and compared the number of
> resched occurrences in wakeup_preempt to see if it introduced any
> additional overhead.
>
> Similarly, hackbench is used to stress the utilization of four cores to
> 100%, and the method for capturing the number of PREEMPT occurrences is
> referenced from [1].
>
> schedstats EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY CFS(6.5)
> stats.check_preempt_count 5053054 5057286 5003806 5018589 5031908
> stats.patch_cause_preempt_count ------- 858044 ------- 765726 -------
> stats.need_preempt_count 570520 858684 3380513 3426977 1140821
>
> From the above test results, there is a slight increase in the number of
> resched occurrences in wakeup_preempt. However, the results vary with each
> test, and sometimes the difference is not that significant. But overall,
> the count of reschedules remains lower than that of CFS and is much less
> than that of NO_PARITY.
>
> [1]: https://lore.kernel.org/all/[email protected]/T/#m52057282ceb6203318be1ce9f835363de3bef5cb
>
> Signed-off-by: Chunxin Zang <[email protected]>
> Reviewed-by: Chen Yang <[email protected]>
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 03be0d1330a6..a0005d240db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
> return;
> #endif
> +
> + if (!entity_eligible(cfs_rq, curr))
> + resched_curr(rq_of(cfs_rq));
> }
>
>
> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> return;
>
> + if (!entity_eligible(cfs_rq, se))
> + goto preempt;
> +
> find_matching_se(&se, &pse);
> WARN_ON_ONCE(!pse);
>
Hi Chunxin,

Did you run a comparative test to see which modification is more helpful
on improve the latency? Modification at tick point makes more sense to
me. But, seems just resched arbitrarily in wakeup might introduce too
much preemption (and maybe more context switch?) in complex environment
such as cgroup hierarchy.

Thanks,
Honglei

2024-05-25 17:23:06

by Mike Galbraith

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

2024-05-27 08:05:35

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On Sat, May 25, 2024 at 08:41:28AM +0200, Mike Galbraith wrote:

> - if (pick_eevdf(cfs_rq) == pse)
> - goto preempt;
> -
> - return;
> + if (pick_eevdf(cfs_rq) == se)
> + return;

Right, this will preempt more.

This is probably going to make Prateek's case worse though. Then again,
I was already leaning to towards not making his stronger slice
protection default, because it simply hurts too much elsewhere.

Still, his observation that placing tasks can move V left which in turn
can cause the just scheduled in current non-eligible and cause
over-scheduling is valid -- just not sure what to do about it yet.

2024-05-27 09:54:27

by Mike Galbraith

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On Mon, 2024-05-27 at 10:05 +0200, Peter Zijlstra wrote:
> On Sat, May 25, 2024 at 08:41:28AM +0200, Mike Galbraith wrote:
>
> > -       if (pick_eevdf(cfs_rq) == pse)
> > -               goto preempt;
> > -
> > -       return;
> > +       if (pick_eevdf(cfs_rq) == se)
> > +               return;
>
> Right, this will preempt more.

Yeah, and for no tangible benefit that I can see. Repeating the mixed
load GUI vs compute testing a bunch of times, there's enough variance
to swamp any signal.

-Mike

2024-05-28 02:43:18

by Chunxin Zang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

> On May 24, 2024, at 23:30, Chen Yu <[email protected]> wrote:
>
> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>> I found that some tasks have been running for a long enough time and
>> have become illegal, but they are still not releasing the CPU. This
>> will increase the scheduling delay of other processes. Therefore, I
>> tried checking the current process in wakeup_preempt and entity_tick,
>> and if it is illegal, reschedule that cfs queue.
>>
>> The modification can reduce the scheduling delay by about 30% when
>> RUN_TO_PARITY is enabled.
>> So far, it has been running well in my test environment, and I have
>> pasted some test results below.
>>
>
> Interesting, besides hackbench, I assume that you have workload in
> real production environment that is sensitive to wakeup latency?

Hi Chen

Yes, my workload are quite sensitive to wakeup latency .
>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 03be0d1330a6..a0005d240db5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>> return;
>> #endif
>> +
>> + if (!entity_eligible(cfs_rq, curr))
>> + resched_curr(rq_of(cfs_rq));
>> }
>>
>
> entity_tick() -> update_curr() -> update_deadline():
> se->vruntime >= se->deadline ? resched_curr()
> only current has expired its slice will it be scheduled out.
>
> So here you want to schedule current out if its lag becomes 0.
>
> In lastest sched/eevdf branch, it is controlled by two sched features:
> RESPECT_SLICE: Inhibit preemption until the current task has exhausted it's slice.
> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>
> Maybe something like this can achieve your goal
> if (sched_feat(RUN_TOPARITY) && !entity_eligible(cfs_rq, curr))
> resched_curr
>
>>
>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>> return;
>>
>> + if (!entity_eligible(cfs_rq, se))
>> + goto preempt;
>> +
>
> Not sure if this is applicable, later in this function, pick_eevdf() checks
> if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
> be evicted. And this change does not consider the cgroup hierarchy.
>
> Besides, the check of current eligiblity can get false negative result,
> if the enqueued entity has a positive lag. Prateek proposed to
> remove the check of current's eligibility in pick_eevdf():
> https://lore.kernel.org/lkml/[email protected]/

Thank you for letting me know about Peter's latest updates and thoughts.
Actually, the original intention of my modification was to minimize the
traversal of the rb-tree as much as possible. For example, in the following
scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
resched, the scheduler will call 'pick_eevdf' again, traversing the
rb-tree once more. This ultimately results in the rb-tree being traversed
twice. If it's possible to determine that 'curr' is ineligible within 'wakeup_preempt'
and directly trigger a 'resched', it would reduce the traversal of the rb-tree
by one time.

wakeup_preempt-> pick_eevdf -> resched_curr
|->'traverse the rb-tree' |
schedule->pick_eevdf
|->'traverse the rb-tree'

Of course, this would break the semantics of RESPECT_SLICE as well as
RUN_TO_PARITY. So, this might be considered a performance enhancement
for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.

thanks
Chunxin

> If I understand your requirement correctly, you want to reduce the wakeup
> latency. There are some codes under developed by Peter, which could
> customized task's wakeup latency via setting its slice:
> https://lore.kernel.org/lkml/[email protected]/
>
> thanks,
> Chenyu

2024-05-28 05:02:56

by K Prateek Nayak

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

Hello Chunxin,

On 5/28/2024 8:12 AM, Chunxin Zang wrote:
>
>> On May 24, 2024, at 23:30, Chen Yu <[email protected]> wrote:
>>
>> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>>> I found that some tasks have been running for a long enough time and
>>> have become illegal, but they are still not releasing the CPU. This
>>> will increase the scheduling delay of other processes. Therefore, I
>>> tried checking the current process in wakeup_preempt and entity_tick,
>>> and if it is illegal, reschedule that cfs queue.
>>>
>>> The modification can reduce the scheduling delay by about 30% when
>>> RUN_TO_PARITY is enabled.
>>> So far, it has been running well in my test environment, and I have
>>> pasted some test results below.
>>>
>>
>> Interesting, besides hackbench, I assume that you have workload in
>> real production environment that is sensitive to wakeup latency?
>
> Hi Chen
>
> Yes, my workload are quite sensitive to wakeup latency .
>>
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>> return;
>>> #endif
>>> +
>>> + if (!entity_eligible(cfs_rq, curr))
>>> + resched_curr(rq_of(cfs_rq));
>>> }
>>>
>>
>> entity_tick() -> update_curr() -> update_deadline():
>> se->vruntime >= se->deadline ? resched_curr()
>> only current has expired its slice will it be scheduled out.
>>
>> So here you want to schedule current out if its lag becomes 0.
>>
>> In lastest sched/eevdf branch, it is controlled by two sched features:
>> RESPECT_SLICE: Inhibit preemption until the current task has exhausted it's slice.
>> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>>
>> Maybe something like this can achieve your goal
>> if (sched_feat(RUN_TOPARITY) && !entity_eligible(cfs_rq, curr))
>> resched_curr
>>
>>>
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>> return;
>>>
>>> + if (!entity_eligible(cfs_rq, se))
>>> + goto preempt;
>>> +
>>
>> Not sure if this is applicable, later in this function, pick_eevdf() checks
>> if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
>> be evicted. And this change does not consider the cgroup hierarchy.

The above line will be referred to as [1] below.

>>
>> Besides, the check of current eligiblity can get false negative result,
>> if the enqueued entity has a positive lag. Prateek proposed to
>> remove the check of current's eligibility in pick_eevdf():
>> https://lore.kernel.org/lkml/[email protected]/
>
> Thank you for letting me know about Peter's latest updates and thoughts.
> Actually, the original intention of my modification was to minimize the
> traversal of the rb-tree as much as possible. For example, in the following
> scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
> 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
> resched, the scheduler will call 'pick_eevdf' again, traversing the
> rb-tree once more. This ultimately results in the rb-tree being traversed
> twice. If it's possible to determine that 'curr' is ineligible within 'wakeup_preempt'
> and directly trigger a 'resched', it would reduce the traversal of the rb-tree
> by one time.
>
>
> wakeup_preempt-> pick_eevdf -> resched_curr
> |->'traverse the rb-tree' |
> schedule->pick_eevdf
> |->'traverse the rb-tree'

I see what you mean but a couple of things:

(I'm adding the check_preempt_wakeup_fair() hunk from the original patch
below for ease of interpretation)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 03be0d1330a6..a0005d240db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> return;
>
> + if (!entity_eligible(cfs_rq, se))
> + goto preempt;
> +

This check uses the root cfs_rq since "task_cfs_rq()" returns the
"rq->cfs" of the runqueue the task is on. In presence of cgroups or
CONFIG_SCHED_AUTOGROUP, there is a good chance this the task is queued
on a higher order cfs_rq and this entity_eligible() calculation might
not be valid since the vruntime calculation for the "se" is relative to
the "cfs_rq" where it is queued on. Please correct me if I'm wrong but
I believe that is what Chenyu was referring to in [1].

> find_matching_se(&se, &pse);
> WARN_ON_ONCE(!pse);
>
> --

In addition to that, There is an update_curr() call below for the first
cfs_rq where both the entities' hierarchy is queued which is found by
find_matching_se(). I believe that is required too to update the
vruntime and deadline of the entity where preemption can happen.

If you want to circumvent a second call to pick_eevdf(), could you
perhaps do:

(Only build tested)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9eb63573110c..653b1bee1e62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
update_curr(cfs_rq);

/*
- * XXX pick_eevdf(cfs_rq) != se ?
+ * If the hierarchy of current task is ineligible at the common
+ * point on the newly woken entity, there is a good chance of
+ * wakeup preemption by the newly woken entity. Mark for resched
+ * and allow pick_eevdf() in schedule() to judge which task to
+ * run next.
*/
- if (pick_eevdf(cfs_rq) == pse)
+ if (!entity_eligible(cfs_rq, se))
goto preempt;

return;

--

There are other implications here which is specifically highlighted by
the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the current waking
entity is not the entity with the earliest eligible virtual deadline,
the current task is still preempted if any other entity has the EEVD.

Mike's box gave switching to above two thumbs up; I have to check what
my box says :)

Following are DeathStarBench results with your original patch compared
to v6.9-rc5 based tip:sched/core:

==================================================================
Test : DeathStarBench
Why? : Some tasks here do no like aggressive preemption
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip eager_preempt (pct imp)
1CCD 1 1.00 0.99 (%diff: -1.13%)
2CCD 2 1.00 0.97 (%diff: -3.21%)
4CCD 3 1.00 0.97 (%diff: -3.41%)
8CCD 6 1.00 0.97 (%diff: -3.20%)
--

I'll give the variants mentioned in the thread a try too to see if
some of my assumptions around heavy preemption hold good. I was also
able to dig up an old patch by Balakumaran Kannan which skipped
pick_eevdf() altogether if "pse" is ineligible which also seems like
a good optimization based on current check in
check_preempt_wakeup_fair() but it perhaps doesn't help the case of
wakeup-latency sensitivity you are optimizing for; only reduces
rb-tree traversal if there is no chance of pick_eevdf() returning "pse"
https://lore.kernel.org/lkml/[email protected]/

--
Thanks and Regards,
Prateek

>
>
> Of course, this would break the semantics of RESPECT_SLICE as well as
> RUN_TO_PARITY. So, this might be considered a performance enhancement
> for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
>
> thanks
> Chunxin
>
>
>> If I understand your requirement correctly, you want to reduce the wakeup
>> latency. There are some codes under developed by Peter, which could
>> customized task's wakeup latency via setting its slice:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> thanks,
>> Chenyu

2024-05-28 06:42:02

by Chunxin Zang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

> On May 28, 2024, at 10:42, Chunxin Zang <[email protected]> wrote:
>
>>
>> On May 24, 2024, at 23:30, Chen Yu <[email protected]> wrote:
>>
>> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>>> I found that some tasks have been running for a long enough time and
>>> have become illegal, but they are still not releasing the CPU. This
>>> will increase the scheduling delay of other processes. Therefore, I
>>> tried checking the current process in wakeup_preempt and entity_tick,
>>> and if it is illegal, reschedule that cfs queue.
>>>
>>> The modification can reduce the scheduling delay by about 30% when
>>> RUN_TO_PARITY is enabled.
>>> So far, it has been running well in my test environment, and I have
>>> pasted some test results below.
>>>
>>
>> Interesting, besides hackbench, I assume that you have workload in
>> real production environment that is sensitive to wakeup latency?
>
> Hi Chen
>
> Yes, my workload are quite sensitive to wakeup latency .
>>
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>> return;
>>> #endif
>>> +
>>> + if (!entity_eligible(cfs_rq, curr))
>>> + resched_curr(rq_of(cfs_rq));
>>> }
>>>
>>
>> entity_tick() -> update_curr() -> update_deadline():
>> se->vruntime >= se->deadline ? resched_curr()
>> only current has expired its slice will it be scheduled out.
>>
>> So here you want to schedule current out if its lag becomes 0.
>>
>> In lastest sched/eevdf branch, it is controlled by two sched features:
>> RESPECT_SLICE: Inhibit preemption until the current task has exhausted it's slice.
>> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>>
>> Maybe something like this can achieve your goal
>> if (sched_feat(RUN_TOPARITY) && !entity_eligible(cfs_rq, curr))
>> resched_curr
>>
>>>
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>> return;
>>>
>>> + if (!entity_eligible(cfs_rq, se))
>>> + goto preempt;
>>> +
>>
>> Not sure if this is applicable, later in this function, pick_eevdf() checks
>> if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
>> be evicted. And this change does not consider the cgroup hierarchy.
>>
>> Besides, the check of current eligiblity can get false negative result,
>> if the enqueued entity has a positive lag. Prateek proposed to
>> remove the check of current's eligibility in pick_eevdf():
>> https://lore.kernel.org/lkml/[email protected]/
>
> Thank you for letting me know about Peter's latest updates and thoughts.
> Actually, the original intention of my modification was to minimize the
> traversal of the rb-tree as much as possible. For example, in the following
> scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
> 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
> resched, the scheduler will call 'pick_eevdf' again, traversing the
> rb-tree once more. This ultimately results in the rb-tree being traversed
> twice. If it's possible to determine that 'curr' is ineligible within 'wakeup_preempt'
> and directly trigger a 'resched', it would reduce the traversal of the rb-tree
> by one time.
>
>
> wakeup_preempt-> pick_eevdf -> resched_curr
> |->'traverse the rb-tree' |
> schedule->pick_eevdf
> |->'traverse the rb-tree'
>
>
> Of course, this would break the semantics of RESPECT_SLICE as well as
> RUN_TO_PARITY. So, this might be considered a performance enhancement
> for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
>
Sorry for the mistake. I mean it should be a performance enhancement for scenarios
with NO_RESPECT_SLICE/NO_RUN_TO_PARITY.

Maybe it should be like this

@@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;

+ if (!sched_feat(RESPECT_SLICE) && !sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, se))
+ goto preempt;
+

> thanks
> Chunxin
>
>
>> If I understand your requirement correctly, you want to reduce the wakeup
>> latency. There are some codes under developed by Peter, which could
>> customized task's wakeup latency via setting its slice:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> thanks,
>> Chenyu

2024-05-28 07:19:30

by Chunxin Zang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

Hi Prateek

> On May 28, 2024, at 13:02, K Prateek Nayak <[email protected]> wrote:
>
> Hello Chunxin,
>
> On 5/28/2024 8:12 AM, Chunxin Zang wrote:
>>
>>> On May 24, 2024, at 23:30, Chen Yu <[email protected]> wrote:
>>>
>>> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>>>> I found that some tasks have been running for a long enough time and
>>>> have become illegal, but they are still not releasing the CPU. This
>>>> will increase the scheduling delay of other processes. Therefore, I
>>>> tried checking the current process in wakeup_preempt and entity_tick,
>>>> and if it is illegal, reschedule that cfs queue.
>>>>
>>>> The modification can reduce the scheduling delay by about 30% when
>>>> RUN_TO_PARITY is enabled.
>>>> So far, it has been running well in my test environment, and I have
>>>> pasted some test results below.
>>>>
>>>
>>> Interesting, besides hackbench, I assume that you have workload in
>>> real production environment that is sensitive to wakeup latency?
>>
>> Hi Chen
>>
>> Yes, my workload are quite sensitive to wakeup latency .
>>>
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 03be0d1330a6..a0005d240db5 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>>> return;
>>>> #endif
>>>> +
>>>> + if (!entity_eligible(cfs_rq, curr))
>>>> + resched_curr(rq_of(cfs_rq));
>>>> }
>>>>
>>>
>>> entity_tick() -> update_curr() -> update_deadline():
>>> se->vruntime >= se->deadline ? resched_curr()
>>> only current has expired its slice will it be scheduled out.
>>>
>>> So here you want to schedule current out if its lag becomes 0.
>>>
>>> In lastest sched/eevdf branch, it is controlled by two sched features:
>>> RESPECT_SLICE: Inhibit preemption until the current task has exhausted it's slice.
>>> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
>>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>>>
>>> Maybe something like this can achieve your goal
>>> if (sched_feat(RUN_TOPARITY) && !entity_eligible(cfs_rq, curr))
>>> resched_curr
>>>
>>>>
>>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>> return;
>>>>
>>>> + if (!entity_eligible(cfs_rq, se))
>>>> + goto preempt;
>>>> +
>>>
>>> Not sure if this is applicable, later in this function, pick_eevdf() checks
>>> if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
>>> be evicted. And this change does not consider the cgroup hierarchy.
>
> The above line will be referred to as [1] below.
>
>>>
>>> Besides, the check of current eligiblity can get false negative result,
>>> if the enqueued entity has a positive lag. Prateek proposed to
>>> remove the check of current's eligibility in pick_eevdf():
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Thank you for letting me know about Peter's latest updates and thoughts.
>> Actually, the original intention of my modification was to minimize the
>> traversal of the rb-tree as much as possible. For example, in the following
>> scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
>> 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
>> resched, the scheduler will call 'pick_eevdf' again, traversing the
>> rb-tree once more. This ultimately results in the rb-tree being traversed
>> twice. If it's possible to determine that 'curr' is ineligible within 'wakeup_preempt'
>> and directly trigger a 'resched', it would reduce the traversal of the rb-tree
>> by one time.
>>
>>
>> wakeup_preempt-> pick_eevdf -> resched_curr
>> |->'traverse the rb-tree' |
>> schedule->pick_eevdf
>> |->'traverse the rb-tree'
>
> I see what you mean but a couple of things:
>
> (I'm adding the check_preempt_wakeup_fair() hunk from the original patch
> below for ease of interpretation)
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 03be0d1330a6..a0005d240db5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>> return;
>>
>> + if (!entity_eligible(cfs_rq, se))
>> + goto preempt;
>> +
>
> This check uses the root cfs_rq since "task_cfs_rq()" returns the
> "rq->cfs" of the runqueue the task is on. In presence of cgroups or
> CONFIG_SCHED_AUTOGROUP, there is a good chance this the task is queued
> on a higher order cfs_rq and this entity_eligible() calculation might
> not be valid since the vruntime calculation for the "se" is relative to
> the "cfs_rq" where it is queued on. Please correct me if I'm wrong but
> I believe that is what Chenyu was referring to in [1].

Thank you for explaining so much to me; I am trying to understand all of this. :)

>
>> find_matching_se(&se, &pse);
>> WARN_ON_ONCE(!pse);
>>
>> --
>
> In addition to that, There is an update_curr() call below for the first
> cfs_rq where both the entities' hierarchy is queued which is found by
> find_matching_se(). I believe that is required too to update the
> vruntime and deadline of the entity where preemption can happen.
>
> If you want to circumvent a second call to pick_eevdf(), could you
> perhaps do:
>
> (Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9eb63573110c..653b1bee1e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> update_curr(cfs_rq);
>
> /*
> - * XXX pick_eevdf(cfs_rq) != se ?
> + * If the hierarchy of current task is ineligible at the common
> + * point on the newly woken entity, there is a good chance of
> + * wakeup preemption by the newly woken entity. Mark for resched
> + * and allow pick_eevdf() in schedule() to judge which task to
> + * run next.
> */
> - if (pick_eevdf(cfs_rq) == pse)
> + if (!entity_eligible(cfs_rq, se))
> goto preempt;
>
> return;
>
> --
>
> There are other implications here which is specifically highlighted by
> the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the current waking
> entity is not the entity with the earliest eligible virtual deadline,
> the current task is still preempted if any other entity has the EEVD.
>
> Mike's box gave switching to above two thumbs up; I have to check what
> my box says :)
>
> Following are DeathStarBench results with your original patch compared
> to v6.9-rc5 based tip:sched/core:
>
> ==================================================================
> Test : DeathStarBench
> Why? : Some tasks here do no like aggressive preemption
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : Mean
> ==================================================================
> Pinning scaling tip eager_preempt (pct imp)
> 1CCD 1 1.00 0.99 (%diff: -1.13%)
> 2CCD 2 1.00 0.97 (%diff: -3.21%)
> 4CCD 3 1.00 0.97 (%diff: -3.41%)
> 8CCD 6 1.00 0.97 (%diff: -3.20%)
> --

Please forgive me as I have not used the DeathStarBench suite before. Does
this test result indicate that my modifications have resulted in tasks that do no
like aggressive preemption being even less likely to be preempted?

thanks
Chunxin

> I'll give the variants mentioned in the thread a try too to see if
> some of my assumptions around heavy preemption hold good. I was also
> able to dig up an old patch by Balakumaran Kannan which skipped
> pick_eevdf() altogether if "pse" is ineligible which also seems like
> a good optimization based on current check in
> check_preempt_wakeup_fair() but it perhaps doesn't help the case of
> wakeup-latency sensitivity you are optimizing for; only reduces
> rb-tree traversal if there is no chance of pick_eevdf() returning "pse"
> https://lore.kernel.org/lkml/[email protected]/
>
> --
> Thanks and Regards,
> Prateek
>
>>
>>
>> Of course, this would break the semantics of RESPECT_SLICE as well as
>> RUN_TO_PARITY. So, this might be considered a performance enhancement
>> for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
>>
>> thanks
>> Chunxin
>>
>>
>>> If I understand your requirement correctly, you want to reduce the wakeup
>>> latency. There are some codes under developed by Peter, which could
>>> customized task's wakeup latency via setting its slice:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> thanks,
>>> Chenyu

2024-05-28 07:47:50

by K Prateek Nayak

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

Hello Chunxin,

On 5/28/2024 12:48 PM, Chunxin Zang wrote:
> [..snip..]
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>> return;
>>>
>>> + if (!entity_eligible(cfs_rq, se))
>>> + goto preempt;
>>> +
>>
>> This check uses the root cfs_rq since "task_cfs_rq()" returns the
>> "rq->cfs" of the runqueue the task is on. In presence of cgroups or
>> CONFIG_SCHED_AUTOGROUP, there is a good chance this the task is queued
>> on a higher order cfs_rq and this entity_eligible() calculation might
>> not be valid since the vruntime calculation for the "se" is relative to
>> the "cfs_rq" where it is queued on. Please correct me if I'm wrong but
>> I believe that is what Chenyu was referring to in [1].
>
>
> Thank you for explaining so much to me; I am trying to understand all of this. :)
>
>>
>>> find_matching_se(&se, &pse);
>>> WARN_ON_ONCE(!pse);
>>>
>>> --
>>
>> In addition to that, There is an update_curr() call below for the first
>> cfs_rq where both the entities' hierarchy is queued which is found by
>> find_matching_se(). I believe that is required too to update the
>> vruntime and deadline of the entity where preemption can happen.
>>
>> If you want to circumvent a second call to pick_eevdf(), could you
>> perhaps do:
>>
>> (Only build tested)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9eb63573110c..653b1bee1e62 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> update_curr(cfs_rq);
>>
>> /*
>> - * XXX pick_eevdf(cfs_rq) != se ?
>> + * If the hierarchy of current task is ineligible at the common
>> + * point on the newly woken entity, there is a good chance of
>> + * wakeup preemption by the newly woken entity. Mark for resched
>> + * and allow pick_eevdf() in schedule() to judge which task to
>> + * run next.
>> */
>> - if (pick_eevdf(cfs_rq) == pse)
>> + if (!entity_eligible(cfs_rq, se))
>> goto preempt;
>>
>> return;
>>
>> --
>>
>> There are other implications here which is specifically highlighted by
>> the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the current waking
>> entity is not the entity with the earliest eligible virtual deadline,
>> the current task is still preempted if any other entity has the EEVD.
>>
>> Mike's box gave switching to above two thumbs up; I have to check what
>> my box says :)
>>
>> Following are DeathStarBench results with your original patch compared
>> to v6.9-rc5 based tip:sched/core:
>>
>> ==================================================================
>> Test : DeathStarBench
>> Why? : Some tasks here do no like aggressive preemption
>> Units : Normalized throughput
>> Interpretation: Higher is better
>> Statistic : Mean
>> ==================================================================
>> Pinning scaling tip eager_preempt (pct imp)
>> 1CCD 1 1.00 0.99 (%diff: -1.13%)
>> 2CCD 2 1.00 0.97 (%diff: -3.21%)
>> 4CCD 3 1.00 0.97 (%diff: -3.41%)
>> 8CCD 6 1.00 0.97 (%diff: -3.20%)
>> --
>
> Please forgive me as I have not used the DeathStarBench suite before. Does
> this test result indicate that my modifications have resulted in tasks that do no
> like aggressive preemption being even less likely to be preempted?

It is actually the opposite. In case of DeathStarBench, the nginx server
tasks responsible for being the entrypoint into the microservice chain
do not like to be preempted. A regression generally indicates that these
tasks have very likely been preempted as a result of which the throughput
drops. More information for DeathStarBench and the problem is highlighted
in https://lore.kernel.org/lkml/[email protected]/

I'll test with more workloads later today and update the thread. Please
forgive for any delay, I'm slowly crawling through a backlog of
testing.

--
Thanks and Regards,
Prateek

>
> thanks
> Chunxin
>
>> I'll give the variants mentioned in the thread a try too to see if
>> some of my assumptions around heavy preemption hold good. I was also
>> able to dig up an old patch by Balakumaran Kannan which skipped
>> pick_eevdf() altogether if "pse" is ineligible which also seems like
>> a good optimization based on current check in
>> check_preempt_wakeup_fair() but it perhaps doesn't help the case of
>> wakeup-latency sensitivity you are optimizing for; only reduces
>> rb-tree traversal if there is no chance of pick_eevdf() returning "pse"
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> [..snip..]
>>

2024-05-29 06:28:08

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

Hello,

kernel test robot

commit: e2bbd1c498980c5cb6 url:
Details are as below:
--------------------------

The kernel config
noticed a -11.8% regression of netperf.Throughput_Mbps on:
8f9973f418ae09f353258d ("[PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible")
hub.com/intel-lab-lkp/linux/commits/Chunxin-Zang/sched-fair-Reschedule-the-cfs_rq-when-current-is-ineligible/20240524-214314">https://github.com/intel-lab-lkp/linux/commits/Chunxin-Zang/sched-fair-Reschedule-the-cfs_rq-when-current-is-ineligible/20240524-214314
t.kernel.org/cgit/linux/kernel/git/tip/tip.git">https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 97450eb909658573dcacc1063b06d3d08642c0c1
s://lore.kernel.org/all/20240524134011.270861-1-spring.cxz@gmail.com/">https://lore.kernel.org/all/[email protected]/
sched/fair: Reschedule the cfs_rq when current is ineligible
threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
the commit also has significant impact on the following tests:
----------------------------------------------------------------------------------------------+
| stress-ng: stress-ng.fstat.ops_per_sec -3.9% regression |
| 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
| cpufreq_governor=performance |
|
|
|
|
|
----------------------------------------------------------------------------------------------+
| aim7: aim7.jobs-per-min 9.6% improvement |
| 96 threads 2 sockets Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (Cascade Lake) with 128G memory |
| cpufreq_governor=performance |
|
|
|
|
| test=sync_disk_rw |
----------------------------------------------------------------------------------------------+
| kbuild: kbuild.user_time_per_iteration 2.3% regression |
| 96 threads 2 sockets Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (Cascade Lake) with 128G memory |
| build_kconfig=defconfig |
| cpufreq_governor=performance |
|
|
|
----------------------------------------------------------------------------------------------+
in a separate patch/commit (i.e. not just a new version of
kindly add following tags
test robot <[email protected]>
//lore.kernel.org/oe-lkp/202405291359.3f662525-oliver.sang@intel.com">https://lore.kernel.org/oe-lkp/[email protected]
------------------------------------------------------------------------>
and materials to reproduce are available at:
.01.org/0day-ci/archive/20240529/202405291359.3f662525-oliver.sang@intel.com">https://download.01.org/0day-ci/archive/20240529/[email protected]
===============================================================
overnor/ip/kconfig/nr_threads/rootfs/runtime/tbox_group/test/testcase:
ormance/ipv4/x86_64-rhel-8.3/200%/debian-12-x86_64-20240206.cgz/300s/lkp-icl-2sp2/UDP_STREAM/netperf
Remove shift of thermal clock")
Reschedule the cfs_rq when current is ineligible")
e2bbd1c498980c5cb68f9973f41
---------------------------
%stddev
\
0.56 ? 4% mpstat.cpu.all.irq%
-12.6% 1632289 ? 8% meminfo.Active
-12.6% 1632257 ? 8% meminfo.Active(anon)
-12.7% 1629647 ? 9% numa-meminfo.node1.Active
-12.7% 1629633 ? 9% numa-meminfo.node1.Active(anon)
160.67 ? 18% perf-c2c.DRAM.local
5858 perf-c2c.DRAM.remote
-5.6% 6656686 vmstat.system.cs
-9.7% 173722 vmstat.system.in
-10.7% 1.458e+09 numa-numastat.node0.local_node
-10.7% 1.458e+09 numa-numastat.node0.numa_hit
-11.4% 1.446e+09 numa-numastat.node1.local_node
-11.4% 1.447e+09 numa-numastat.node1.numa_hit
-10.7% 1.458e+09 numa-vmstat.node0.numa_hit
-10.7% 1.458e+09 numa-vmstat.node0.numa_local
-12.6% 407484 ? 8% numa-vmstat.node1.nr_active_anon
-12.6% 407484 ? 8% numa-vmstat.node1.nr_zone_active_anon
-11.4% 1.447e+09 numa-vmstat.node1.numa_hit
-11.4% 1.446e+09 numa-vmstat.node1.numa_local
-12.7% 407846 ? 9% proc-vmstat.nr_active_anon
+2.0% 32110 proc-vmstat.nr_kernel_stack
-12.7% 407846 ? 9% proc-vmstat.nr_zone_active_anon
-11.0% 2.905e+09 proc-vmstat.numa_hit
-11.0% 2.904e+09 proc-vmstat.numa_local
-11.0% 2.32e+10 proc-vmstat.pgalloc_normal
-11.0% 2.32e+10 proc-vmstat.pgfree
26584 netperf.ThroughputBoth_Mbps
-10.3% 6783505 netperf.ThroughputBoth_total_Mbps
7388 netperf.ThroughputRecv_Mbps
-5.4% 1885347 netperf.ThroughputRecv_total_Mbps
19196 netperf.Throughput_Mbps
-12.1% 4898158 netperf.Throughput_total_Mbps
-5.3% 1.025e+09 netperf.time.involuntary_context_switches
8116 netperf.time.percent_of_cpu_this_job_got
-3.5% 24000 netperf.time.system_time
+1.2% 799.09 netperf.time.user_time
-10.3% 3.883e+09 netperf.workload
4.81 ? 4% sched_debug.cfs_rq:/.h_nr_running.max
0.77 sched_debug.cfs_rq:/.h_nr_running.stddev
19.70 ? 6% sched_debug.cfs_rq:/.load_avg.avg
8.07 ? 17% sched_debug.cfs_rq:/.removed.load_avg.avg
35.75 ? 7% sched_debug.cfs_rq:/.removed.load_avg.stddev
3.68 ? 9% sched_debug.cfs_rq:/.removed.runnable_avg.avg
3.68 ? 9% sched_debug.cfs_rq:/.removed.util_avg.avg
-10.3% 102.79 ? 4% sched_debug.cfs_rq:/.util_avg.stddev
55.47 ? 11% sched_debug.cpu.clock.stddev
0.00 ? 9% sched_debug.cpu.next_balance.stddev
4.75 ? 3% sched_debug.cpu.nr_running.max
0.76 ? 2% sched_debug.cpu.nr_running.stddev
-9.9% 6466454 ? 4% sched_debug.cpu.nr_switches.min
0.92 ? 24% sched_debug.rt_rq:.rt_time.avg
117.32 ? 24% sched_debug.rt_rq:.rt_time.max
10.33 ? 24% sched_debug.rt_rq:.rt_time.stddev
4.63 perf-stat.i.MPKI
-9.9% 2.113e+10 perf-stat.i.branch-instructions
1.09 perf-stat.i.branch-miss-rate%
-6.1% 2.27e+08 perf-stat.i.branch-misses
9.96 perf-stat.i.cache-miss-rate%
+200.3% 5.312e+08 perf-stat.i.cache-misses
+20.9% 5.355e+09 perf-stat.i.cache-references
-5.9% 6699288 perf-stat.i.context-switches
2.56 perf-stat.i.cpi
-63.3% 644.95 perf-stat.i.cycles-between-cache-misses
-9.9% 1.145e+11 perf-stat.i.instructions
0.40 perf-stat.i.ipc
-6.0% 52.29 perf-stat.i.metric.K/sec
4.64 perf-stat.overall.MPKI
1.07 perf-stat.overall.branch-miss-rate%
9.91 perf-stat.overall.cache-miss-rate%
2.55 perf-stat.overall.cpi
-66.9% 548.75 perf-stat.overall.cycles-between-cache-misses
0.39 perf-stat.overall.ipc
-10.0% 2.103e+10 perf-stat.ps.branch-instructions
-6.2% 2.26e+08 perf-stat.ps.branch-misses
+199.8% 5.29e+08 perf-stat.ps.cache-misses
+20.8% 5.336e+09 perf-stat.ps.cache-references
-6.0% 6672410 perf-stat.ps.context-switches
-10.0% 1.14e+11 perf-stat.ps.instructions
-10.0% 3.492e+13 perf-stat.total.instructions
64.24 perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
64.47 perf-profile.calltrace.cycles-pp.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner
68.03 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner.send_udp_stream
61.72 perf-profile.calltrace.cycles-pp.udp_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
69.94 perf-profile.calltrace.cycles-pp.send_omni_inner.send_udp_stream.main
69.97 perf-profile.calltrace.cycles-pp.send_udp_stream.main
68.64 perf-profile.calltrace.cycles-pp.sendto.send_omni_inner.send_udp_stream.main
67.87 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner.send_udp_stream.main
44.69 perf-profile.calltrace.cycles-pp.ip_make_skb.udp_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
14.03 perf-profile.calltrace.cycles-pp.udp_send_skb.udp_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
13.62 perf-profile.calltrace.cycles-pp.ip_send_skb.udp_send_skb.udp_sendmsg.__sys_sendto.__x64_sys_sendto
12.70 perf-profile.calltrace.cycles-pp.ip_finish_output2.ip_send_skb.udp_send_skb.udp_sendmsg.__sys_sendto
12.09 perf-profile.calltrace.cycles-pp.__dev_queue_xmit.ip_finish_output2.ip_send_skb.udp_send_skb.udp_sendmsg
10.16 perf-profile.calltrace.cycles-pp.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_send_skb.udp_send_skb
10.07 perf-profile.calltrace.cycles-pp.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_send_skb
9.94 perf-profile.calltrace.cycles-pp.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2
34.20 perf-profile.calltrace.cycles-pp.ip_generic_getfrag.__ip_append_data.ip_make_skb.udp_sendmsg.__sys_sendto
4.19 perf-profile.calltrace.cycles-pp.__ip_make_skb.ip_make_skb.udp_sendmsg.__sys_sendto.__x64_sys_sendto
9.10 perf-profile.calltrace.cycles-pp.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit
3.73 perf-profile.calltrace.cycles-pp.__ip_select_ident.__ip_make_skb.ip_make_skb.udp_sendmsg.__sys_sendto
8.82 perf-profile.calltrace.cycles-pp.__napi_poll.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip
8.74 perf-profile.calltrace.cycles-pp.process_backlog.__napi_poll.net_rx_action.__do_softirq.do_softirq
33.42 perf-profile.calltrace.cycles-pp._copy_from_iter.ip_generic_getfrag.__ip_append_data.ip_make_skb.udp_sendmsg
8.17 perf-profile.calltrace.cycles-pp.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action.__do_softirq
6.84 ? 2% perf-profile.calltrace.cycles-pp.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action
6.78 ? 2% perf-profile.calltrace.cycles-pp.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll
6.54 ? 2% perf-profile.calltrace.cycles-pp.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog
5.56 ? 2% perf-profile.calltrace.cycles-pp.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
5.47 ? 2% perf-profile.calltrace.cycles-pp.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
0.25 ?100% perf-profile.calltrace.cycles-pp.irqtime_account_irq.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit
1.22 perf-profile.calltrace.cycles-pp.dev_hard_start_xmit.__dev_queue_xmit.ip_finish_output2.ip_send_skb.udp_send_skb
1.08 perf-profile.calltrace.cycles-pp.loopback_xmit.dev_hard_start_xmit.__dev_queue_xmit.ip_finish_output2.ip_send_skb
1.31 perf-profile.calltrace.cycles-pp.kfree_skb_reason.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu
1.27 perf-profile.calltrace.cycles-pp.skb_release_data.kfree_skb_reason.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv
1.77 perf-profile.calltrace.cycles-pp.ip_route_output_flow.udp_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
1.38 perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_sync_key
1.26 perf-profile.calltrace.cycles-pp.ip_route_output_key_hash_rcu.ip_route_output_flow.udp_sendmsg.__sys_sendto.__x64_sys_sendto
1.12 perf-profile.calltrace.cycles-pp.fib_table_lookup.ip_route_output_key_hash_rcu.ip_route_output_flow.udp_sendmsg.__sys_sendto
1.63 perf-profile.calltrace.cycles-pp.sock_alloc_send_pskb.__ip_append_data.ip_make_skb.udp_sendmsg.__sys_sendto
0.62 perf-profile.calltrace.cycles-pp.__udp4_lib_lookup.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
1.26 perf-profile.calltrace.cycles-pp.alloc_skb_with_frags.sock_alloc_send_pskb.__ip_append_data.ip_make_skb.udp_sendmsg
0.59 perf-profile.calltrace.cycles-pp.__check_object_size.ip_generic_getfrag.__ip_append_data.ip_make_skb.udp_sendmsg
0.71 perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.kfree_skb_reason.udp_queue_rcv_one_skb.udp_unicast_rcv_skb
1.19 perf-profile.calltrace.cycles-pp.__alloc_skb.alloc_skb_with_frags.sock_alloc_send_pskb.__ip_append_data.ip_make_skb
0.57 perf-profile.calltrace.cycles-pp.move_addr_to_kernel.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.11 perf-profile.calltrace.cycles-pp.recvfrom
0.52 perf-profile.calltrace.cycles-pp.sockfd_lookup_light.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
3.13 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner
1.07 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni
2.41 perf-profile.calltrace.cycles-pp.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data.ip_make_skb.udp_sendmsg
2.46 perf-profile.calltrace.cycles-pp.sk_page_frag_refill.__ip_append_data.ip_make_skb.udp_sendmsg.__sys_sendto
1.95 perf-profile.calltrace.cycles-pp.alloc_pages_mpol.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data.ip_make_skb
0.89 perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.alloc_pages_mpol.skb_page_frag_refill
1.81 perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data
1.52 perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.alloc_pages_mpol.skb_page_frag_refill.sk_page_frag_refill
0.53 ? 4% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.schedule_timeout
0.55 perf-profile.calltrace.cycles-pp.free_unref_page_commit.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg
1.40 perf-profile.calltrace.cycles-pp.skb_release_data.__consume_stateless_skb.udp_recvmsg.inet_recvmsg.sock_recvmsg
1.40 perf-profile.calltrace.cycles-pp.__consume_stateless_skb.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
1.01 perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg.inet_recvmsg
16.40 perf-profile.calltrace.cycles-pp._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
17.18 perf-profile.calltrace.cycles-pp.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
17.16 perf-profile.calltrace.cycles-pp.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg.sock_recvmsg
24.22 perf-profile.calltrace.cycles-pp.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
24.42 perf-profile.calltrace.cycles-pp.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
24.14 perf-profile.calltrace.cycles-pp.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
25.36 perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom
24.81 perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni
26.02 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni.process_requests.spawn_child
25.99 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni.process_requests
26.33 perf-profile.calltrace.cycles-pp.recvfrom.recv_omni.process_requests.spawn_child.accept_connection
26.87 perf-profile.calltrace.cycles-pp.recv_omni.process_requests.spawn_child.accept_connection.accept_connections
26.88 perf-profile.calltrace.cycles-pp.accept_connection.accept_connections.main
26.88 perf-profile.calltrace.cycles-pp.accept_connections.main
26.88 perf-profile.calltrace.cycles-pp.process_requests.spawn_child.accept_connection.accept_connections.main
26.88 perf-profile.calltrace.cycles-pp.spawn_child.accept_connection.accept_connections.main
70.38 perf-profile.children.cycles-pp.send_udp_stream
70.64 perf-profile.children.cycles-pp.sendto
70.53 perf-profile.children.cycles-pp.send_omni_inner
64.66 perf-profile.children.cycles-pp.__sys_sendto
64.88 perf-profile.children.cycles-pp.__x64_sys_sendto
62.13 perf-profile.children.cycles-pp.udp_sendmsg
44.97 perf-profile.children.cycles-pp.ip_make_skb
14.17 perf-profile.children.cycles-pp.udp_send_skb
13.76 perf-profile.children.cycles-pp.ip_send_skb
12.82 perf-profile.children.cycles-pp.ip_finish_output2
12.23 perf-profile.children.cycles-pp.__dev_queue_xmit
10.29 perf-profile.children.cycles-pp.__local_bh_enable_ip
10.18 perf-profile.children.cycles-pp.do_softirq
10.08 perf-profile.children.cycles-pp.__do_softirq
34.47 perf-profile.children.cycles-pp.ip_generic_getfrag
4.25 perf-profile.children.cycles-pp.__ip_make_skb
9.19 perf-profile.children.cycles-pp.net_rx_action
3.77 perf-profile.children.cycles-pp.__ip_select_ident
8.90 perf-profile.children.cycles-pp.__napi_poll
8.84 perf-profile.children.cycles-pp.process_backlog
33.68 perf-profile.children.cycles-pp._copy_from_iter
40.38 perf-profile.children.cycles-pp.__ip_append_data
8.26 perf-profile.children.cycles-pp.__netif_receive_skb_one_core
6.91 ? 2% perf-profile.children.cycles-pp.ip_local_deliver_finish
6.86 ? 2% perf-profile.children.cycles-pp.ip_protocol_deliver_rcu
6.62 ? 2% perf-profile.children.cycles-pp.__udp4_lib_rcv
5.61 ? 2% perf-profile.children.cycles-pp.udp_unicast_rcv_skb
5.55 ? 2% perf-profile.children.cycles-pp.udp_queue_rcv_one_skb
1.25 perf-profile.children.cycles-pp.dev_hard_start_xmit
1.13 perf-profile.children.cycles-pp.loopback_xmit
1.33 perf-profile.children.cycles-pp.kfree_skb_reason
1.80 perf-profile.children.cycles-pp.ip_route_output_flow
1.40 perf-profile.children.cycles-pp.ttwu_do_activate
0.15 perf-profile.children.cycles-pp.wakeup_preempt
1.28 perf-profile.children.cycles-pp.ip_route_output_key_hash_rcu
0.06 ? 6% perf-profile.children.cycles-pp.check_preempt_wakeup_fair
1.14 perf-profile.children.cycles-pp.fib_table_lookup
0.64 perf-profile.children.cycles-pp.__udp4_lib_lookup
1.66 perf-profile.children.cycles-pp.sock_alloc_send_pskb
1.29 perf-profile.children.cycles-pp.alloc_skb_with_frags
1.22 perf-profile.children.cycles-pp.__alloc_skb
0.29 perf-profile.children.cycles-pp.sock_wfree
0.46 perf-profile.children.cycles-pp.__netif_rx
1.14 perf-profile.children.cycles-pp.__check_object_size
0.43 perf-profile.children.cycles-pp.netif_rx_internal
0.40 perf-profile.children.cycles-pp.enqueue_to_backlog
0.44 ? 2% perf-profile.children.cycles-pp.udp4_lib_lookup2
0.26 perf-profile.children.cycles-pp.pick_eevdf
0.59 perf-profile.children.cycles-pp.move_addr_to_kernel
0.53 perf-profile.children.cycles-pp.irqtime_account_irq
0.78 perf-profile.children.cycles-pp.kmem_cache_alloc_node
0.50 perf-profile.children.cycles-pp.sched_clock_cpu
0.30 perf-profile.children.cycles-pp.validate_xmit_skb
0.44 perf-profile.children.cycles-pp.sched_clock
0.36 perf-profile.children.cycles-pp._raw_spin_trylock
0.35 perf-profile.children.cycles-pp.reweight_entity
0.42 perf-profile.children.cycles-pp._copy_from_user
0.90 perf-profile.children.cycles-pp.update_load_avg
1.07 perf-profile.children.cycles-pp.switch_mm_irqs_off
0.32 perf-profile.children.cycles-pp._raw_spin_lock_irq
0.36 perf-profile.children.cycles-pp.kmalloc_reserve
0.39 perf-profile.children.cycles-pp.native_sched_clock
0.61 perf-profile.children.cycles-pp.kmem_cache_free
1.23 ? 2% perf-profile.children.cycles-pp.activate_task
1.20 ? 2% perf-profile.children.cycles-pp.enqueue_task_fair
0.45 perf-profile.children.cycles-pp.__virt_addr_valid
0.75 perf-profile.children.cycles-pp.check_heap_object
0.14 ? 2% perf-profile.children.cycles-pp.destroy_large_folio
0.35 perf-profile.children.cycles-pp.entry_SYSCALL_64
0.30 perf-profile.children.cycles-pp.__cond_resched
0.09 ? 4% perf-profile.children.cycles-pp.__mem_cgroup_uncharge
0.33 perf-profile.children.cycles-pp.__mkroute_output
0.27 perf-profile.children.cycles-pp.ip_output
0.53 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.21 ? 2% perf-profile.children.cycles-pp.ip_setup_cork
0.16 ? 3% perf-profile.children.cycles-pp.ipv4_pktinfo_prepare
0.24 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.16 perf-profile.children.cycles-pp.netif_skb_features
0.25 ? 2% perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.49 perf-profile.children.cycles-pp.__netif_receive_skb_core
0.24 ? 2% perf-profile.children.cycles-pp.__ip_local_out
0.15 ? 2% perf-profile.children.cycles-pp.dst_release
0.13 perf-profile.children.cycles-pp.update_curr_se
0.10 perf-profile.children.cycles-pp.rcu_all_qs
0.08 ? 6% perf-profile.children.cycles-pp.vruntime_eligible
0.11 ? 6% perf-profile.children.cycles-pp.security_sock_rcv_skb
0.27 perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
0.13 ? 2% perf-profile.children.cycles-pp.ip_send_check
0.18 ? 2% perf-profile.children.cycles-pp.siphash_3u32
0.19 perf-profile.children.cycles-pp.sk_filter_trim_cap
0.13 ? 3% perf-profile.children.cycles-pp.__folio_put
0.18 perf-profile.children.cycles-pp.udp4_csum_init
0.19 ? 2% perf-profile.children.cycles-pp.ipv4_mtu
0.23 perf-profile.children.cycles-pp.skb_set_owner_w
0.24 perf-profile.children.cycles-pp.rseq_update_cpu_node_id
0.13 ? 3% perf-profile.children.cycles-pp.__ip_finish_output
0.29 perf-profile.children.cycles-pp.__update_load_avg_se
0.13 ? 3% perf-profile.children.cycles-pp.avg_vruntime
0.07 ? 7% perf-profile.children.cycles-pp.skb_network_protocol
0.13 ? 2% perf-profile.children.cycles-pp.check_stack_object
0.12 perf-profile.children.cycles-pp.xfrm_lookup_route
0.13 perf-profile.children.cycles-pp.__put_user_8
0.08 perf-profile.children.cycles-pp.nf_hook_slow
0.08 perf-profile.children.cycles-pp.raw_v4_input
0.05 perf-profile.children.cycles-pp.validate_xmit_xfrm
0.10 perf-profile.children.cycles-pp.xfrm_lookup_with_ifid
0.12 perf-profile.children.cycles-pp.security_socket_recvmsg
0.08 ? 6% perf-profile.children.cycles-pp.demo_interval_tick
0.10 ? 4% perf-profile.children.cycles-pp.__build_skb_around
0.09 perf-profile.children.cycles-pp.should_failslab
0.08 ? 4% perf-profile.children.cycles-pp.skb_clone_tx_timestamp
0.39 perf-profile.children.cycles-pp.simple_copy_to_iter
0.09 ? 4% perf-profile.children.cycles-pp.task_work_run
0.09 perf-profile.children.cycles-pp.task_mm_cid_work
0.24 ? 5% perf-profile.children.cycles-pp.recv_data
0.73 ? 2% perf-profile.children.cycles-pp.ip_rcv
0.30 ? 2% perf-profile.children.cycles-pp.ip_rcv_core
0.11 ? 3% perf-profile.children.cycles-pp.__free_one_page
0.91 perf-profile.children.cycles-pp.switch_fpu_return
0.78 ? 2% perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
0.36 ? 2% perf-profile.children.cycles-pp.update_process_times
0.40 ? 2% perf-profile.children.cycles-pp.tick_nohz_handler
4.88 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.44 ? 2% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.56 ? 5% perf-profile.children.cycles-pp.hrtimer_interrupt
0.59 ? 5% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.64 ? 5% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.75 ? 4% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
2.50 perf-profile.children.cycles-pp.sk_page_frag_refill
2.69 perf-profile.children.cycles-pp.skb_release_data
2.44 perf-profile.children.cycles-pp.skb_page_frag_refill
1.99 perf-profile.children.cycles-pp.alloc_pages_mpol
0.92 ? 2% perf-profile.children.cycles-pp.rmqueue
1.85 perf-profile.children.cycles-pp.__alloc_pages
1.56 perf-profile.children.cycles-pp.get_page_from_freelist
0.41 ? 3% perf-profile.children.cycles-pp.rmqueue_bulk
0.52 ? 2% perf-profile.children.cycles-pp.__rmqueue_pcplist
1.75 perf-profile.children.cycles-pp.free_unref_page
0.47 ? 2% perf-profile.children.cycles-pp.free_pcppages_bulk
0.65 perf-profile.children.cycles-pp.free_unref_page_commit
1.41 perf-profile.children.cycles-pp.__consume_stateless_skb
1.10 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.66 ? 3% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
16.43 perf-profile.children.cycles-pp._copy_to_iter
17.17 perf-profile.children.cycles-pp.__skb_datagram_iter
17.19 perf-profile.children.cycles-pp.skb_copy_datagram_iter
24.24 perf-profile.children.cycles-pp.inet_recvmsg
24.43 perf-profile.children.cycles-pp.sock_recvmsg
24.16 perf-profile.children.cycles-pp.udp_recvmsg
25.43 perf-profile.children.cycles-pp.__x64_sys_recvfrom
25.37 perf-profile.children.cycles-pp.__sys_recvfrom
27.55 perf-profile.children.cycles-pp.recvfrom
26.88 perf-profile.children.cycles-pp.accept_connection
26.88 perf-profile.children.cycles-pp.accept_connections
26.88 perf-profile.children.cycles-pp.process_requests
26.88 perf-profile.children.cycles-pp.spawn_child
26.88 perf-profile.children.cycles-pp.recv_omni
33.41 perf-profile.self.cycles-pp._copy_from_iter
3.56 ? 2% perf-profile.self.cycles-pp.__ip_select_ident
0.77 perf-profile.self.cycles-pp.__sys_sendto
1.04 perf-profile.self.cycles-pp.udp_sendmsg
0.84 ? 2% perf-profile.self.cycles-pp.fib_table_lookup
0.28 perf-profile.self.cycles-pp.sock_wfree
1.72 perf-profile.self.cycles-pp.__ip_append_data
0.26 perf-profile.self.cycles-pp.loopback_xmit
0.58 ? 2% perf-profile.self.cycles-pp.kmem_cache_alloc_node
0.26 perf-profile.self.cycles-pp.udp4_lib_lookup2
0.41 perf-profile.self.cycles-pp.do_syscall_64
0.54 perf-profile.self.cycles-pp.ip_finish_output2
0.34 perf-profile.self.cycles-pp._raw_spin_trylock
0.41 perf-profile.self.cycles-pp._copy_from_user
0.33 perf-profile.self.cycles-pp.udp_send_skb
0.30 ? 2% perf-profile.self.cycles-pp.__alloc_skb
0.36 perf-profile.self.cycles-pp.__dev_queue_xmit
0.31 perf-profile.self.cycles-pp._raw_spin_lock_irq
0.38 perf-profile.self.cycles-pp.native_sched_clock
1.06 perf-profile.self.cycles-pp.switch_mm_irqs_off
0.43 perf-profile.self.cycles-pp.__virt_addr_valid
0.29 perf-profile.self.cycles-pp.__mkroute_output
0.20 ? 2% perf-profile.self.cycles-pp.pick_eevdf
0.20 ? 2% perf-profile.self.cycles-pp.__udp4_lib_lookup
0.43 perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.18 ? 2% perf-profile.self.cycles-pp.ip_route_output_flow
0.03 ? 70% perf-profile.self.cycles-pp.check_preempt_wakeup_fair
0.23 ? 2% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.26 perf-profile.self.cycles-pp.net_rx_action
0.09 ? 4% perf-profile.self.cycles-pp.__mem_cgroup_uncharge
0.26 perf-profile.self.cycles-pp.__check_object_size
0.48 perf-profile.self.cycles-pp.__netif_receive_skb_core
0.26 ? 2% perf-profile.self.cycles-pp.process_backlog
0.19 perf-profile.self.cycles-pp.ip_output
0.33 perf-profile.self.cycles-pp.__ip_make_skb
0.36 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.24 ? 2% perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.45 perf-profile.self.cycles-pp.kmem_cache_free
0.15 ? 3% perf-profile.self.cycles-pp.reweight_entity
0.20 perf-profile.self.cycles-pp.__udp4_lib_rcv
0.12 perf-profile.self.cycles-pp.validate_xmit_skb
0.13 perf-profile.self.cycles-pp.dst_release
0.34 perf-profile.self.cycles-pp.__do_softirq
0.08 perf-profile.self.cycles-pp.rcu_all_qs
0.06 ? 6% perf-profile.self.cycles-pp.vruntime_eligible
0.23 ? 4% perf-profile.self.cycles-pp.__udp_enqueue_schedule_skb
0.14 ? 3% perf-profile.self.cycles-pp.enqueue_to_backlog
0.17 perf-profile.self.cycles-pp.siphash_3u32
0.24 perf-profile.self.cycles-pp.rseq_update_cpu_node_id
0.12 ? 3% perf-profile.self.cycles-pp.ip_send_check
0.12 ? 3% perf-profile.self.cycles-pp.ip_setup_cork
0.26 perf-profile.self.cycles-pp.__update_load_avg_se
0.14 ? 3% perf-profile.self.cycles-pp.ip_route_output_key_hash_rcu
0.17 ? 2% perf-profile.self.cycles-pp.irqtime_account_irq
0.17 ? 2% perf-profile.self.cycles-pp.udp4_csum_init
0.16 ? 3% perf-profile.self.cycles-pp.ip_generic_getfrag
0.21 ? 3% perf-profile.self.cycles-pp.ip_send_skb
0.12 ? 4% perf-profile.self.cycles-pp.update_curr_se
0.10 ? 4% perf-profile.self.cycles-pp.__netif_receive_skb_one_core
0.06 ? 7% perf-profile.self.cycles-pp.__sk_mem_raise_allocated
0.09 ? 5% perf-profile.self.cycles-pp.security_sock_rcv_skb
0.05 perf-profile.self.cycles-pp.ip_local_deliver_finish
0.19 ? 2% perf-profile.self.cycles-pp.__cond_resched
0.18 ? 2% perf-profile.self.cycles-pp.ipv4_mtu
0.28 ? 2% perf-profile.self.cycles-pp.__alloc_pages
0.13 ? 2% perf-profile.self.cycles-pp.do_softirq
0.06 perf-profile.self.cycles-pp.skb_network_protocol
0.12 perf-profile.self.cycles-pp.__wrgsbase_inactive
0.12 perf-profile.self.cycles-pp.entry_SYSCALL_64
0.13 perf-profile.self.cycles-pp.ip_make_skb
0.12 perf-profile.self.cycles-pp.move_addr_to_kernel
0.05 perf-profile.self.cycles-pp.__ip_finish_output
0.15 ? 2% perf-profile.self.cycles-pp.enqueue_task_fair
0.14 ? 5% perf-profile.self.cycles-pp.recvfrom
0.09 perf-profile.self.cycles-pp.__build_skb_around
0.30 perf-profile.self.cycles-pp.udp_recvmsg
0.08 perf-profile.self.cycles-pp.should_failslab
0.07 ? 8% perf-profile.self.cycles-pp.demo_interval_tick
0.22 ? 2% perf-profile.self.cycles-pp.__skb_datagram_iter
0.28 perf-profile.self.cycles-pp.prepare_task_switch
0.08 ? 5% perf-profile.self.cycles-pp.task_mm_cid_work
0.24 ? 4% perf-profile.self.cycles-pp.recv_omni
0.05 perf-profile.self.cycles-pp.update_rq_clock
0.30 ? 2% perf-profile.self.cycles-pp.ip_rcv_core
0.06 ? 6% perf-profile.self.cycles-pp.skb_clone_tx_timestamp
0.10 perf-profile.self.cycles-pp.__free_one_page
0.08 ? 6% perf-profile.self.cycles-pp.rmqueue_bulk
0.78 perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
0.18 ? 3% perf-profile.self.cycles-pp.select_task_rq_fair
0.66 ? 3% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
16.35 perf-profile.self.cycles-pp._copy_to_iter
*************************************************************************
threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
===============================================================
disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
ext4/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/fstat/stress-ng/60s
Remove shift of thermal clock")
Reschedule the cfs_rq when current is ineligible")
e2bbd1c498980c5cb68f9973f41
---------------------------
%stddev
\
7.30 ? 3% iostat.cpu.user
7.50 ? 3% mpstat.cpu.all.usr%
+22.2% 890173 vmstat.system.cs
-81.4% 268871 ? 17% meminfo.Active
-81.8% 261809 ? 18% meminfo.Active(anon)
+13.2% 7738594 meminfo.Inactive
+13.3% 7726478 meminfo.Inactive(anon)
8342 ? 61% numa-meminfo.node0.Active
6931 ? 78% numa-meminfo.node0.Active(anon)
-79.9% 261111 ? 16% numa-meminfo.node1.Active
-80.2% 255458 ? 16% numa-meminfo.node1.Active(anon)
1725 ? 77% numa-vmstat.node0.nr_active_anon
1725 ? 77% numa-vmstat.node0.nr_zone_active_anon
63052 ? 16% numa-vmstat.node1.nr_active_anon
63052 ? 16% numa-vmstat.node1.nr_zone_active_anon
-3.9% 4786411 stress-ng.fstat.ops
-3.9% 79773 stress-ng.fstat.ops_per_sec
+88.8% 23722089 stress-ng.time.involuntary_context_switches
4855 stress-ng.time.percent_of_cpu_this_job_got
2678 stress-ng.time.system_time
+30.7% 239.27 ? 2% stress-ng.time.user_time
-1.3% 7637067 stress-ng.time.voluntary_context_switches
+10.6% 1118668 sched_debug.cfs_rq:/.avg_vruntime.min
78376 ? 3% sched_debug.cfs_rq:/.avg_vruntime.stddev
+10.6% 1118668 sched_debug.cfs_rq:/.min_vruntime.min
78376 ? 3% sched_debug.cfs_rq:/.min_vruntime.stddev
-16.6% 649990 ? 14% sched_debug.cpu.curr->pid.avg
-27.2% 1023836 sched_debug.cpu.curr->pid.max
-30.9% 480606 ? 6% sched_debug.cpu.curr->pid.stddev
+22.0% 435017 sched_debug.cpu.nr_switches.avg
+21.5% 455819 sched_debug.cpu.nr_switches.max
+21.3% 290992 sched_debug.cpu.nr_switches.min
21908 ? 3% sched_debug.cpu.nr_switches.stddev
65651 ? 18% proc-vmstat.nr_active_anon
-2.5% 2589142 proc-vmstat.nr_file_pages
+13.3% 1933352 proc-vmstat.nr_inactive_anon
+1.3% 20285 proc-vmstat.nr_kernel_stack
-3.6% 1795497 proc-vmstat.nr_shmem
65651 ? 18% proc-vmstat.nr_zone_active_anon
+13.3% 1933352 proc-vmstat.nr_zone_inactive_anon
-6.2% 53230553 proc-vmstat.numa_hit
-6.2% 53170292 proc-vmstat.numa_local
28742 ? 4% proc-vmstat.pgactivate
-7.0% 70719008 proc-vmstat.pgalloc_normal
-7.1% 67838944 proc-vmstat.pgfree
1.54 perf-stat.i.MPKI
+17.1% 2.586e+10 perf-stat.i.branch-instructions
0.27 perf-stat.i.branch-miss-rate%
+4.7% 63912456 perf-stat.i.branch-misses
25.30 perf-stat.i.cache-miss-rate%
-2.8% 2.149e+08 perf-stat.i.cache-misses
-1.7% 8.499e+08 perf-stat.i.cache-references
+21.4% 928892 perf-stat.i.context-switches
1.59 perf-stat.i.cpi
-15.6% 142157 perf-stat.i.cpu-migrations
1034 perf-stat.i.cycles-between-cache-misses
+18.5% 1.394e+11 perf-stat.i.instructions
0.63 perf-stat.i.ipc
16.78 perf-stat.i.metric.K/sec
1.54 perf-stat.overall.MPKI
0.25 perf-stat.overall.branch-miss-rate%
25.30 perf-stat.overall.cache-miss-rate%
1.59 perf-stat.overall.cpi
1033 perf-stat.overall.cycles-between-cache-misses
0.63 perf-stat.overall.ipc
+17.5% 850512 ? 7% perf-stat.ps.context-switches
-18.3% 130155 ? 7% perf-stat.ps.cpu-migrations
+21.1% 6.007e+12 ? 2% perf-stat.total.instructions
48.80 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
48.78 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
26.35 perf-profile.calltrace.cycles-pp.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
26.34 perf-profile.calltrace.cycles-pp.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
22.06 perf-profile.calltrace.cycles-pp.exit_notify.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
21.15 perf-profile.calltrace.cycles-pp.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe
17.18 perf-profile.calltrace.cycles-pp.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe
21.10 perf-profile.calltrace.cycles-pp.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe
10.07 perf-profile.calltrace.cycles-pp.queued_write_lock_slowpath.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
9.65 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath.queued_write_lock_slowpath.copy_process.kernel_clone.__do_sys_clone3
10.11 perf-profile.calltrace.cycles-pp.queued_write_lock_slowpath.exit_notify.do_exit.__x64_sys_exit.do_syscall_64
9.68 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath.queued_write_lock_slowpath.exit_notify.do_exit.__x64_sys_exit
11.59 perf-profile.calltrace.cycles-pp.release_task.exit_notify.do_exit.__x64_sys_exit.do_syscall_64
9.59 perf-profile.calltrace.cycles-pp.queued_write_lock_slowpath.release_task.exit_notify.do_exit.__x64_sys_exit
9.12 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath.queued_write_lock_slowpath.release_task.exit_notify.do_exit
0.25 ?100% perf-profile.calltrace.cycles-pp.remove_vm_area.vfree.delayed_vfree_work.process_one_work.worker_thread
1.09 perf-profile.calltrace.cycles-pp.__schedule.do_task_dead.do_exit.__x64_sys_exit.do_syscall_64
2.54 perf-profile.calltrace.cycles-pp.ret_from_fork_asm
2.32 perf-profile.calltrace.cycles-pp.ret_from_fork.ret_from_fork_asm
1.10 perf-profile.calltrace.cycles-pp.do_task_dead.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.19 perf-profile.calltrace.cycles-pp.kthread.ret_from_fork.ret_from_fork_asm
0.66 perf-profile.calltrace.cycles-pp.__schedule.schedule.futex_wait_queue.__futex_wait.futex_wait
0.78 perf-profile.calltrace.cycles-pp.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.78 perf-profile.calltrace.cycles-pp.futex_wait.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.76 perf-profile.calltrace.cycles-pp.__futex_wait.futex_wait.do_futex.__x64_sys_futex.do_syscall_64
0.67 perf-profile.calltrace.cycles-pp.futex_wait_queue.__futex_wait.futex_wait.do_futex.__x64_sys_futex
0.66 perf-profile.calltrace.cycles-pp.schedule.futex_wait_queue.__futex_wait.futex_wait.do_futex
0.79 perf-profile.calltrace.cycles-pp.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.99 perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
1.45 perf-profile.calltrace.cycles-pp.alloc_pid.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
1.12 perf-profile.calltrace.cycles-pp.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.19 perf-profile.calltrace.cycles-pp.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.81 perf-profile.calltrace.cycles-pp.delayed_vfree_work.process_one_work.worker_thread.kthread.ret_from_fork
0.76 perf-profile.calltrace.cycles-pp.vfree.delayed_vfree_work.process_one_work.worker_thread.kthread
0.90 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.alloc_pid.copy_process.kernel_clone
1.12 perf-profile.calltrace.cycles-pp.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
0.90 perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
1.08 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.alloc_pid.copy_process.kernel_clone.__do_sys_clone3
1.11 perf-profile.calltrace.cycles-pp.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread
1.10 perf-profile.calltrace.cycles-pp.rcu_do_batch.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn
0.87 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__madvise
0.86 perf-profile.calltrace.cycles-pp.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
0.86 perf-profile.calltrace.cycles-pp.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
0.79 perf-profile.calltrace.cycles-pp.__vmalloc_area_node.__vmalloc_node_range.alloc_thread_stack_node.dup_task_struct.copy_process
0.87 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
0.65 perf-profile.calltrace.cycles-pp.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.64 perf-profile.calltrace.cycles-pp.__alloc_pages_bulk.__vmalloc_area_node.__vmalloc_node_range.alloc_thread_stack_node.dup_task_struct
0.89 perf-profile.calltrace.cycles-pp.__madvise
2.13 perf-profile.calltrace.cycles-pp.update_sg_wakeup_stats.sched_balance_find_dst_group.sched_balance_find_dst_cpu.select_task_rq_fair.wake_up_new_task
0.64 perf-profile.calltrace.cycles-pp.do_futex.mm_release.exit_mm.do_exit.__x64_sys_exit
0.64 perf-profile.calltrace.cycles-pp.futex_wake.do_futex.mm_release.exit_mm.do_exit
0.67 perf-profile.calltrace.cycles-pp.mm_release.exit_mm.do_exit.__x64_sys_exit.do_syscall_64
0.82 perf-profile.calltrace.cycles-pp.exit_mm.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.70 perf-profile.calltrace.cycles-pp.kmem_cache_free.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe.statx
0.72 perf-profile.calltrace.cycles-pp.cp_statx.do_statx.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.72 perf-profile.calltrace.cycles-pp.security_inode_getattr.vfs_statx.vfs_fstatat.__do_sys_newfstatat.do_syscall_64
0.75 perf-profile.calltrace.cycles-pp.check_heap_object.__check_object_size.strncpy_from_user.getname_flags.vfs_fstatat
0.72 perf-profile.calltrace.cycles-pp.schedule.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
0.86 perf-profile.calltrace.cycles-pp.shim_statx
0.89 perf-profile.calltrace.cycles-pp.complete_walk.path_lookupat.filename_lookup.vfs_statx.do_statx
0.95 perf-profile.calltrace.cycles-pp.check_heap_object.__check_object_size.strncpy_from_user.getname_flags.__x64_sys_statx
0.99 perf-profile.calltrace.cycles-pp.kmem_cache_alloc.getname_flags.vfs_fstatat.__do_sys_newfstatat.do_syscall_64
0.90 perf-profile.calltrace.cycles-pp.dput.path_put.vfs_statx.vfs_fstatat.__do_sys_newfstatat
0.72 perf-profile.calltrace.cycles-pp.lockref_put_return.dput.path_put.vfs_statx.vfs_fstatat
0.96 perf-profile.calltrace.cycles-pp.path_put.vfs_statx.vfs_fstatat.__do_sys_newfstatat.do_syscall_64
0.96 perf-profile.calltrace.cycles-pp.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
0.71 perf-profile.calltrace.cycles-pp.__schedule.schedule.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.13 perf-profile.calltrace.cycles-pp.link_path_walk.path_lookupat.filename_lookup.vfs_statx.do_statx
1.08 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
1.08 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__sched_yield
1.28 perf-profile.calltrace.cycles-pp.kmem_cache_alloc.getname_flags.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.40 perf-profile.calltrace.cycles-pp.cp_new_stat.__do_sys_newfstatat.do_syscall_64.entry_SYSCALL_64_after_hwframe.fstatat64
1.23 perf-profile.calltrace.cycles-pp.__sched_yield
1.48 perf-profile.calltrace.cycles-pp.__check_object_size.strncpy_from_user.getname_flags.vfs_fstatat.__do_sys_newfstatat
1.89 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.statx
1.90 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.fstatat64
1.88 perf-profile.calltrace.cycles-pp.__check_object_size.strncpy_from_user.getname_flags.__x64_sys_statx.do_syscall_64
0.54 perf-profile.calltrace.cycles-pp._copy_to_user.cp_new_stat.__do_sys_newfstatat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.54 ? 2% perf-profile.calltrace.cycles-pp.lookup_fast.walk_component.path_lookupat.filename_lookup.vfs_statx
1.93 perf-profile.calltrace.cycles-pp.complete_walk.path_lookupat.filename_lookup.vfs_statx.vfs_fstatat
0.55 perf-profile.calltrace.cycles-pp.path_init.path_lookupat.filename_lookup.vfs_statx.do_statx
0.59 perf-profile.calltrace.cycles-pp.common_perm_cond.security_inode_getattr.vfs_statx.vfs_fstatat.__do_sys_newfstatat
0.60 ? 2% perf-profile.calltrace.cycles-pp.path_init.path_lookupat.filename_lookup.vfs_statx.vfs_fstatat
0.65 ? 2% perf-profile.calltrace.cycles-pp.walk_component.path_lookupat.filename_lookup.vfs_statx.vfs_fstatat
2.50 perf-profile.calltrace.cycles-pp.link_path_walk.path_lookupat.filename_lookup.vfs_statx.vfs_fstatat
0.67 ? 2% perf-profile.calltrace.cycles-pp.inode_permission.link_path_walk.path_lookupat.filename_lookup.vfs_statx
2.37 perf-profile.calltrace.cycles-pp.__legitimize_path.try_to_unlazy.complete_walk.path_lookupat.filename_lookup
2.81 perf-profile.calltrace.cycles-pp.strncpy_from_user.getname_flags.vfs_fstatat.__do_sys_newfstatat.do_syscall_64
2.65 perf-profile.calltrace.cycles-pp.try_to_unlazy.complete_walk.path_lookupat.filename_lookup.vfs_statx
3.14 perf-profile.calltrace.cycles-pp.path_lookupat.filename_lookup.vfs_statx.do_statx.__x64_sys_statx
3.54 perf-profile.calltrace.cycles-pp.strncpy_from_user.getname_flags.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.91 perf-profile.calltrace.cycles-pp.lockref_get_not_dead.__legitimize_path.try_to_unlazy.complete_walk.path_lookupat
3.92 perf-profile.calltrace.cycles-pp.filename_lookup.vfs_statx.do_statx.__x64_sys_statx.do_syscall_64
4.72 perf-profile.calltrace.cycles-pp.getname_flags.vfs_fstatat.__do_sys_newfstatat.do_syscall_64.entry_SYSCALL_64_after_hwframe
5.54 perf-profile.calltrace.cycles-pp.getname_flags.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe.statx
5.52 perf-profile.calltrace.cycles-pp.vfs_statx.do_statx.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.40 perf-profile.calltrace.cycles-pp.path_lookupat.filename_lookup.vfs_statx.vfs_fstatat.__do_sys_newfstatat
7.15 perf-profile.calltrace.cycles-pp.do_statx.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe.statx
7.44 perf-profile.calltrace.cycles-pp.filename_lookup.vfs_statx.vfs_fstatat.__do_sys_newfstatat.do_syscall_64
10.49 perf-profile.calltrace.cycles-pp.vfs_statx.vfs_fstatat.__do_sys_newfstatat.do_syscall_64.entry_SYSCALL_64_after_hwframe
14.36 perf-profile.calltrace.cycles-pp.__x64_sys_statx.do_syscall_64.entry_SYSCALL_64_after_hwframe.statx
15.41 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.statx
15.77 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.statx
16.46 perf-profile.calltrace.cycles-pp.vfs_fstatat.__do_sys_newfstatat.do_syscall_64.entry_SYSCALL_64_after_hwframe.fstatat64
18.84 perf-profile.calltrace.cycles-pp.statx
18.98 perf-profile.calltrace.cycles-pp.__do_sys_newfstatat.do_syscall_64.entry_SYSCALL_64_after_hwframe.fstatat64
20.04 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.fstatat64
20.46 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.fstatat64
23.42 perf-profile.calltrace.cycles-pp.fstatat64
32.30 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
29.78 perf-profile.children.cycles-pp.queued_write_lock_slowpath
26.35 perf-profile.children.cycles-pp.__x64_sys_exit
26.36 perf-profile.children.cycles-pp.do_exit
22.07 perf-profile.children.cycles-pp.exit_notify
21.10 perf-profile.children.cycles-pp.kernel_clone
21.15 perf-profile.children.cycles-pp.__do_sys_clone3
17.21 perf-profile.children.cycles-pp.copy_process
11.60 perf-profile.children.cycles-pp.release_task
87.81 perf-profile.children.cycles-pp.do_syscall_64
88.48 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
0.58 perf-profile.children.cycles-pp.sched_balance_newidle
0.50 ? 2% perf-profile.children.cycles-pp.sched_balance_rq
1.10 perf-profile.children.cycles-pp.do_task_dead
2.32 perf-profile.children.cycles-pp.ret_from_fork
2.54 perf-profile.children.cycles-pp.ret_from_fork_asm
1.60 perf-profile.children.cycles-pp.__do_softirq
2.19 perf-profile.children.cycles-pp.kthread
1.49 perf-profile.children.cycles-pp.rcu_core
1.47 perf-profile.children.cycles-pp.rcu_do_batch
0.83 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.67 perf-profile.children.cycles-pp.futex_wait_queue
0.78 perf-profile.children.cycles-pp.futex_wait
0.79 perf-profile.children.cycles-pp.__x64_sys_futex
0.48 perf-profile.children.cycles-pp.irq_exit_rcu
0.77 perf-profile.children.cycles-pp.__futex_wait
0.99 perf-profile.children.cycles-pp.worker_thread
0.16 ? 2% perf-profile.children.cycles-pp.detach_tasks
0.83 perf-profile.children.cycles-pp.activate_task
1.46 perf-profile.children.cycles-pp.alloc_pid
1.12 perf-profile.children.cycles-pp.run_ksoftirqd
1.19 perf-profile.children.cycles-pp.smpboot_thread_fn
0.51 ? 2% perf-profile.children.cycles-pp.perf_session__process_user_event
0.81 perf-profile.children.cycles-pp.delayed_vfree_work
0.50 ? 2% perf-profile.children.cycles-pp.perf_session__deliver_event
0.90 perf-profile.children.cycles-pp.process_one_work
0.51 ? 2% perf-profile.children.cycles-pp.__ordered_events__flush
0.76 perf-profile.children.cycles-pp.vfree
0.15 perf-profile.children.cycles-pp.free_unref_page_commit
0.20 perf-profile.children.cycles-pp.free_unref_page
0.46 ? 2% perf-profile.children.cycles-pp.read
0.45 perf-profile.children.cycles-pp.ksys_read
0.85 perf-profile.children.cycles-pp.enqueue_task_fair
0.44 ? 2% perf-profile.children.cycles-pp.seq_read
0.44 ? 2% perf-profile.children.cycles-pp.seq_read_iter
0.41 ? 2% perf-profile.children.cycles-pp.proc_pid_status
0.14 ? 3% perf-profile.children.cycles-pp.free_pcppages_bulk
0.44 ? 3% perf-profile.children.cycles-pp.machine__process_fork_event
0.44 perf-profile.children.cycles-pp.vfs_read
0.91 perf-profile.children.cycles-pp.dequeue_task_fair
1.42 perf-profile.children.cycles-pp.do_futex
0.16 ? 2% perf-profile.children.cycles-pp.update_sd_lb_stats
0.44 ? 2% perf-profile.children.cycles-pp.____machine__findnew_thread
0.41 perf-profile.children.cycles-pp.proc_single_show
0.74 perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.17 ? 2% perf-profile.children.cycles-pp.sched_balance_find_src_group
0.53 perf-profile.children.cycles-pp.dequeue_entity
0.16 ? 4% perf-profile.children.cycles-pp.__put_partials
0.48 perf-profile.children.cycles-pp.enqueue_entity
0.50 perf-profile.children.cycles-pp.remove_vm_area
0.14 ? 2% perf-profile.children.cycles-pp.update_sg_lb_stats
0.90 perf-profile.children.cycles-pp.__madvise
0.86 perf-profile.children.cycles-pp.__x64_sys_madvise
0.79 perf-profile.children.cycles-pp.__vmalloc_area_node
0.86 perf-profile.children.cycles-pp.do_madvise
0.65 perf-profile.children.cycles-pp.madvise_vma_behavior
0.25 perf-profile.children.cycles-pp.__slab_free
0.64 perf-profile.children.cycles-pp.__alloc_pages_bulk
0.07 ? 5% perf-profile.children.cycles-pp.flush_tlb_mm_range
0.61 perf-profile.children.cycles-pp.zap_page_range_single
0.50 perf-profile.children.cycles-pp.perf_event_task_output
0.07 ? 5% perf-profile.children.cycles-pp.on_each_cpu_cond_mask
0.07 ? 5% perf-profile.children.cycles-pp.smp_call_function_many_cond
0.09 ? 6% perf-profile.children.cycles-pp.tlb_finish_mmu
0.78 perf-profile.children.cycles-pp.__exit_signal
0.08 ? 6% perf-profile.children.cycles-pp.schedule_tail
0.55 perf-profile.children.cycles-pp.perf_iterate_sb
0.13 perf-profile.children.cycles-pp.__task_pid_nr_ns
0.46 perf-profile.children.cycles-pp.clear_page_erms
0.06 perf-profile.children.cycles-pp.__free_pages
0.06 perf-profile.children.cycles-pp.bitmap_string
0.06 perf-profile.children.cycles-pp.mem_cgroup_handle_over_high
0.08 perf-profile.children.cycles-pp.rseq_get_rseq_cs
0.07 ? 5% perf-profile.children.cycles-pp.cpuacct_charge
0.07 ? 5% perf-profile.children.cycles-pp.os_xsave
0.19 perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
0.06 ? 7% perf-profile.children.cycles-pp.select_idle_sibling
0.08 ? 5% perf-profile.children.cycles-pp.shmem_is_huge
0.11 ? 3% perf-profile.children.cycles-pp.rseq_ip_fixup
0.13 ? 2% perf-profile.children.cycles-pp.select_task_rq
0.11 ? 4% perf-profile.children.cycles-pp.__switch_to
0.12 perf-profile.children.cycles-pp.___perf_sw_event
0.07 ? 5% perf-profile.children.cycles-pp.make_vfsgid
0.09 perf-profile.children.cycles-pp.__enqueue_entity
0.15 ? 4% perf-profile.children.cycles-pp.stress_fstat_thread
0.23 perf-profile.children.cycles-pp.set_next_entity
0.07 perf-profile.children.cycles-pp.proc_pid_get_link
0.09 perf-profile.children.cycles-pp.pick_eevdf
0.51 perf-profile.children.cycles-pp.try_to_wake_up
0.43 perf-profile.children.cycles-pp.wake_up_q
0.25 perf-profile.children.cycles-pp.switch_fpu_return
0.15 perf-profile.children.cycles-pp.__rseq_handle_notify_resume
0.21 ? 3% perf-profile.children.cycles-pp.tick_nohz_handler
0.09 ? 5% perf-profile.children.cycles-pp.__x64_sys_newfstatat
0.20 ? 2% perf-profile.children.cycles-pp.update_process_times
0.06 perf-profile.children.cycles-pp.statx@plt
0.22 ? 3% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.36 ? 3% perf-profile.children.cycles-pp.pick_link
0.12 perf-profile.children.cycles-pp.should_failslab
0.08 perf-profile.children.cycles-pp.asm_sysvec_reschedule_ipi
0.10 perf-profile.children.cycles-pp.mntput
0.64 perf-profile.children.cycles-pp.futex_wake
0.27 ? 3% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.26 ? 3% perf-profile.children.cycles-pp.hrtimer_interrupt
0.13 ? 2% perf-profile.children.cycles-pp.legitimize_links
0.67 perf-profile.children.cycles-pp.mm_release
0.19 ? 2% perf-profile.children.cycles-pp.prepare_task_switch
0.10 ? 4% perf-profile.children.cycles-pp.yield_task_fair
0.82 perf-profile.children.cycles-pp.exit_mm
0.16 ? 2% perf-profile.children.cycles-pp.mntput_no_expire
0.19 perf-profile.children.cycles-pp.__get_user_1
0.22 ? 2% perf-profile.children.cycles-pp.switch_mm_irqs_off
0.18 ? 3% perf-profile.children.cycles-pp.vfs_fstat
0.25 ? 2% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.06 ? 17% perf-profile.children.cycles-pp.try_to_unlazy_next
0.05 perf-profile.children.cycles-pp.apparmor_inode_getattr
0.05 perf-profile.children.cycles-pp.tid_fd_revalidate
0.21 perf-profile.children.cycles-pp.security_inode_permission
0.22 ? 2% perf-profile.children.cycles-pp.from_kgid_munged
0.21 ? 2% perf-profile.children.cycles-pp.is_vmalloc_addr
0.21 perf-profile.children.cycles-pp.amd_clear_divider
0.22 perf-profile.children.cycles-pp.from_kuid_munged
0.73 perf-profile.children.cycles-pp.update_curr
0.17 ? 2% perf-profile.children.cycles-pp.put_prev_entity
0.31 ? 5% perf-profile.children.cycles-pp.__fdget_raw
0.30 perf-profile.children.cycles-pp.rcu_all_qs
0.26 ? 10% perf-profile.children.cycles-pp.__lookup_mnt
0.23 perf-profile.children.cycles-pp.do_sched_yield
0.34 perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.32 ? 5% perf-profile.children.cycles-pp.terminate_walk
0.33 perf-profile.children.cycles-pp.make_vfsuid
0.38 perf-profile.children.cycles-pp.map_id_up
1.52 ? 2% perf-profile.children.cycles-pp._raw_spin_lock
0.39 perf-profile.children.cycles-pp.check_stack_object
0.44 perf-profile.children.cycles-pp.generic_fillattr
0.46 ? 2% perf-profile.children.cycles-pp.__legitimize_mnt
0.46 ? 6% perf-profile.children.cycles-pp.set_root
0.59 perf-profile.children.cycles-pp.vfs_getattr_nosec
0.64 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.70 ? 2% perf-profile.children.cycles-pp.generic_permission
0.76 perf-profile.children.cycles-pp.shmem_getattr
0.74 perf-profile.children.cycles-pp.cp_statx
0.64 ? 4% perf-profile.children.cycles-pp.nd_jump_root
0.82 perf-profile.children.cycles-pp.__check_heap_object
0.91 perf-profile.children.cycles-pp._copy_to_user
0.88 perf-profile.children.cycles-pp.shim_statx
0.84 perf-profile.children.cycles-pp.__cond_resched
0.70 ? 25% perf-profile.children.cycles-pp.stress_fstat_helper
0.91 perf-profile.children.cycles-pp.common_perm_cond
0.90 ? 2% perf-profile.children.cycles-pp.__virt_addr_valid
0.91 ? 4% perf-profile.children.cycles-pp.__d_lookup_rcu
1.04 perf-profile.children.cycles-pp.inode_permission
3.01 perf-profile.children.cycles-pp.__schedule
1.11 perf-profile.children.cycles-pp.putname
0.96 perf-profile.children.cycles-pp.__x64_sys_sched_yield
1.36 ? 2% perf-profile.children.cycles-pp.step_into
1.11 perf-profile.children.cycles-pp.security_inode_getattr
2.27 perf-profile.children.cycles-pp.kmem_cache_free
1.21 ? 2% perf-profile.children.cycles-pp.path_init
1.16 ? 2% perf-profile.children.cycles-pp.lockref_put_return
1.78 perf-profile.children.cycles-pp.schedule
1.34 ? 2% perf-profile.children.cycles-pp.lookup_fast
1.24 perf-profile.children.cycles-pp.__sched_yield
1.45 perf-profile.children.cycles-pp.cp_new_stat
1.60 ? 2% perf-profile.children.cycles-pp.walk_component
1.50 perf-profile.children.cycles-pp.path_put
1.96 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
1.58 perf-profile.children.cycles-pp.dput
1.86 perf-profile.children.cycles-pp.check_heap_object
1.79 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
2.00 perf-profile.children.cycles-pp.lockref_get_not_dead
2.46 perf-profile.children.cycles-pp.entry_SYSCALL_64
2.50 perf-profile.children.cycles-pp.kmem_cache_alloc
2.52 perf-profile.children.cycles-pp.__legitimize_path
2.73 perf-profile.children.cycles-pp.try_to_unlazy
2.85 perf-profile.children.cycles-pp.complete_walk
3.81 perf-profile.children.cycles-pp.link_path_walk
3.75 perf-profile.children.cycles-pp.__check_object_size
6.52 perf-profile.children.cycles-pp.strncpy_from_user
7.21 perf-profile.children.cycles-pp.do_statx
9.97 perf-profile.children.cycles-pp.path_lookupat
10.82 perf-profile.children.cycles-pp.getname_flags
11.70 perf-profile.children.cycles-pp.filename_lookup
14.48 perf-profile.children.cycles-pp.__x64_sys_statx
16.49 perf-profile.children.cycles-pp.vfs_statx
16.91 perf-profile.children.cycles-pp.vfs_fstatat
19.42 perf-profile.children.cycles-pp.__do_sys_newfstatat
18.92 perf-profile.children.cycles-pp.statx
23.87 perf-profile.children.cycles-pp.fstatat64
32.29 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.33 perf-profile.self.cycles-pp.queued_write_lock_slowpath
0.12 ? 3% perf-profile.self.cycles-pp.update_sg_lb_stats
0.05 ? 8% perf-profile.self.cycles-pp.smp_call_function_many_cond
0.58 perf-profile.self.cycles-pp.__memcpy
0.46 perf-profile.self.cycles-pp.clear_page_erms
0.06 perf-profile.self.cycles-pp.__free_pages
0.12 perf-profile.self.cycles-pp.sched_balance_find_dst_cpu
0.08 perf-profile.self.cycles-pp.available_idle_cpu
0.07 perf-profile.self.cycles-pp.__dequeue_entity
0.10 perf-profile.self.cycles-pp.__switch_to
0.07 perf-profile.self.cycles-pp.shmem_is_huge
0.07 ? 5% perf-profile.self.cycles-pp.os_xsave
0.19 perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
0.11 perf-profile.self.cycles-pp.___perf_sw_event
0.14 ? 3% perf-profile.self.cycles-pp.stress_fstat_thread
0.08 perf-profile.self.cycles-pp.pick_eevdf
0.07 perf-profile.self.cycles-pp.mntput
0.07 perf-profile.self.cycles-pp.should_failslab
0.09 perf-profile.self.cycles-pp.__enqueue_entity
0.09 ? 5% perf-profile.self.cycles-pp.legitimize_links
0.11 perf-profile.self.cycles-pp.amd_clear_divider
0.12 perf-profile.self.cycles-pp.__legitimize_path
0.11 perf-profile.self.cycles-pp.pick_next_task_fair
0.13 ? 2% perf-profile.self.cycles-pp.complete_walk
0.14 perf-profile.self.cycles-pp.dput
0.14 ? 2% perf-profile.self.cycles-pp.security_inode_getattr
0.15 perf-profile.self.cycles-pp.mntput_no_expire
0.16 ? 2% perf-profile.self.cycles-pp.try_to_unlazy
0.90 perf-profile.self.cycles-pp._raw_spin_lock
0.15 ? 2% perf-profile.self.cycles-pp.is_vmalloc_addr
0.22 perf-profile.self.cycles-pp.switch_mm_irqs_off
0.18 perf-profile.self.cycles-pp.__get_user_1
0.17 ? 2% perf-profile.self.cycles-pp.security_inode_permission
0.17 ? 2% perf-profile.self.cycles-pp.terminate_walk
0.36 ? 2% perf-profile.self.cycles-pp.update_curr
0.19 ? 2% perf-profile.self.cycles-pp.nd_jump_root
0.24 perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.23 perf-profile.self.cycles-pp.rcu_all_qs
0.05 perf-profile.self.cycles-pp.from_kgid_munged
0.05 perf-profile.self.cycles-pp.make_vfsgid
0.05 perf-profile.self.cycles-pp.path_put
0.06 ? 9% perf-profile.self.cycles-pp.__x64_sys_newfstatat
0.24 ? 2% perf-profile.self.cycles-pp.walk_component
0.28 perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
0.23 ? 12% perf-profile.self.cycles-pp.__lookup_mnt
0.29 ? 5% perf-profile.self.cycles-pp.__fdget_raw
0.27 perf-profile.self.cycles-pp.lookup_fast
0.27 perf-profile.self.cycles-pp.shmem_getattr
0.25 perf-profile.self.cycles-pp.make_vfsuid
0.28 perf-profile.self.cycles-pp.vfs_fstatat
0.30 perf-profile.self.cycles-pp.cp_statx
0.32 perf-profile.self.cycles-pp.generic_fillattr
0.31 perf-profile.self.cycles-pp.check_stack_object
0.33 perf-profile.self.cycles-pp.map_id_up
0.33 perf-profile.self.cycles-pp.__x64_sys_statx
0.40 perf-profile.self.cycles-pp.__cond_resched
0.32 ? 3% perf-profile.self.cycles-pp.inode_permission
0.40 perf-profile.self.cycles-pp.path_init
0.44 perf-profile.self.cycles-pp.__schedule
0.42 perf-profile.self.cycles-pp.path_lookupat
0.42 ? 2% perf-profile.self.cycles-pp.__legitimize_mnt
0.43 ? 6% perf-profile.self.cycles-pp.set_root
0.56 perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.56 ? 3% perf-profile.self.cycles-pp.generic_permission
0.56 perf-profile.self.cycles-pp.vfs_getattr_nosec
0.63 perf-profile.self.cycles-pp.entry_SYSCALL_64
0.57 perf-profile.self.cycles-pp.cp_new_stat
0.64 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.77 perf-profile.self.cycles-pp.__check_heap_object
0.72 perf-profile.self.cycles-pp.vfs_statx
0.73 ? 2% perf-profile.self.cycles-pp.step_into
0.72 ? 5% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.77 perf-profile.self.cycles-pp.check_heap_object
0.77 perf-profile.self.cycles-pp.__check_object_size
0.88 perf-profile.self.cycles-pp._copy_to_user
0.83 perf-profile.self.cycles-pp.shim_statx
0.84 perf-profile.self.cycles-pp.common_perm_cond
0.62 ? 27% perf-profile.self.cycles-pp.stress_fstat_helper
0.84 ? 4% perf-profile.self.cycles-pp.__d_lookup_rcu
0.84 ? 2% perf-profile.self.cycles-pp.__virt_addr_valid
0.88 perf-profile.self.cycles-pp.do_statx
0.90 perf-profile.self.cycles-pp.__do_sys_newfstatat
1.04 perf-profile.self.cycles-pp.do_syscall_64
1.02 perf-profile.self.cycles-pp.putname
1.05 perf-profile.self.cycles-pp.getname_flags
1.12 ? 3% perf-profile.self.cycles-pp.link_path_walk
1.16 perf-profile.self.cycles-pp.fstatat64
1.26 perf-profile.self.cycles-pp.statx
1.12 ? 2% perf-profile.self.cycles-pp.lockref_put_return
1.60 perf-profile.self.cycles-pp.kmem_cache_free
1.90 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
1.70 perf-profile.self.cycles-pp.filename_lookup
1.89 perf-profile.self.cycles-pp.kmem_cache_alloc
1.96 perf-profile.self.cycles-pp.lockref_get_not_dead
2.83 perf-profile.self.cycles-pp.strncpy_from_user
*************************************************************************
threads 2 sockets Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (Cascade Lake) with 128G memory
===============================================================
disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
12G/xfs/x86_64-rhel-8.3/300/RAID1/debian-12-x86_64-20240206.cgz/lkp-csl-2sp3/sync_disk_rw/aim7
Remove shift of thermal clock")
Reschedule the cfs_rq when current is ineligible")
e2bbd1c498980c5cb68f9973f41
---------------------------
%stddev
\
+15.8% 28212175 cpuidle..usage
34.87 ? 2% iostat.cpu.idle
-6.3% 64.25 iostat.cpu.system
+10.4% 831869 ? 4% meminfo.Inactive(anon)
+45.9% 239358 ? 11% meminfo.Mapped
6775 ? 2% perf-c2c.DRAM.remote
4151 ? 2% perf-c2c.HITM.remote
+7.8% 698666 ? 2% vmstat.io.bo
83.64 ? 4% vmstat.procs.r
+16.2% 930980 ? 2% vmstat.system.cs
33.67 ? 3% mpstat.cpu.all.idle%
0.04 ? 5% mpstat.cpu.all.iowait%
0.85 ? 3% mpstat.cpu.all.usr%
76.32 mpstat.max_utilization_pct
+9.6% 17753 aim7.jobs-per-min
-8.7% 101.44 aim7.time.elapsed_time
-8.7% 101.44 aim7.time.elapsed_time.max
+75.1% 7669146 aim7.time.involuntary_context_switches
5983 aim7.time.percent_of_cpu_this_job_got
6042 aim7.time.system_time
-4.9% 50219682 aim7.time.voluntary_context_switches
+3.7% 348553 ? 2% proc-vmstat.nr_active_anon
+2.6% 1210206 proc-vmstat.nr_file_pages
+10.4% 208211 ? 4% proc-vmstat.nr_inactive_anon
60654 ? 11% proc-vmstat.nr_mapped
+8.6% 391938 proc-vmstat.nr_shmem
+3.7% 348553 ? 2% proc-vmstat.nr_zone_active_anon
+10.4% 208211 ? 4% proc-vmstat.nr_zone_inactive_anon
-21.8% 1153102 sched_debug.cfs_rq:/.avg_vruntime.avg
-20.9% 1300217 ? 3% sched_debug.cfs_rq:/.avg_vruntime.max
-21.2% 1098002 ? 2% sched_debug.cfs_rq:/.avg_vruntime.min
24315 ? 12% sched_debug.cfs_rq:/.avg_vruntime.stddev
0.57 ? 7% sched_debug.cfs_rq:/.h_nr_running.avg
-21.8% 1153102 sched_debug.cfs_rq:/.min_vruntime.avg
-20.9% 1300217 ? 3% sched_debug.cfs_rq:/.min_vruntime.max
-21.2% 1098002 ? 2% sched_debug.cfs_rq:/.min_vruntime.min
24315 ? 12% sched_debug.cfs_rq:/.min_vruntime.stddev
0.42 ? 4% sched_debug.cfs_rq:/.nr_running.avg
-26.5% 627.20 sched_debug.cfs_rq:/.runnable_avg.avg
1983 ? 17% sched_debug.cfs_rq:/.runnable_avg.max
-37.5% 367.74 ? 5% sched_debug.cfs_rq:/.runnable_avg.stddev
1612 ? 16% sched_debug.cfs_rq:/.util_avg.max
-21.5% 309.35 ? 5% sched_debug.cfs_rq:/.util_avg.stddev
-41.5% 159.47 ? 7% sched_debug.cfs_rq:/.util_est.avg
-43.3% 214.15 ? 16% sched_debug.cfs_rq:/.util_est.stddev
0.55 ? 7% sched_debug.cpu.nr_running.avg
+20.1% 265968 sched_debug.cpu.nr_switches.avg
+18.7% 283476 ? 2% sched_debug.cpu.nr_switches.max
9124 ? 9% sched_debug.cpu.nr_switches.stddev
1.62 ? 3% perf-stat.i.MPKI
1.36 ? 3% perf-stat.i.branch-miss-rate%
+16.3% 1.003e+08 ? 4% perf-stat.i.branch-misses
+12.5% 90980799 ? 2% perf-stat.i.cache-misses
+16.5% 952574 ? 2% perf-stat.i.context-switches
3.57 perf-stat.i.cpi
-5.3% 2.02e+11 perf-stat.i.cpu-cycles
2272 ? 3% perf-stat.i.cycles-between-cache-misses
0.34 perf-stat.i.ipc
12.23 ? 2% perf-stat.i.metric.K/sec
7047 ? 3% perf-stat.i.minor-faults
7048 ? 3% perf-stat.i.page-faults
1.70 ? 2% perf-stat.overall.MPKI
0.92 ? 4% perf-stat.overall.branch-miss-rate%
3.78 perf-stat.overall.cpi
2222 ? 2% perf-stat.overall.cycles-between-cache-misses
0.26 perf-stat.overall.ipc
+16.3% 99200665 ? 4% perf-stat.ps.branch-misses
+12.5% 90070323 ? 2% perf-stat.ps.cache-misses
+16.5% 943274 ? 2% perf-stat.ps.context-switches
-5.3% 2e+11 perf-stat.ps.cpu-cycles
-6.5% 5.47e+12 ? 2% perf-stat.total.instructions
7.88 ? 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.md_flush_request.raid1_make_request.md_handle_request
7.96 ? 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.md_flush_request.raid1_make_request.md_handle_request.__submit_bio
9.25 ? 3% perf-profile.calltrace.cycles-pp.md_flush_request.raid1_make_request.md_handle_request.__submit_bio.__submit_bio_noacct
9.27 ? 3% perf-profile.calltrace.cycles-pp.md_handle_request.__submit_bio.__submit_bio_noacct.submit_bio_wait.blkdev_issue_flush
9.30 ? 3% perf-profile.calltrace.cycles-pp.__submit_bio_noacct.submit_bio_wait.blkdev_issue_flush.xfs_file_fsync.xfs_file_buffered_write
9.26 ? 3% perf-profile.calltrace.cycles-pp.raid1_make_request.md_handle_request.__submit_bio.__submit_bio_noacct.submit_bio_wait
9.30 ? 3% perf-profile.calltrace.cycles-pp.__submit_bio.__submit_bio_noacct.submit_bio_wait.blkdev_issue_flush.xfs_file_fsync
9.41 ? 3% perf-profile.calltrace.cycles-pp.submit_bio_wait.blkdev_issue_flush.xfs_file_fsync.xfs_file_buffered_write.vfs_write
9.43 ? 3% perf-profile.calltrace.cycles-pp.blkdev_issue_flush.xfs_file_fsync.xfs_file_buffered_write.vfs_write.ksys_write
81.04 perf-profile.calltrace.cycles-pp.xfs_file_fsync.xfs_file_buffered_write.vfs_write.ksys_write.do_syscall_64
82.54 perf-profile.calltrace.cycles-pp.xfs_file_buffered_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
82.68 perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
82.69 perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
82.88 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
82.88 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
83.13 perf-profile.calltrace.cycles-pp.write
2.10 ? 3% perf-profile.calltrace.cycles-pp.xlog_wait_on_iclog.xfs_log_force_seq.xfs_file_fsync.xfs_file_buffered_write.vfs_write
1.86 ? 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.remove_wait_queue.xlog_wait_on_iclog.xfs_log_force_seq.xfs_file_fsync
1.86 ? 2% perf-profile.calltrace.cycles-pp.remove_wait_queue.xlog_wait_on_iclog.xfs_log_force_seq.xfs_file_fsync.xfs_file_buffered_write
1.85 ? 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.remove_wait_queue.xlog_wait_on_iclog.xfs_log_force_seq
2.78 ? 3% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.remove_wait_queue.xlog_cil_force_seq.xfs_log_force_seq.xfs_file_fsync
2.76 ? 3% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.remove_wait_queue.xlog_cil_force_seq.xfs_log_force_seq
2.78 ? 3% perf-profile.calltrace.cycles-pp.remove_wait_queue.xlog_cil_force_seq.xfs_log_force_seq.xfs_file_fsync.xfs_file_buffered_write
0.80 perf-profile.calltrace.cycles-pp.kiocb_modified.xfs_file_write_checks.xfs_file_buffered_write.vfs_write.ksys_write
0.77 perf-profile.calltrace.cycles-pp.xfs_vn_update_time.kiocb_modified.xfs_file_write_checks.xfs_file_buffered_write.vfs_write
0.55 ? 2% perf-profile.calltrace.cycles-pp.__xfs_trans_commit.xfs_vn_update_time.kiocb_modified.xfs_file_write_checks.xfs_file_buffered_write
0.56 perf-profile.calltrace.cycles-pp.xfs_end_ioend.xfs_end_io.process_one_work.worker_thread.kthread
0.57 perf-profile.calltrace.cycles-pp.xfs_end_io.process_one_work.worker_thread.kthread.ret_from_fork
0.62 ? 3% perf-profile.calltrace.cycles-pp.iomap_file_buffered_write.xfs_file_buffered_write.vfs_write.ksys_write.do_syscall_64
0.99 ? 3% perf-profile.calltrace.cycles-pp.copy_to_brd.brd_submit_bio.__submit_bio.__submit_bio_noacct.iomap_submit_ioend
2.68 ? 2% perf-profile.calltrace.cycles-pp.__submit_bio.__submit_bio_noacct.iomap_submit_ioend.xfs_vm_writepages.do_writepages
2.70 ? 2% perf-profile.calltrace.cycles-pp.__submit_bio_noacct.iomap_submit_ioend.xfs_vm_writepages.do_writepages.filemap_fdatawrite_wbc
2.22 perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
2.73 ? 2% perf-profile.calltrace.cycles-pp.iomap_submit_ioend.xfs_vm_writepages.do_writepages.filemap_fdatawrite_wbc.__filemap_fdatawrite_range
1.29 ? 3% perf-profile.calltrace.cycles-pp.brd_submit_bio.__submit_bio.__submit_bio_noacct.iomap_submit_ioend.xfs_vm_writepages
2.38 perf-profile.calltrace.cycles-pp.kthread.ret_from_fork.ret_from_fork_asm
2.38 perf-profile.calltrace.cycles-pp.ret_from_fork.ret_from_fork_asm
2.38 perf-profile.calltrace.cycles-pp.ret_from_fork_asm
2.37 perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
1.58 perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.__flush_workqueue.xlog_cil_push_now.xlog_cil_force_seq
55.70 perf-profile.calltrace.cycles-pp.osq_lock.__mutex_lock.__flush_workqueue.xlog_cil_push_now.xlog_cil_force_seq
2.86 ? 10% perf-profile.calltrace.cycles-pp.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
65.62 perf-profile.calltrace.cycles-pp.xlog_cil_force_seq.xfs_log_force_seq.xfs_file_fsync.xfs_file_buffered_write.vfs_write
9.42 perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
57.86 perf-profile.calltrace.cycles-pp.__mutex_lock.__flush_workqueue.xlog_cil_push_now.xlog_cil_force_seq.xfs_log_force_seq
60.76 perf-profile.calltrace.cycles-pp.__flush_workqueue.xlog_cil_push_now.xlog_cil_force_seq.xfs_log_force_seq.xfs_file_fsync
62.18 perf-profile.calltrace.cycles-pp.xlog_cil_push_now.xlog_cil_force_seq.xfs_log_force_seq.xfs_file_fsync.xfs_file_buffered_write
12.86 ? 3% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
12.86 ? 3% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
13.08 ? 3% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.common_startup_64
13.94 ? 2% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.common_startup_64
13.94 ? 2% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.common_startup_64
13.94 ? 2% perf-profile.calltrace.cycles-pp.start_secondary.common_startup_64
14.08 ? 2% perf-profile.calltrace.cycles-pp.common_startup_64
15.63 ? 2% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
8.46 ? 2% perf-profile.children.cycles-pp._raw_spin_lock_irq
9.34 ? 3% perf-profile.children.cycles-pp.md_flush_request
10.06 ? 3% perf-profile.children.cycles-pp.md_handle_request
10.04 ? 3% perf-profile.children.cycles-pp.raid1_make_request
9.41 ? 3% perf-profile.children.cycles-pp.submit_bio_wait
9.43 ? 3% perf-profile.children.cycles-pp.blkdev_issue_flush
12.26 ? 2% perf-profile.children.cycles-pp.__submit_bio
12.28 ? 2% perf-profile.children.cycles-pp.__submit_bio_noacct
81.04 perf-profile.children.cycles-pp.xfs_file_fsync
82.54 perf-profile.children.cycles-pp.xfs_file_buffered_write
82.70 perf-profile.children.cycles-pp.vfs_write
82.71 perf-profile.children.cycles-pp.ksys_write
83.03 perf-profile.children.cycles-pp.do_syscall_64
83.04 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
83.18 perf-profile.children.cycles-pp.write
5.71 ? 2% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
4.82 ? 2% perf-profile.children.cycles-pp.remove_wait_queue
2.23 ? 3% perf-profile.children.cycles-pp.xlog_wait_on_iclog
0.21 ? 5% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
0.19 ? 4% perf-profile.children.cycles-pp.sysvec_call_function_single
0.18 ? 5% perf-profile.children.cycles-pp.__sysvec_call_function_single
0.10 ? 7% perf-profile.children.cycles-pp.sb_clear_inode_writeback
0.10 ? 8% perf-profile.children.cycles-pp.sb_mark_inode_writeback
0.29 ? 4% perf-profile.children.cycles-pp.__folio_end_writeback
0.17 ? 4% perf-profile.children.cycles-pp.__folio_start_writeback
0.80 perf-profile.children.cycles-pp.kiocb_modified
0.78 perf-profile.children.cycles-pp.xfs_vn_update_time
0.30 ? 4% perf-profile.children.cycles-pp.iomap_writepage_map
0.40 ? 3% perf-profile.children.cycles-pp.iomap_writepages
0.08 ? 7% perf-profile.children.cycles-pp.xfs_log_ticket_ungrant
0.11 ? 3% perf-profile.children.cycles-pp.xlog_state_clean_iclog
0.06 perf-profile.children.cycles-pp.__update_blocked_fair
0.06 perf-profile.children.cycles-pp.kmem_cache_free
0.12 perf-profile.children.cycles-pp.llseek
0.08 perf-profile.children.cycles-pp.switch_fpu_return
0.07 perf-profile.children.cycles-pp.sched_clock
0.07 perf-profile.children.cycles-pp.wake_page_function
0.11 ? 4% perf-profile.children.cycles-pp.xfs_buffered_write_iomap_begin
0.12 ? 4% perf-profile.children.cycles-pp.__switch_to_asm
0.06 ? 7% perf-profile.children.cycles-pp.ktime_get
0.07 ? 7% perf-profile.children.cycles-pp.xfs_btree_lookup_get_block
0.08 perf-profile.children.cycles-pp.__switch_to
0.06 ? 7% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.07 ? 7% perf-profile.children.cycles-pp.mutex_lock
0.09 ? 5% perf-profile.children.cycles-pp.select_idle_core
0.07 ? 5% perf-profile.children.cycles-pp.llist_add_batch
0.07 ? 5% perf-profile.children.cycles-pp.mutex_unlock
0.08 ? 6% perf-profile.children.cycles-pp.perf_tp_event
0.15 ? 2% perf-profile.children.cycles-pp.xlog_cil_committed
0.14 ? 4% perf-profile.children.cycles-pp.iomap_iter
0.10 ? 3% perf-profile.children.cycles-pp.xfs_btree_lookup
0.15 ? 3% perf-profile.children.cycles-pp.xlog_cil_process_committed
0.15 perf-profile.children.cycles-pp.xlog_cil_write_commit_record
0.08 ? 5% perf-profile.children.cycles-pp.submit_flushes
0.15 ? 2% perf-profile.children.cycles-pp.xlog_cil_set_ctx_write_state
0.11 ? 6% perf-profile.children.cycles-pp.kick_pool
0.16 ? 2% perf-profile.children.cycles-pp.xfs_bmap_add_extent_unwritten_real
0.13 ? 4% perf-profile.children.cycles-pp.__queue_work
0.10 ? 8% perf-profile.children.cycles-pp.__smp_call_single_queue
0.15 ? 2% perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
0.17 ? 2% perf-profile.children.cycles-pp.prepare_task_switch
0.13 ? 5% perf-profile.children.cycles-pp.sched_balance_update_blocked_averages
0.18 ? 4% perf-profile.children.cycles-pp.bio_alloc_bioset
0.20 ? 3% perf-profile.children.cycles-pp.xfs_bmapi_write
0.10 ? 4% perf-profile.children.cycles-pp.__cond_resched
0.17 ? 2% perf-profile.children.cycles-pp.xfs_bmapi_convert_unwritten
0.30 perf-profile.children.cycles-pp.select_idle_sibling
0.20 ? 4% perf-profile.children.cycles-pp.available_idle_cpu
0.14 ? 4% perf-profile.children.cycles-pp.queue_work_on
0.15 ? 4% perf-profile.children.cycles-pp.ttwu_queue_wakelist
0.60 ? 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.25 ? 2% perf-profile.children.cycles-pp.select_idle_cpu
0.18 ? 3% perf-profile.children.cycles-pp.menu_select
0.41 perf-profile.children.cycles-pp.xfs_iomap_write_unwritten
0.56 perf-profile.children.cycles-pp.xfs_end_ioend
0.57 perf-profile.children.cycles-pp.xfs_end_io
0.63 ? 3% perf-profile.children.cycles-pp.iomap_file_buffered_write
0.53 perf-profile.children.cycles-pp.sched_ttwu_pending
0.06 ? 16% perf-profile.children.cycles-pp.poll_idle
0.63 perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.32 ? 2% perf-profile.children.cycles-pp.flush_workqueue_prep_pwqs
0.28 ? 2% perf-profile.children.cycles-pp.schedule_idle
2.22 perf-profile.children.cycles-pp.process_one_work
2.73 ? 2% perf-profile.children.cycles-pp.iomap_submit_ioend
1.15 ? 2% perf-profile.children.cycles-pp.copy_to_brd
2.38 perf-profile.children.cycles-pp.kthread
2.38 perf-profile.children.cycles-pp.ret_from_fork
2.38 perf-profile.children.cycles-pp.ret_from_fork_asm
2.37 perf-profile.children.cycles-pp.worker_thread
1.46 ? 2% perf-profile.children.cycles-pp.brd_submit_bio
0.50 ? 2% perf-profile.children.cycles-pp.flush_smp_call_function_queue
1.05 ? 2% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.27 ? 2% perf-profile.children.cycles-pp.wake_up_q
1.46 perf-profile.children.cycles-pp.try_to_wake_up
0.31 ? 18% perf-profile.children.cycles-pp.schedule_preempt_disabled
0.40 perf-profile.children.cycles-pp.__mutex_unlock_slowpath
1.58 perf-profile.children.cycles-pp.mutex_spin_on_owner
55.72 perf-profile.children.cycles-pp.osq_lock
3.30 ? 10% perf-profile.children.cycles-pp.intel_idle_irq
65.62 perf-profile.children.cycles-pp.xlog_cil_force_seq
9.52 perf-profile.children.cycles-pp.intel_idle
57.86 perf-profile.children.cycles-pp.__mutex_lock
60.76 perf-profile.children.cycles-pp.__flush_workqueue
62.18 perf-profile.children.cycles-pp.xlog_cil_push_now
12.99 ? 3% perf-profile.children.cycles-pp.cpuidle_enter_state
12.99 ? 3% perf-profile.children.cycles-pp.cpuidle_enter
13.22 ? 3% perf-profile.children.cycles-pp.cpuidle_idle_call
13.94 ? 2% perf-profile.children.cycles-pp.start_secondary
14.08 ? 2% perf-profile.children.cycles-pp.do_idle
14.08 ? 2% perf-profile.children.cycles-pp.common_startup_64
14.08 ? 2% perf-profile.children.cycles-pp.cpu_startup_entry
15.60 ? 2% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.07 ? 7% perf-profile.self.cycles-pp.xfs_log_ticket_ungrant
0.06 ? 6% perf-profile.self.cycles-pp.finish_task_switch
0.07 ? 5% perf-profile.self.cycles-pp.llist_add_batch
0.07 ? 5% perf-profile.self.cycles-pp.mutex_unlock
0.08 perf-profile.self.cycles-pp.__switch_to
0.10 ? 4% perf-profile.self.cycles-pp.prepare_task_switch
0.08 ? 4% perf-profile.self.cycles-pp.try_to_wake_up
0.16 ? 5% perf-profile.self.cycles-pp.__schedule
0.20 ? 3% perf-profile.self.cycles-pp.available_idle_cpu
0.10 ? 5% perf-profile.self.cycles-pp.menu_select
0.41 perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.16 ? 3% perf-profile.self.cycles-pp.flush_workqueue_prep_pwqs
0.28 ? 3% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.05 perf-profile.self.cycles-pp.vfs_write
0.14 ? 6% perf-profile.self.cycles-pp.__mutex_lock
0.06 ? 13% perf-profile.self.cycles-pp.poll_idle
0.45 ? 3% perf-profile.self.cycles-pp._raw_spin_lock
1.12 ? 2% perf-profile.self.cycles-pp.copy_to_brd
1.57 perf-profile.self.cycles-pp.mutex_spin_on_owner
55.30 perf-profile.self.cycles-pp.osq_lock
3.22 ? 10% perf-profile.self.cycles-pp.intel_idle_irq
9.52 perf-profile.self.cycles-pp.intel_idle
*************************************************************************
threads 2 sockets Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (Cascade Lake) with 128G memory
===============================================================
freq_governor/kconfig/nr_task/rootfs/runtime/target/tbox_group/testcase:
ance/x86_64-rhel-8.3/200%/debian-12-x86_64-20240206.cgz/300s/vmlinux/lkp-csl-2sp3/kbuild
Remove shift of thermal clock")
Reschedule the cfs_rq when current is ineligible")
e2bbd1c498980c5cb68f9973f41
---------------------------
%stddev
\
-11.3% 1724001 cpuidle..usage
0.07 mpstat.cpu.all.soft%
51728 vmstat.system.cs
+2.3% 67537 vmstat.system.in
+18.7% 283496 ? 6% numa-meminfo.node1.Active
+18.7% 283496 ? 6% numa-meminfo.node1.Active(anon)
+19.9% 302260 ? 6% numa-meminfo.node1.Shmem
70769 ? 6% numa-vmstat.node1.nr_active_anon
75559 ? 6% numa-vmstat.node1.nr_shmem
70769 ? 6% numa-vmstat.node1.nr_zone_active_anon
59.17 ? 17% perf-c2c.DRAM.remote
+36.3% 135.67 ? 10% perf-c2c.HITM.local
40.17 ? 18% perf-c2c.HITM.remote
+22.5% 326455 meminfo.Active
+22.5% 326455 meminfo.Active(anon)
85005 meminfo.Mapped
+23.8% 352101 meminfo.Shmem
+1.4% 51.35 kbuild.buildtime_per_iteration
+1.4% 51.35 kbuild.real_time_per_iteration
+1.3% 164.26 kbuild.sys_time_per_iteration
+259.3% 15415002 kbuild.time.involuntary_context_switches
5444 kbuild.time.percent_of_cpu_this_job_got
+1.3% 996.08 kbuild.time.system_time
+2.3% 17819 kbuild.time.user_time
2965 kbuild.user_time_per_iteration
81776 proc-vmstat.nr_active_anon
+2.1% 1522533 proc-vmstat.nr_anon_pages
+2.2% 1529304 proc-vmstat.nr_inactive_anon
21461 proc-vmstat.nr_mapped
9732 proc-vmstat.nr_page_table_pages
88073 proc-vmstat.nr_shmem
81776 proc-vmstat.nr_zone_active_anon
+2.2% 1529304 proc-vmstat.nr_zone_inactive_anon
+7.4% 69486 proc-vmstat.numa_huge_pte_updates
+7.4% 35829277 proc-vmstat.numa_pte_updates
+4.4% 394459 proc-vmstat.pgactivate
2.76 perf-stat.i.branch-miss-rate%
+3.1% 8.224e+08 perf-stat.i.branch-misses
25.84 perf-stat.i.cache-miss-rate%
+3.3% 2.843e+08 perf-stat.i.cache-misses
+11.4% 9.895e+08 perf-stat.i.cache-references
52066 perf-stat.i.context-switches
1.09 perf-stat.i.cpi
+1.2% 1.557e+11 perf-stat.i.cpu-cycles
-5.5% 664.26 perf-stat.i.cpu-migrations
1.04 perf-stat.i.ipc
11.23 ? 2% perf-stat.i.major-faults
-1.1% 325687 perf-stat.i.minor-faults
-1.1% 325698 perf-stat.i.page-faults
2.40 perf-stat.overall.MPKI
3.20 perf-stat.overall.branch-miss-rate%
28.74 perf-stat.overall.cache-miss-rate%
1.31 perf-stat.overall.cpi
-2.0% 547.74 perf-stat.overall.cycles-between-cache-misses
0.76 perf-stat.overall.ipc
+2.9% 8.199e+08 perf-stat.ps.branch-misses
+3.2% 2.835e+08 perf-stat.ps.cache-misses
+11.2% 9.865e+08 perf-stat.ps.cache-references
51908 perf-stat.ps.context-switches
+1.1% 1.553e+11 perf-stat.ps.cpu-cycles
-5.6% 662.09 perf-stat.ps.cpu-migrations
-1.1% 1.183e+11 perf-stat.ps.instructions
11.20 ? 2% perf-stat.ps.major-faults
-1.2% 324786 perf-stat.ps.minor-faults
-1.2% 324797 perf-stat.ps.page-faults
0.40 sched_debug.cfs_rq:/.h_nr_running.avg
1.58 ? 8% sched_debug.cfs_rq:/.h_nr_running.max
0.17 sched_debug.cfs_rq:/.h_nr_running.min
0.31 ? 3% sched_debug.cfs_rq:/.h_nr_running.stddev
-48.8% 864.97 sched_debug.cfs_rq:/.load.min
0.83 sched_debug.cfs_rq:/.load_avg.min
0.22 sched_debug.cfs_rq:/.nr_running.avg
0.17 sched_debug.cfs_rq:/.nr_running.min
0.20 ? 3% sched_debug.cfs_rq:/.nr_running.stddev
-43.4% 419.70 sched_debug.cfs_rq:/.runnable_avg.avg
1368 ? 2% sched_debug.cfs_rq:/.runnable_avg.max
-49.2% 157.89 ? 7% sched_debug.cfs_rq:/.runnable_avg.min
-21.0% 259.55 ? 3% sched_debug.cfs_rq:/.runnable_avg.stddev
-38.5% 248.50 ? 2% sched_debug.cfs_rq:/.util_avg.avg
88.44 ? 14% sched_debug.cfs_rq:/.util_avg.min
52.39 ? 2% sched_debug.cfs_rq:/.util_est.avg
70.04 ? 4% sched_debug.cfs_rq:/.util_est.stddev
+17.0% 843275 ? 3% sched_debug.cpu.avg_idle.avg
6.81 ? 5% sched_debug.cpu.clock.stddev
30600 ? 2% sched_debug.cpu.curr->pid.avg
12586 ? 11% sched_debug.cpu.curr->pid.stddev
0.40 ? 2% sched_debug.cpu.nr_running.avg
0.17 sched_debug.cpu.nr_running.min
0.31 ? 5% sched_debug.cpu.nr_running.stddev
80145 sched_debug.cpu.nr_switches.avg
94514 ? 2% sched_debug.cpu.nr_switches.max
73354 sched_debug.cpu.nr_switches.min
-18.1% -88.19 sched_debug.cpu.nr_uninterruptible.min
30.49 ? 5% sched_debug.cpu.nr_uninterruptible.stddev
2.69 ? 8% perf-profile.calltrace.cycles-pp.common_startup_64
2.66 ? 9% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.common_startup_64
2.66 ? 9% perf-profile.calltrace.cycles-pp.start_secondary.common_startup_64
2.66 ? 9% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.common_startup_64
2.65 ? 9% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.common_startup_64
2.62 ? 9% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
2.61 ? 9% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
2.44 ? 9% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
0.59 ? 3% perf-profile.calltrace.cycles-pp.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
1.46 perf-profile.calltrace.cycles-pp.open64
1.06 ? 2% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
1.06 ? 2% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.92 perf-profile.calltrace.cycles-pp.malloc
0.87 ? 2% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
0.56 ? 2% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
2.69 ? 8% perf-profile.children.cycles-pp.common_startup_64
2.69 ? 8% perf-profile.children.cycles-pp.cpu_startup_entry
2.69 ? 8% perf-profile.children.cycles-pp.do_idle
2.69 ? 8% perf-profile.children.cycles-pp.cpuidle_idle_call
2.66 ? 9% perf-profile.children.cycles-pp.start_secondary
2.65 ? 8% perf-profile.children.cycles-pp.cpuidle_enter
2.65 ? 8% perf-profile.children.cycles-pp.cpuidle_enter_state
2.47 ? 8% perf-profile.children.cycles-pp.intel_idle
1.76 ? 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
1.48 perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.21 ? 5% perf-profile.children.cycles-pp.irq_exit_rcu
0.05 ? 8% perf-profile.children.cycles-pp.free_pcppages_bulk
0.09 ? 4% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
0.06 perf-profile.children.cycles-pp.perf_rotate_context
0.26 perf-profile.children.cycles-pp.ggc_free(void*)
0.09 ? 4% perf-profile.children.cycles-pp.dput
0.27 perf-profile.children.cycles-pp._cpp_pop_context
0.29 ? 2% perf-profile.children.cycles-pp.mmap_region
0.10 ? 3% perf-profile.children.cycles-pp.__split_vma
0.15 ? 3% perf-profile.children.cycles-pp.do_vmi_align_munmap
0.14 ? 3% perf-profile.children.cycles-pp.ksys_mmap_pgoff
0.34 perf-profile.children.cycles-pp.mark_exp_read(tree_node*)
0.14 perf-profile.children.cycles-pp.do_vmi_munmap
0.14 ? 3% perf-profile.children.cycles-pp.update_load_avg
0.13 ? 5% perf-profile.children.cycles-pp.next_uptodate_folio
0.23 ? 3% perf-profile.children.cycles-pp.filemap_map_pages
0.37 ? 2% perf-profile.children.cycles-pp.walk_component
0.35 ? 2% perf-profile.children.cycles-pp.do_mmap
0.37 ? 2% perf-profile.children.cycles-pp.vm_mmap_pgoff
0.32 ? 3% perf-profile.children.cycles-pp.lookup_name(tree_node*)
0.25 ? 3% perf-profile.children.cycles-pp.do_read_fault
0.08 ? 5% perf-profile.children.cycles-pp.smpboot_thread_fn
0.30 ? 3% perf-profile.children.cycles-pp.do_fault
0.05 perf-profile.children.cycles-pp.__rseq_handle_notify_resume
0.05 perf-profile.children.cycles-pp.__update_load_avg_se
1.26 perf-profile.children.cycles-pp.path_openat
1.47 perf-profile.children.cycles-pp.open64
0.06 ? 7% perf-profile.children.cycles-pp.run_ksoftirqd
1.28 perf-profile.children.cycles-pp.do_filp_open
1.49 perf-profile.children.cycles-pp.__x64_sys_openat
1.48 perf-profile.children.cycles-pp.do_sys_openat2
1.95 perf-profile.children.cycles-pp.malloc
0.11 ? 4% perf-profile.children.cycles-pp.pick_next_task_fair
4.43 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
4.42 perf-profile.children.cycles-pp.do_syscall_64
0.30 ? 3% perf-profile.children.cycles-pp.__schedule
0.30 ? 4% perf-profile.children.cycles-pp.schedule
0.43 perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
2.47 ? 8% perf-profile.self.cycles-pp.intel_idle
0.11 ? 4% perf-profile.self.cycles-pp.next_uptodate_folio
0.31 ? 3% perf-profile.self.cycles-pp.lookup_name(tree_node*)
1.80 perf-profile.self.cycles-pp.malloc
estimated based on internal Intel analysis and are provided
purposes only. Any difference in system hardware or software
may affect actual performance.
om/intel/lkp-tests/wiki">https://github.com/intel/lkp-tests/wiki

2024-06-03 02:58:24

by Honglei Wang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On 2024/5/29 22:31, Chunxin Zang wrote:
>
>
>> On May 25, 2024, at 19:48, Honglei Wang <[email protected]> wrote:
>>
>>
>>
>> On 2024/5/24 21:40, Chunxin Zang wrote:
>>> I found that some tasks have been running for a long enough time and
>>> have become illegal, but they are still not releasing the CPU. This
>>> will increase the scheduling delay of other processes. Therefore, I
>>> tried checking the current process in wakeup_preempt and entity_tick,
>>> and if it is illegal, reschedule that cfs queue.
>>> The modification can reduce the scheduling delay by about 30% when
>>> RUN_TO_PARITY is enabled.
>>> So far, it has been running well in my test environment, and I have
>>> pasted some test results below.
>>> I isolated four cores for testing. I ran Hackbench in the background
>>> and observed the test results of cyclictest.
>>> hackbench -g 4 -l 100000000 &
>>> cyclictest --mlockall -D 5m -q
>>> EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY
>>> # Min Latencies: 00006 00006 00006 00006
>>> LNICE(-19) # Avg Latencies: 00191 00122 00089 00066
>>> # Max Latencies: 15442 07648 14133 07713
>>> # Min Latencies: 00006 00010 00006 00006
>>> LNICE(0) # Avg Latencies: 00466 00277 00289 00257
>>> # Max Latencies: 38917 32391 32665 17710
>>> # Min Latencies: 00019 00053 00010 00013
>>> LNICE(19) # Avg Latencies: 37151 31045 18293 23035
>>> # Max Latencies: 2688299 7031295 426196 425708
>>> I'm actually a bit hesitant about placing this modification under the
>>> NO_PARITY feature. This is because the modification conflicts with the
>>> semantics of RUN_TO_PARITY. So, I captured and compared the number of
>>> resched occurrences in wakeup_preempt to see if it introduced any
>>> additional overhead.
>>> Similarly, hackbench is used to stress the utilization of four cores to
>>> 100%, and the method for capturing the number of PREEMPT occurrences is
>>> referenced from [1].
>>> schedstats EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY CFS(6.5)
>>> stats.check_preempt_count 5053054 5057286 5003806 5018589 5031908
>>> stats.patch_cause_preempt_count ------- 858044 ------- 765726 -------
>>> stats.need_preempt_count 570520 858684 3380513 3426977 1140821
>>> From the above test results, there is a slight increase in the number of
>>> resched occurrences in wakeup_preempt. However, the results vary with each
>>> test, and sometimes the difference is not that significant. But overall,
>>> the count of reschedules remains lower than that of CFS and is much less
>>> than that of NO_PARITY.
>>> [1]: https://lore.kernel.org/all/[email protected]/T/#m52057282ceb6203318be1ce9f835363de3bef5cb
>>> Signed-off-by: Chunxin Zang <[email protected]>
>>> Reviewed-by: Chen Yang <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 6 ++++++
>>> 1 file changed, 6 insertions(+)
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>> return;
>>> #endif
>>> +
>>> + if (!entity_eligible(cfs_rq, curr))
>>> + resched_curr(rq_of(cfs_rq));
>>> }
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>> return;
>>> + if (!entity_eligible(cfs_rq, se))
>>> + goto preempt;
>>> +
>>> find_matching_se(&se, &pse);
>>> WARN_ON_ONCE(!pse);
>>>
>> Hi Chunxin,
>>
>> Did you run a comparative test to see which modification is more helpful on improve the latency? Modification at tick point makes more sense to me. But, seems just resched arbitrarily in wakeup might introduce too much preemption (and maybe more context switch?) in complex environment such as cgroup hierarchy.
>>
>> Thanks,
>> Honglei
>
> Hi Honglei
>
> I attempted to build a slightly more complex scenario. It consists of 4 isolated cores,
> 4 groups of hackbench (160 processes in total) to stress the CPU, and 1 cyclictest
> process to test scheduling latency. Using cgroup v2, to created 64 cgroup leaf nodes
> in a binary tree structure (with a depth of 7). I then evenly distributed the aforementioned
> 161 processes across the 64 cgroups respectively, and observed the scheduling delay
> performance of cyclictest.
>
> Unfortunately, the test results were very fluctuating, and the two sets of data were very
> close to each other. I suspect that it might be due to too few processes being distributed
> in each cgroup, which led to the logic for determining ineligible always succeeding and
> following the original logic. Later, I will attempt more tests to verify the impact of these
> modifications in scenarios involving multiple cgroups.
>

Sorry to lately replay, I was a bit busy last week. How's the test going
on? What about run some workload processes who spend more time in
kernel? Maybe it's worth do give a try, but it depends on your test plan.

Thanks,
Honglei

> thanks
> Chunxin
>
>

2024-06-05 17:20:09

by Chen Yu

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

Hi Prateek, Chunxin,

On 2024-05-28 at 10:32:23 +0530, K Prateek Nayak wrote:
> Hello Chunxin,
>
> On 5/28/2024 8:12 AM, Chunxin Zang wrote:
> >
> >> On May 24, 2024, at 23:30, Chen Yu <[email protected]> wrote:
> >>
> >> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
> >>> I found that some tasks have been running for a long enough time and
> >>> have become illegal, but they are still not releasing the CPU. This
> >>> will increase the scheduling delay of other processes. Therefore, I
> >>> tried checking the current process in wakeup_preempt and entity_tick,
> >>> and if it is illegal, reschedule that cfs queue.
> >>>
> >>> The modification can reduce the scheduling delay by about 30% when
> >>> RUN_TO_PARITY is enabled.
> >>> So far, it has been running well in my test environment, and I have
> >>> pasted some test results below.
> >>>
> >>
> >> Interesting, besides hackbench, I assume that you have workload in
> >> real production environment that is sensitive to wakeup latency?
> >
> > Hi Chen
> >
> > Yes, my workload are quite sensitive to wakeup latency .
> >>
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 03be0d1330a6..a0005d240db5 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> >>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
> >>> return;
> >>> #endif
> >>> +
> >>> + if (!entity_eligible(cfs_rq, curr))
> >>> + resched_curr(rq_of(cfs_rq));
> >>> }
> >>>
> >>
> >> entity_tick() -> update_curr() -> update_deadline():
> >> se->vruntime >= se->deadline ? resched_curr()
> >> only current has expired its slice will it be scheduled out.
> >>
> >> So here you want to schedule current out if its lag becomes 0.
> >>
> >> In lastest sched/eevdf branch, it is controlled by two sched features:
> >> RESPECT_SLICE: Inhibit preemption until the current task has exhausted it's slice.
> >> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
> >> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
> >>
> >> Maybe something like this can achieve your goal
> >> if (sched_feat(RUN_TOPARITY) && !entity_eligible(cfs_rq, curr))
> >> resched_curr
> >>
> >>>
> >>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> >>> return;
> >>>
> >>> + if (!entity_eligible(cfs_rq, se))
> >>> + goto preempt;
> >>> +
> >>
> >> Not sure if this is applicable, later in this function, pick_eevdf() checks
> >> if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
> >> be evicted. And this change does not consider the cgroup hierarchy.
>
> The above line will be referred to as [1] below.
>
> >>
> >> Besides, the check of current eligiblity can get false negative result,
> >> if the enqueued entity has a positive lag. Prateek proposed to
> >> remove the check of current's eligibility in pick_eevdf():
> >> https://lore.kernel.org/lkml/[email protected]/
> >
> > Thank you for letting me know about Peter's latest updates and thoughts.
> > Actually, the original intention of my modification was to minimize the
> > traversal of the rb-tree as much as possible. For example, in the following
> > scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
> > 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
> > resched, the scheduler will call 'pick_eevdf' again, traversing the
> > rb-tree once more. This ultimately results in the rb-tree being traversed
> > twice. If it's possible to determine that 'curr' is ineligible within 'wakeup_preempt'
> > and directly trigger a 'resched', it would reduce the traversal of the rb-tree
> > by one time.
> >
> >
> > wakeup_preempt-> pick_eevdf -> resched_curr
> > |->'traverse the rb-tree' |
> > schedule->pick_eevdf
> > |->'traverse the rb-tree'
>
> I see what you mean but a couple of things:
>
> (I'm adding the check_preempt_wakeup_fair() hunk from the original patch
> below for ease of interpretation)
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 03be0d1330a6..a0005d240db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> > return;
> >
> > + if (!entity_eligible(cfs_rq, se))
> > + goto preempt;
> > +
>
> This check uses the root cfs_rq since "task_cfs_rq()" returns the
> "rq->cfs" of the runqueue the task is on. In presence of cgroups or
> CONFIG_SCHED_AUTOGROUP, there is a good chance this the task is queued
> on a higher order cfs_rq and this entity_eligible() calculation might
> not be valid since the vruntime calculation for the "se" is relative to
> the "cfs_rq" where it is queued on. Please correct me if I'm wrong but
> I believe that is what Chenyu was referring to in [1].
>

Sorry for the late reply and thanks for help clarify this. Yes, this is
what my previous concern was:
1. It does not consider the cgroup and does not check preemption in the same
level which is covered by find_matching_se().
2. The if (!entity_eligible(cfs_rq, se)) for current is redundant because
later pick_eevdf() will check the eligible of current anyway. But
as pointed out by Chunxi, his concern is the double-traverse of the rb-tree,
I just wonder if we could leverage the cfs_rq->next to store the next
candidate, so it can be picked directly in the 2nd pick as a fast path?
Something like below untested:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..f716646d595e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
- struct sched_entity *se = &curr->se, *pse = &p->se;
+ struct sched_entity *se = &curr->se, *pse = &p->se, *next;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int cse_is_idle, pse_is_idle;

@@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
/*
* XXX pick_eevdf(cfs_rq) != se ?
*/
- if (pick_eevdf(cfs_rq) == pse)
+ next = pick_eevdf(cfs_rq);
+ if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
+ set_next_buddy(next);
+
+ if (next == pse)
goto preempt;

return;

thanks,
Chenyu

> > find_matching_se(&se, &pse);
> > WARN_ON_ONCE(!pse);
> >
> > --
>
> In addition to that, There is an update_curr() call below for the first
> cfs_rq where both the entities' hierarchy is queued which is found by
> find_matching_se(). I believe that is required too to update the
> vruntime and deadline of the entity where preemption can happen.
>
> If you want to circumvent a second call to pick_eevdf(), could you
> perhaps do:
>
> (Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9eb63573110c..653b1bee1e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> update_curr(cfs_rq);
>
> /*
> - * XXX pick_eevdf(cfs_rq) != se ?
> + * If the hierarchy of current task is ineligible at the common
> + * point on the newly woken entity, there is a good chance of
> + * wakeup preemption by the newly woken entity. Mark for resched
> + * and allow pick_eevdf() in schedule() to judge which task to
> + * run next.
> */
> - if (pick_eevdf(cfs_rq) == pse)
> + if (!entity_eligible(cfs_rq, se))
> goto preempt;
>
> return;
>
> --
>
> There are other implications here which is specifically highlighted by
> the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the current waking
> entity is not the entity with the earliest eligible virtual deadline,
> the current task is still preempted if any other entity has the EEVD.
>
> Mike's box gave switching to above two thumbs up; I have to check what
> my box says :)
>
> Following are DeathStarBench results with your original patch compared
> to v6.9-rc5 based tip:sched/core:
>
> ==================================================================
> Test : DeathStarBench
> Why? : Some tasks here do no like aggressive preemption
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : Mean
> ==================================================================
> Pinning scaling tip eager_preempt (pct imp)
> 1CCD 1 1.00 0.99 (%diff: -1.13%)
> 2CCD 2 1.00 0.97 (%diff: -3.21%)
> 4CCD 3 1.00 0.97 (%diff: -3.41%)
> 8CCD 6 1.00 0.97 (%diff: -3.20%)
> --
>
> I'll give the variants mentioned in the thread a try too to see if
> some of my assumptions around heavy preemption hold good. I was also
> able to dig up an old patch by Balakumaran Kannan which skipped
> pick_eevdf() altogether if "pse" is ineligible which also seems like
> a good optimization based on current check in
> check_preempt_wakeup_fair() but it perhaps doesn't help the case of
> wakeup-latency sensitivity you are optimizing for; only reduces
> rb-tree traversal if there is no chance of pick_eevdf() returning "pse"
> https://lore.kernel.org/lkml/[email protected]/
>
> --
> Thanks and Regards,
> Prateek
>
> >
> >
> > Of course, this would break the semantics of RESPECT_SLICE as well as
> > RUN_TO_PARITY. So, this might be considered a performance enhancement
> > for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
> >
> > thanks
> > Chunxin
> >
> >
> >> If I understand your requirement correctly, you want to reduce the wakeup
> >> latency. There are some codes under developed by Peter, which could
> >> customized task's wakeup latency via setting its slice:
> >> https://lore.kernel.org/lkml/[email protected]/
> >>
> >> thanks,
> >> Chenyu
>

2024-06-06 01:47:25

by Chunxin Zang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

> On Jun 6, 2024, at 01:19, Chen Yu <[email protected]> wrote:
>
> Hi Prateek, Chunxin,
>
> On 2024-05-28 at 10:32:23 +0530, K Prateek Nayak wrote:
>> Hello Chunxin,
>>
>> On 5/28/2024 8:12 AM, Chunxin Zang wrote:
>>>
>>>> On May 24, 2024, at 23:30, Chen Yu <[email protected]> wrote:
>>>>
>>>> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>>>>> I found that some tasks have been running for a long enough time and
>>>>> have become illegal, but they are still not releasing the CPU. This
>>>>> will increase the scheduling delay of other processes. Therefore, I
>>>>> tried checking the current process in wakeup_preempt and entity_tick,
>>>>> and if it is illegal, reschedule that cfs queue.
>>>>>
>>>>> The modification can reduce the scheduling delay by about 30% when
>>>>> RUN_TO_PARITY is enabled.
>>>>> So far, it has been running well in my test environment, and I have
>>>>> pasted some test results below.
>>>>>
>>>>
>>>> Interesting, besides hackbench, I assume that you have workload in
>>>> real production environment that is sensitive to wakeup latency?
>>>
>>> Hi Chen
>>>
>>> Yes, my workload are quite sensitive to wakeup latency .
>>>>
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 03be0d1330a6..a0005d240db5 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>>>> return;
>>>>> #endif
>>>>> +
>>>>> + if (!entity_eligible(cfs_rq, curr))
>>>>> + resched_curr(rq_of(cfs_rq));
>>>>> }
>>>>>
>>>>
>>>> entity_tick() -> update_curr() -> update_deadline():
>>>> se->vruntime >= se->deadline ? resched_curr()
>>>> only current has expired its slice will it be scheduled out.
>>>>
>>>> So here you want to schedule current out if its lag becomes 0.
>>>>
>>>> In lastest sched/eevdf branch, it is controlled by two sched features:
>>>> RESPECT_SLICE: Inhibit preemption until the current task has exhausted it's slice.
>>>> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>>>>
>>>> Maybe something like this can achieve your goal
>>>> if (sched_feat(RUN_TOPARITY) && !entity_eligible(cfs_rq, curr))
>>>> resched_curr
>>>>
>>>>>
>>>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>>> return;
>>>>>
>>>>> + if (!entity_eligible(cfs_rq, se))
>>>>> + goto preempt;
>>>>> +
>>>>
>>>> Not sure if this is applicable, later in this function, pick_eevdf() checks
>>>> if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
>>>> be evicted. And this change does not consider the cgroup hierarchy.
>>
>> The above line will be referred to as [1] below.
>>
>>>>
>>>> Besides, the check of current eligiblity can get false negative result,
>>>> if the enqueued entity has a positive lag. Prateek proposed to
>>>> remove the check of current's eligibility in pick_eevdf():
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Thank you for letting me know about Peter's latest updates and thoughts.
>>> Actually, the original intention of my modification was to minimize the
>>> traversal of the rb-tree as much as possible. For example, in the following
>>> scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
>>> 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
>>> resched, the scheduler will call 'pick_eevdf' again, traversing the
>>> rb-tree once more. This ultimately results in the rb-tree being traversed
>>> twice. If it's possible to determine that 'curr' is ineligible within 'wakeup_preempt'
>>> and directly trigger a 'resched', it would reduce the traversal of the rb-tree
>>> by one time.
>>>
>>>
>>> wakeup_preempt-> pick_eevdf -> resched_curr
>>> |->'traverse the rb-tree' |
>>> schedule->pick_eevdf
>>> |->'traverse the rb-tree'
>>
>> I see what you mean but a couple of things:
>>
>> (I'm adding the check_preempt_wakeup_fair() hunk from the original patch
>> below for ease of interpretation)
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>> return;
>>>
>>> + if (!entity_eligible(cfs_rq, se))
>>> + goto preempt;
>>> +
>>
>> This check uses the root cfs_rq since "task_cfs_rq()" returns the
>> "rq->cfs" of the runqueue the task is on. In presence of cgroups or
>> CONFIG_SCHED_AUTOGROUP, there is a good chance this the task is queued
>> on a higher order cfs_rq and this entity_eligible() calculation might
>> not be valid since the vruntime calculation for the "se" is relative to
>> the "cfs_rq" where it is queued on. Please correct me if I'm wrong but
>> I believe that is what Chenyu was referring to in [1].
>>
>
> Sorry for the late reply and thanks for help clarify this. Yes, this is
> what my previous concern was:
> 1. It does not consider the cgroup and does not check preemption in the same
> level which is covered by find_matching_se().
> 2. The if (!entity_eligible(cfs_rq, se)) for current is redundant because
> later pick_eevdf() will check the eligible of current anyway. But
> as pointed out by Chunxi, his concern is the double-traverse of the rb-tree,
> I just wonder if we could leverage the cfs_rq->next to store the next
> candidate, so it can be picked directly in the 2nd pick as a fast path?
> Something like below untested:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..f716646d595e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> {
> struct task_struct *curr = rq->curr;
> - struct sched_entity *se = &curr->se, *pse = &p->se;
> + struct sched_entity *se = &curr->se, *pse = &p->se, *next;
> struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> int cse_is_idle, pse_is_idle;
>
> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> /*
> * XXX pick_eevdf(cfs_rq) != se ?
> */
> - if (pick_eevdf(cfs_rq) == pse)
> + next = pick_eevdf(cfs_rq);
> + if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
> + set_next_buddy(next);
> +
> + if (next == pse)
> goto preempt;
>
> return;
>
>
> thanks,
> Chenyu

Hi Chen

First of all, thank you for your patient response. Regarding the issue of avoiding traversing
the RB-tree twice, I initially had two methods in mind.
1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
This idea is similar to the one you proposed this time.
2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair.'
Because I believe that 'checking whether preemption is necessary' and 'finding the optimal
process to schedule' are two different things. 'check_preempt_wakeup_fair' is not just to
check if the newly awakened process should preempt the current process; it can also serve
as an opportunity to check whether any other processes should preempt the current one,
thereby improving the real-time performance of the scheduler. Although now in pick_eevdf,
the legitimacy of 'curr' is also evaluated, if the result returned is not the awakened process,
then the current process will still not be preempted. Therefore, I posted the v2 PATCH.
The implementation of v2 PATCH might express this point more clearly.
https://lore.kernel.org/lkml/[email protected]/T/

I previously implemented and tested both of these methods, and the test results showed that
method 2 had somewhat more obvious benefits. Therefore, I submitted method 2. Now that I
think about it, perhaps method 1 could also be viable at the same time. :)

thanks
Chunixn

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..f67894d8fbc8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -563,6 +563,8 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
return (s64)(se->vruntime - cfs_rq->min_vruntime);
}

+static void unset_pick_cached(struct cfs_rq *cfs_rq);
+
#define __node_2_se(node) \
rb_entry((node), struct sched_entity, run_node)

@@ -632,6 +634,8 @@ avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)

cfs_rq->avg_vruntime += key * weight;
cfs_rq->avg_load += weight;
+
+ unset_pick_cached(cfs_rq);
}

static void
@@ -642,6 +646,8 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)

cfs_rq->avg_vruntime -= key * weight;
cfs_rq->avg_load -= weight;
+
+ unset_pick_cached(cfs_rq);
}

static inline
@@ -651,6 +657,8 @@ void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
* v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load
*/
cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+
+ unset_pick_cached(cfs_rq);
}

/*
@@ -745,6 +753,36 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
return vruntime_eligible(cfs_rq, se->vruntime);
}

+static struct sched_entity *try_to_get_pick_cached(struct cfs_rq* cfs_rq)
+{
+ struct sched_entity *se;
+
+ se = cfs_rq->pick_cached;
+
+ return se == NULL ? NULL : (se->on_rq ? se : NULL);
+}
+
+static void unset_pick_cached(struct cfs_rq *cfs_rq)
+{
+ cfs_rq->pick_cached = NULL;
+}
+
+static void set_pick_cached(struct sched_entity *se)
+{
+ if (!se || !se->on_rq)
+ return;
+
+ cfs_rq_of(se)->pick_cached = se;
+}
+
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
u64 min_vruntime = cfs_rq->min_vruntime;
@@ -856,6 +894,51 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
return __node_2_se(left);
}

+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
+{
+ struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+ struct sched_entity *se = __pick_first_entity(cfs_rq);
+ struct sched_entity *best = NULL;
+
+ /* Pick the leftmost entity if it's eligible */
+ if (se && entity_eligible(cfs_rq, se))
+ return se;
+
+ /* Heap search for the EEVD entity */
+ while (node) {
+ struct rb_node *left = node->rb_left;
+
+ /*
+ * Eligible entities in left subtree are always better
+ * choices, since they have earlier deadlines.
+ */
+ if (left && vruntime_eligible(cfs_rq,
+ __node_2_se(left)->min_vruntime)) {
+ node = left;
+ continue;
+ }
+
+ se = __node_2_se(node);
+
+ /*
+ * The left subtree either is empty or has no eligible
+ * entity, so check the current node since it is the one
+ * with earliest deadline that might be eligible.
+ */
+ if (entity_eligible(cfs_rq, se)) {
+ best = se;
+ break;
+ }
+
+ node = node->rb_right;
+ }
+
+ if (best)
+ set_pick_cached(best);
+
+ return best;
+}
+
/*
* Earliest Eligible Virtual Deadline First
*
@@ -877,7 +960,6 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
*/
static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
{
- struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
struct sched_entity *se = __pick_first_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
@@ -899,41 +981,13 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
return curr;

- /* Pick the leftmost entity if it's eligible */
- if (se && entity_eligible(cfs_rq, se)) {
- best = se;
- goto found;
- }
+ best = try_to_get_pick_cached(cfs_rq);
+ if (best && !entity_eligible(cfs_rq, best))
+ best = NULL;

- /* Heap search for the EEVD entity */
- while (node) {
- struct rb_node *left = node->rb_left;
-
- /*
- * Eligible entities in left subtree are always better
- * choices, since they have earlier deadlines.
- */
- if (left && vruntime_eligible(cfs_rq,
- __node_2_se(left)->min_vruntime)) {
- node = left;
- continue;
- }
-
- se = __node_2_se(node);
+ if (!best)
+ best = __pick_eevdf(cfs_rq);

- /*
- * The left subtree either is empty or has no eligible
- * entity, so check the current node since it is the one
- * with earliest deadline that might be eligible.
- */
- if (entity_eligible(cfs_rq, se)) {
- best = se;
- break;
- }
-
- node = node->rb_right;
- }
-found:
if (!best || (curr && entity_before(curr, best)))
best = curr;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..373241075449 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -597,6 +597,7 @@ struct cfs_rq {
*/
struct sched_entity *curr;
struct sched_entity *next;
+ struct sched_entity *pick_cached;

#ifdef CONFIG_SCHED_DEBUG
unsigned int nr_spread_over;
--
2.34.1

>
>>> find_matching_se(&se, &pse);
>>> WARN_ON_ONCE(!pse);
>>>
>>> --
>>
>> In addition to that, There is an update_curr() call below for the first
>> cfs_rq where both the entities' hierarchy is queued which is found by
>> find_matching_se(). I believe that is required too to update the
>> vruntime and deadline of the entity where preemption can happen.
>>
>> If you want to circumvent a second call to pick_eevdf(), could you
>> perhaps do:
>>
>> (Only build tested)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9eb63573110c..653b1bee1e62 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> update_curr(cfs_rq);
>>
>> /*
>> - * XXX pick_eevdf(cfs_rq) != se ?
>> + * If the hierarchy of current task is ineligible at the common
>> + * point on the newly woken entity, there is a good chance of
>> + * wakeup preemption by the newly woken entity. Mark for resched
>> + * and allow pick_eevdf() in schedule() to judge which task to
>> + * run next.
>> */
>> - if (pick_eevdf(cfs_rq) == pse)
>> + if (!entity_eligible(cfs_rq, se))
>> goto preempt;
>>
>> return;
>>
>> --
>>
>> There are other implications here which is specifically highlighted by
>> the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the current waking
>> entity is not the entity with the earliest eligible virtual deadline,
>> the current task is still preempted if any other entity has the EEVD.
>>
>> Mike's box gave switching to above two thumbs up; I have to check what
>> my box says :)
>>
>> Following are DeathStarBench results with your original patch compared
>> to v6.9-rc5 based tip:sched/core:
>>
>> ==================================================================
>> Test : DeathStarBench
>> Why? : Some tasks here do no like aggressive preemption
>> Units : Normalized throughput
>> Interpretation: Higher is better
>> Statistic : Mean
>> ==================================================================
>> Pinning scaling tip eager_preempt (pct imp)
>> 1CCD 1 1.00 0.99 (%diff: -1.13%)
>> 2CCD 2 1.00 0.97 (%diff: -3.21%)
>> 4CCD 3 1.00 0.97 (%diff: -3.41%)
>> 8CCD 6 1.00 0.97 (%diff: -3.20%)
>> --
>>
>> I'll give the variants mentioned in the thread a try too to see if
>> some of my assumptions around heavy preemption hold good. I was also
>> able to dig up an old patch by Balakumaran Kannan which skipped
>> pick_eevdf() altogether if "pse" is ineligible which also seems like
>> a good optimization based on current check in
>> check_preempt_wakeup_fair() but it perhaps doesn't help the case of
>> wakeup-latency sensitivity you are optimizing for; only reduces
>> rb-tree traversal if there is no chance of pick_eevdf() returning "pse"
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> --
>> Thanks and Regards,
>> Prateek
>>
>>>
>>>
>>> Of course, this would break the semantics of RESPECT_SLICE as well as
>>> RUN_TO_PARITY. So, this might be considered a performance enhancement
>>> for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
>>>
>>> thanks
>>> Chunxin
>>>
>>>
>>>> If I understand your requirement correctly, you want to reduce the wakeup
>>>> latency. There are some codes under developed by Peter, which could
>>>> customized task's wakeup latency via setting its slice:
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> thanks,
>>>> Chenyu

2024-06-06 12:40:17

by Chunxin Zang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

> On Jun 3, 2024, at 10:55, Honglei Wang <[email protected]> wrote:
>
>
>
> On 2024/5/29 22:31, Chunxin Zang wrote:
>>> On May 25, 2024, at 19:48, Honglei Wang <[email protected]> wrote:
>>>
>>>
>>>
>>> On 2024/5/24 21:40, Chunxin Zang wrote:
>>>> I found that some tasks have been running for a long enough time and
>>>> have become illegal, but they are still not releasing the CPU. This
>>>> will increase the scheduling delay of other processes. Therefore, I
>>>> tried checking the current process in wakeup_preempt and entity_tick,
>>>> and if it is illegal, reschedule that cfs queue.
>>>> The modification can reduce the scheduling delay by about 30% when
>>>> RUN_TO_PARITY is enabled.
>>>> So far, it has been running well in my test environment, and I have
>>>> pasted some test results below.
>>>> I isolated four cores for testing. I ran Hackbench in the background
>>>> and observed the test results of cyclictest.
>>>> hackbench -g 4 -l 100000000 &
>>>> cyclictest --mlockall -D 5m -q
>>>> EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY
>>>> # Min Latencies: 00006 00006 00006 00006
>>>> LNICE(-19) # Avg Latencies: 00191 00122 00089 00066
>>>> # Max Latencies: 15442 07648 14133 07713
>>>> # Min Latencies: 00006 00010 00006 00006
>>>> LNICE(0) # Avg Latencies: 00466 00277 00289 00257
>>>> # Max Latencies: 38917 32391 32665 17710
>>>> # Min Latencies: 00019 00053 00010 00013
>>>> LNICE(19) # Avg Latencies: 37151 31045 18293 23035
>>>> # Max Latencies: 2688299 7031295 426196 425708
>>>> I'm actually a bit hesitant about placing this modification under the
>>>> NO_PARITY feature. This is because the modification conflicts with the
>>>> semantics of RUN_TO_PARITY. So, I captured and compared the number of
>>>> resched occurrences in wakeup_preempt to see if it introduced any
>>>> additional overhead.
>>>> Similarly, hackbench is used to stress the utilization of four cores to
>>>> 100%, and the method for capturing the number of PREEMPT occurrences is
>>>> referenced from [1].
>>>> schedstats EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY CFS(6.5)
>>>> stats.check_preempt_count 5053054 5057286 5003806 5018589 5031908
>>>> stats.patch_cause_preempt_count ------- 858044 ------- 765726 -------
>>>> stats.need_preempt_count 570520 858684 3380513 3426977 1140821
>>>> From the above test results, there is a slight increase in the number of
>>>> resched occurrences in wakeup_preempt. However, the results vary with each
>>>> test, and sometimes the difference is not that significant. But overall,
>>>> the count of reschedules remains lower than that of CFS and is much less
>>>> than that of NO_PARITY.
>>>> [1]: https://lore.kernel.org/all/[email protected]/T/#m52057282ceb6203318be1ce9f835363de3bef5cb
>>>> Signed-off-by: Chunxin Zang <[email protected]>
>>>> Reviewed-by: Chen Yang <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 6 ++++++
>>>> 1 file changed, 6 insertions(+)
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 03be0d1330a6..a0005d240db5 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>> hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>>> return;
>>>> #endif
>>>> +
>>>> + if (!entity_eligible(cfs_rq, curr))
>>>> + resched_curr(rq_of(cfs_rq));
>>>> }
>>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>> return;
>>>> + if (!entity_eligible(cfs_rq, se))
>>>> + goto preempt;
>>>> +
>>>> find_matching_se(&se, &pse);
>>>> WARN_ON_ONCE(!pse);
>>>>
>>> Hi Chunxin,
>>>
>>> Did you run a comparative test to see which modification is more helpful on improve the latency? Modification at tick point makes more sense to me. But, seems just resched arbitrarily in wakeup might introduce too much preemption (and maybe more context switch?) in complex environment such as cgroup hierarchy.
>>>
>>> Thanks,
>>> Honglei
>> Hi Honglei
>> I attempted to build a slightly more complex scenario. It consists of 4 isolated cores,
>> 4 groups of hackbench (160 processes in total) to stress the CPU, and 1 cyclictest
>> process to test scheduling latency. Using cgroup v2, to created 64 cgroup leaf nodes
>> in a binary tree structure (with a depth of 7). I then evenly distributed the aforementioned
>> 161 processes across the 64 cgroups respectively, and observed the scheduling delay
>> performance of cyclictest.
>> Unfortunately, the test results were very fluctuating, and the two sets of data were very
>> close to each other. I suspect that it might be due to too few processes being distributed
>> in each cgroup, which led to the logic for determining ineligible always succeeding and
>> following the original logic. Later, I will attempt more tests to verify the impact of these
>> modifications in scenarios involving multiple cgroups.
>
> Sorry to lately replay, I was a bit busy last week. How's the test going on? What about run some workload processes who spend more time in kernel? Maybe it's worth do give a try, but it depends on your test plan.
>

Hi honglei

Recently, I conducted testing of multiple cgroups using version 2. Version 2 ensures the
RUN_TO_PARITY feature, so the test results are somewhat better under the
NO_RUN_TO_PARITY feature.
https://lore.kernel.org/lkml/[email protected]/T/

The testing environment I used still employed 4 cores, 4 groups of hackbench (160 processes)
and 1 cyclictest. If too many cgroups or processes are created on the 4 cores, the test
results will fluctuate severely, making it difficult to discern any differences.

The organization of cgroups was in two forms:
1. Within the same level cgroup, 10 sub-cgroups were created, with each cgroup having
an average of 16 processes.

EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY

LNICE(-19) # Avg Latencies: 00572 00347 00502 00218

LNICE(0) # Avg Latencies: 02262 02225 02442 02321

LNICE(19) # Avg Latencies: 03132 03422 03333 03489

2. In the form of a binary tree, 8 leaf cgroups were established, with a depth of 4.
On average, each cgroup had 20 processes

EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY

LNICE(-19) # Avg Latencies: 00601 00592 00510 00400

LNICE(0) # Avg Latencies: 02703 02170 02381 02126

LNICE(19) # Avg Latencies: 04773 03387 04478 03611

Based on the test results, there is a noticeable improvement in scheduling latency after
applying the patch in scenarios involving multiple cgroups.

thanks
Chunxin

> Thanks,
> Honglei
>
>> thanks
>> Chunxin

2024-06-07 02:38:53

by Chen Yu

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
>
>
> > On Jun 6, 2024, at 01:19, Chen Yu <[email protected]> wrote:
> >
> >
> > Sorry for the late reply and thanks for help clarify this. Yes, this is
> > what my previous concern was:
> > 1. It does not consider the cgroup and does not check preemption in the same
> > level which is covered by find_matching_se().
> > 2. The if (!entity_eligible(cfs_rq, se)) for current is redundant because
> > later pick_eevdf() will check the eligible of current anyway. But
> > as pointed out by Chunxi, his concern is the double-traverse of the rb-tree,
> > I just wonder if we could leverage the cfs_rq->next to store the next
> > candidate, so it can be picked directly in the 2nd pick as a fast path?
> > Something like below untested:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a5b1ae0aa55..f716646d595e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
> > static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> > {
> > struct task_struct *curr = rq->curr;
> > - struct sched_entity *se = &curr->se, *pse = &p->se;
> > + struct sched_entity *se = &curr->se, *pse = &p->se, *next;
> > struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> > int cse_is_idle, pse_is_idle;
> >
> > @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > /*
> > * XXX pick_eevdf(cfs_rq) != se ?
> > */
> > - if (pick_eevdf(cfs_rq) == pse)
> > + next = pick_eevdf(cfs_rq);
> > + if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
> > + set_next_buddy(next);
> > +
> > + if (next == pse)
> > goto preempt;
> >
> > return;
> >
> >
> > thanks,
> > Chenyu
>
> Hi Chen
>
> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
> the RB-tree twice, I initially had two methods in mind.
> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
> This idea is similar to the one you proposed this time.
> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair.'
> Because I believe that 'checking whether preemption is necessary' and 'finding the optimal
> process to schedule' are two different things.

I agree, and it seems that in current eevdf implementation the former relies on the latter.

> 'check_preempt_wakeup_fair' is not just to
> check if the newly awakened process should preempt the current process; it can also serve
> as an opportunity to check whether any other processes should preempt the current one,
> thereby improving the real-time performance of the scheduler. Although now in pick_eevdf,
> the legitimacy of 'curr' is also evaluated, if the result returned is not the awakened process,
> then the current process will still not be preempted.

I thought Mike has proposed a patch to deal with this scenario you mentioned above:
https://lore.kernel.org/lkml/[email protected]/

And I suppose you are refering to increase the preemption chance on current rather than reducing
the invoke of pick_eevdf() in check_preempt_wakeup_fair().

> Therefore, I posted the v2 PATCH.
> The implementation of v2 PATCH might express this point more clearly.
> https://lore.kernel.org/lkml/[email protected]/T/
>

Let me take a look at it and do some tests.

> I previously implemented and tested both of these methods, and the test results showed that
> method 2 had somewhat more obvious benefits. Therefore, I submitted method 2. Now that I
> think about it, perhaps method 1 could also be viable at the same time. :)
>

Actually I found that, even without any changes, if we enabled sched feature NEXT_BUDDY, the
wakeup latency/request latency are both reduced. The following is the schbench result on a
240 CPUs system:

NO_NEXT_BUDDY
Wakeup Latencies percentiles (usec) runtime 100 (s) (1698990 total samples)
50.0th: 6 (429125 samples)
90.0th: 14 (682355 samples)
* 99.0th: 29 (126695 samples)
99.9th: 529 (14603 samples)
min=1, max=4741
Request Latencies percentiles (usec) runtime 100 (s) (1702523 total samples)
50.0th: 14992 (550939 samples)
90.0th: 15376 (668687 samples)
* 99.0th: 15600 (128111 samples)
99.9th: 15888 (11238 samples)
min=3528, max=31677
RPS percentiles (requests) runtime 100 (s) (101 total samples)
20.0th: 16864 (31 samples)
* 50.0th: 16928 (26 samples)
90.0th: 17248 (36 samples)
min=16615, max=20041
average rps: 17025.23

NEXT_BUDDY
Wakeup Latencies percentiles (usec) runtime 100 (s) (1653564 total samples)
50.0th: 5 (376845 samples)
90.0th: 12 (632075 samples)
* 99.0th: 24 (114398 samples)
99.9th: 105 (13737 samples)
min=1, max=7428
Request Latencies percentiles (usec) runtime 100 (s) (1657268 total samples)
50.0th: 14480 (524763 samples)
90.0th: 15216 (647982 samples)
* 99.0th: 15472 (130730 samples)
99.9th: 15728 (13980 samples)
min=3542, max=34805
RPS percentiles (requests) runtime 100 (s) (101 total samples)
20.0th: 16544 (62 samples)
* 50.0th: 16544 (0 samples)
90.0th: 16608 (37 samples)
min=16470, max=16648
average rps: 16572.68

So I think NEXT_BUDDY has more or less reduced the rb-tree scan.

thanks,
Chenyu

2024-06-11 12:21:51

by Honglei Wang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On 2024/6/6 20:39, Chunxin Zang wrote:

>
> Hi honglei
>
> Recently, I conducted testing of multiple cgroups using version 2. Version 2 ensures the
> RUN_TO_PARITY feature, so the test results are somewhat better under the
> NO_RUN_TO_PARITY feature.
> https://lore.kernel.org/lkml/[email protected]/T/
>
> The testing environment I used still employed 4 cores, 4 groups of hackbench (160 processes)
> and 1 cyclictest. If too many cgroups or processes are created on the 4 cores, the test
> results will fluctuate severely, making it difficult to discern any differences.
>
> The organization of cgroups was in two forms:
> 1. Within the same level cgroup, 10 sub-cgroups were created, with each cgroup having
> an average of 16 processes.
>
> EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY
>
> LNICE(-19) # Avg Latencies: 00572 00347 00502 00218
>
> LNICE(0) # Avg Latencies: 02262 02225 02442 02321
>
> LNICE(19) # Avg Latencies: 03132 03422 03333 03489
>
> 2. In the form of a binary tree, 8 leaf cgroups were established, with a depth of 4.
> On average, each cgroup had 20 processes
>
> EEVDF PATCH EEVDF-NO_PARITY PATCH-NO_PARITY
>
> LNICE(-19) # Avg Latencies: 00601 00592 00510 00400
>
> LNICE(0) # Avg Latencies: 02703 02170 02381 02126
>
> LNICE(19) # Avg Latencies: 04773 03387 04478 03611
>
> Based on the test results, there is a noticeable improvement in scheduling latency after
> applying the patch in scenarios involving multiple cgroups.
>
>
> thanks
> Chunxin
>
Hi Chunxin,

Thanks for sharing the test result. It looks helpful at least in this
cgroups scenario. I'm still curious which point of the two changes helps
more in your test, just as mentioned at the very first mail of this thread.

Thanks,
Honglei

2024-06-11 13:20:36

by Chunxin Zang

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

> On Jun 7, 2024, at 10:38, Chen Yu <[email protected]> wrote:
>
> On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
>>
>>
>>> On Jun 6, 2024, at 01:19, Chen Yu <[email protected]> wrote:
>>>
>>>
>>> Sorry for the late reply and thanks for help clarify this. Yes, this is
>>> what my previous concern was:
>>> 1. It does not consider the cgroup and does not check preemption in the same
>>> level which is covered by find_matching_se().
>>> 2. The if (!entity_eligible(cfs_rq, se)) for current is redundant because
>>> later pick_eevdf() will check the eligible of current anyway. But
>>> as pointed out by Chunxi, his concern is the double-traverse of the rb-tree,
>>> I just wonder if we could leverage the cfs_rq->next to store the next
>>> candidate, so it can be picked directly in the 2nd pick as a fast path?
>>> Something like below untested:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 8a5b1ae0aa55..f716646d595e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
>>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>> {
>>> struct task_struct *curr = rq->curr;
>>> - struct sched_entity *se = &curr->se, *pse = &p->se;
>>> + struct sched_entity *se = &curr->se, *pse = &p->se, *next;
>>> struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>>> int cse_is_idle, pse_is_idle;
>>>
>>> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> /*
>>> * XXX pick_eevdf(cfs_rq) != se ?
>>> */
>>> - if (pick_eevdf(cfs_rq) == pse)
>>> + next = pick_eevdf(cfs_rq);
>>> + if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
>>> + set_next_buddy(next);
>>> +
>>> + if (next == pse)
>>> goto preempt;
>>>
>>> return;
>>>
>>>
>>> thanks,
>>> Chenyu
>>
>> Hi Chen
>>
>> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
>> the RB-tree twice, I initially had two methods in mind.
>> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
>> This idea is similar to the one you proposed this time.
>> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair.'
>> Because I believe that 'checking whether preemption is necessary' and 'finding the optimal
>> process to schedule' are two different things.
>
> I agree, and it seems that in current eevdf implementation the former relies on the latter.
>
>> 'check_preempt_wakeup_fair' is not just to
>> check if the newly awakened process should preempt the current process; it can also serve
>> as an opportunity to check whether any other processes should preempt the current one,
>> thereby improving the real-time performance of the scheduler. Although now in pick_eevdf,
>> the legitimacy of 'curr' is also evaluated, if the result returned is not the awakened process,
>> then the current process will still not be preempted.
>
> I thought Mike has proposed a patch to deal with this scenario you mentioned above:
> https://lore.kernel.org/lkml/[email protected]/
>
> And I suppose you are refering to increase the preemption chance on current rather than reducing
> the invoke of pick_eevdf() in check_preempt_wakeup_fair().

Hi chen

Happy holidays. I believe the modifications here will indeed provide more opportunities for preemption,
thereby leading to lower scheduling latencies, while also truly reducing calls to pick_eevdf. It's a win-win situation. :)

I conducted a test. It involved applying my modifications on top of MIKE PATCH, along with
adding some statistical counts following your previous method, in order to assess the potential
benefits of my changes.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..c5453866899f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8283,6 +8286,10 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int cse_is_idle, pse_is_idle;
+ bool patch_preempt = false;
+ bool pick_preempt = false;
+
+ schedstat_inc(rq->check_preempt_count);

if (unlikely(se == pse))
return;
@@ -8343,15 +8350,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
cfs_rq = cfs_rq_of(se);
update_curr(cfs_rq);

+ if ((sched_feat(RUN_TO_PARITY) && se->vlag != se->deadline && !entity_eligible(cfs_rq, se))
+ || (!sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, se))) {
+ schedstat_inc(rq->patch_preempt_count);
+ patch_preempt = true;
+ }
+
/*
* XXX pick_eevdf(cfs_rq) != se ?
*/
- if (pick_eevdf(cfs_rq) == pse)
+ if (pick_eevdf(cfs_rq) != se) {
+ schedstat_inc(rq->pick_preempt_count);
+ pick_preempt = true;
goto preempt;
+ }

return;

preempt:
+ if (patch_preempt && !pick_preempt)
+ schedstat_inc(rq->patch_preempt_only_count);
+ if (!patch_preempt && pick_preempt)
+ schedstat_inc(rq->pick_preempt_only_count);
+
+ schedstat_inc(rq->need_preempt_count);
+
resched_curr(rq);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..002c6b0f966a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1141,6 +1141,12 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+ unsigned int check_preempt_count;
+ unsigned int need_preempt_count;
+ unsigned int patch_preempt_count;
+ unsigned int patch_preempt_only_count;
+ unsigned int pick_preempt_count;
+ unsigned int pick_preempt_only_count;
#endif

#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..fe5487572409 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,21 @@ static int show_schedstat(struct seq_file *seq, void *v)

/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu *** %u %u * %u %u * %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->check_preempt_count,
+ rq->need_preempt_count,
+ rq->patch_preempt_count,
+ rq->patch_preempt_only_count,
+ rq->pick_preempt_count,
+ rq->pick_preempt_only_count);
+

seq_printf(seq, "\n");

The test results are as follows:

RUN_TO_PARITY:
EEVDF PATCH
.stat.check_preempt_count 5053054 5029546
.stat.need_preempt_count 0570520 1282780
.stat.patch_preempt_count ------- 0038602
.stat.patch_preempt_only_count ------- 0000000
.stat.pick_preempt_count ------- 1282780
.stat.pick_preempt_only_count ------- 1244178

NO_RUN_TO_PARITY:
EEVDF PATCH
.stat.check_preempt_count 5018589 5005812
.stat.need_preempt_count 3380513 2994773
.stat.patch_preempt_count ------- 0907927
.stat.patch_preempt_only_count ------- 0000000
.stat.pick_preempt_count ------- 2994773
.stat.pick_preempt_only_count ------- 2086846

Looking at the results, adding an ineligible check for the se within check_preempt_wakeup_fair
can prevent 3% of pick_eevdf calls under the RUN_TO_PARITY feature, and in the case of
NO_RUN_TO_PARITY, it can prevent 30% of pick_eevdf calls. It was also discovered that the
patch_preempt_only_count is at 0, indicating that all invalid checks for the se are correct.

It's worth mentioning that under the RUN_TO_PARITY feature, the number of preemptions
triggered by 'pick_eevdf != se' would be 2.25 times that of the original version, which could
lead to a series of other performance issues. However, logically speaking, this is indeed reasonable. :(

>
>> Therefore, I posted the v2 PATCH.
>> The implementation of v2 PATCH might express this point more clearly.
>> https://lore.kernel.org/lkml/[email protected]/T/
>>
>
> Let me take a look at it and do some tests.

Thank you for doing this :)

>
>> I previously implemented and tested both of these methods, and the test results showed that
>> method 2 had somewhat more obvious benefits. Therefore, I submitted method 2. Now that I
>> think about it, perhaps method 1 could also be viable at the same time. :)
>>
>
> Actually I found that, even without any changes, if we enabled sched feature NEXT_BUDDY, the
> wakeup latency/request latency are both reduced. The following is the schbench result on a
> 240 CPUs system:
>
> NO_NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1698990 total samples)
> 50.0th: 6 (429125 samples)
> 90.0th: 14 (682355 samples)
> * 99.0th: 29 (126695 samples)
> 99.9th: 529 (14603 samples)
> min=1, max=4741
> Request Latencies percentiles (usec) runtime 100 (s) (1702523 total samples)
> 50.0th: 14992 (550939 samples)
> 90.0th: 15376 (668687 samples)
> * 99.0th: 15600 (128111 samples)
> 99.9th: 15888 (11238 samples)
> min=3528, max=31677
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
> 20.0th: 16864 (31 samples)
> * 50.0th: 16928 (26 samples)
> 90.0th: 17248 (36 samples)
> min=16615, max=20041
> average rps: 17025.23
>
> NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1653564 total samples)
> 50.0th: 5 (376845 samples)
> 90.0th: 12 (632075 samples)
> * 99.0th: 24 (114398 samples)
> 99.9th: 105 (13737 samples)
> min=1, max=7428
> Request Latencies percentiles (usec) runtime 100 (s) (1657268 total samples)
> 50.0th: 14480 (524763 samples)
> 90.0th: 15216 (647982 samples)
> * 99.0th: 15472 (130730 samples)
> 99.9th: 15728 (13980 samples)
> min=3542, max=34805
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
> 20.0th: 16544 (62 samples)
> * 50.0th: 16544 (0 samples)
> 90.0th: 16608 (37 samples)
> min=16470, max=16648
> average rps: 16572.68
>
> So I think NEXT_BUDDY has more or less reduced the rb-tree scan.
>
> thanks,
> Chenyu

I'm not completely sure if my understanding is correct, but NEXT_BUDDY can only cache the process
that has been woken up; it doesn't necessarily correspond to the result returned by pick_eevdf. Furthermore,
even if it does cache the result returned by pick_eevdf, by the time the next scheduling occurs, due to
other processes enqueing or dequeuing, it might not be the result picked by pick_eevdf at that moment.
Hence, it's a 'best effort' approach, and therefore, its impact on scheduling latency may vary depending
on the use case.

thanks
Chunxin

2024-06-13 11:47:32

by Chen Yu

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

On 2024-06-11 at 21:10:50 +0800, Chunxin Zang wrote:
>
>
> > On Jun 7, 2024, at 10:38, Chen Yu <[email protected]> wrote:
> >
> > On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
> >>
> >>
> >>> On Jun 6, 2024, at 01:19, Chen Yu <[email protected]> wrote:
> >>>
> >>>
> >>> Sorry for the late reply and thanks for help clarify this. Yes, this is
> >>> what my previous concern was:
> >>> 1. It does not consider the cgroup and does not check preemption in the same
> >>> level which is covered by find_matching_se().
> >>> 2. The if (!entity_eligible(cfs_rq, se)) for current is redundant because
> >>> later pick_eevdf() will check the eligible of current anyway. But
> >>> as pointed out by Chunxi, his concern is the double-traverse of the rb-tree,
> >>> I just wonder if we could leverage the cfs_rq->next to store the next
> >>> candidate, so it can be picked directly in the 2nd pick as a fast path?
> >>> Something like below untested:
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 8a5b1ae0aa55..f716646d595e 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
> >>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> >>> {
> >>> struct task_struct *curr = rq->curr;
> >>> - struct sched_entity *se = &curr->se, *pse = &p->se;
> >>> + struct sched_entity *se = &curr->se, *pse = &p->se, *next;
> >>> struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> >>> int cse_is_idle, pse_is_idle;
> >>>
> >>> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >>> /*
> >>> * XXX pick_eevdf(cfs_rq) != se ?
> >>> */
> >>> - if (pick_eevdf(cfs_rq) == pse)
> >>> + next = pick_eevdf(cfs_rq);
> >>> + if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
> >>> + set_next_buddy(next);
> >>> +
> >>> + if (next == pse)
> >>> goto preempt;
> >>>
> >>> return;
> >>>
> >>>
> >>> thanks,
> >>> Chenyu
> >>
> >> Hi Chen
> >>
> >> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
> >> the RB-tree twice, I initially had two methods in mind.
> >> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
> >> This idea is similar to the one you proposed this time.
> >> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair.'
> >> Because I believe that 'checking whether preemption is necessary' and 'finding the optimal
> >> process to schedule' are two different things.
> >
> > I agree, and it seems that in current eevdf implementation the former relies on the latter.
> >
> >> 'check_preempt_wakeup_fair' is not just to
> >> check if the newly awakened process should preempt the current process; it can also serve
> >> as an opportunity to check whether any other processes should preempt the current one,
> >> thereby improving the real-time performance of the scheduler. Although now in pick_eevdf,
> >> the legitimacy of 'curr' is also evaluated, if the result returned is not the awakened process,
> >> then the current process will still not be preempted.
> >
> > I thought Mike has proposed a patch to deal with this scenario you mentioned above:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > And I suppose you are refering to increase the preemption chance on current rather than reducing
> > the invoke of pick_eevdf() in check_preempt_wakeup_fair().
>
> Hi chen
>
> Happy holidays. I believe the modifications here will indeed provide more opportunities for preemption,
> thereby leading to lower scheduling latencies, while also truly reducing calls to pick_eevdf. It's a win-win situation. :)
>
> I conducted a test. It involved applying my modifications on top of MIKE PATCH, along with
> adding some statistical counts following your previous method, in order to assess the potential
> benefits of my changes.
>

[snip]

> Looking at the results, adding an ineligible check for the se within check_preempt_wakeup_fair
> can prevent 3% of pick_eevdf calls under the RUN_TO_PARITY feature, and in the case of
> NO_RUN_TO_PARITY, it can prevent 30% of pick_eevdf calls. It was also discovered that the
> patch_preempt_only_count is at 0, indicating that all invalid checks for the se are correct.
>
> It's worth mentioning that under the RUN_TO_PARITY feature, the number of preemptions
> triggered by 'pick_eevdf != se' would be 2.25 times that of the original version, which could
> lead to a series of other performance issues. However, logically speaking, this is indeed reasonable. :(
>
>

I wonder if we can only do this for NO_RUN_TO_PARITY? That is to say, if RUN_TO_PARITY is enabled,
we do not preempt the current task based on its eligibility in check_preempt_wakeup_fair()
or entity_tick(). Personally I don't have objection to increase the preemption a little bit, however
it seems that we have encountered over-scheduling and that is why RUN_TO_PARITY was introduced,
and RUN_TO_PARITY means "respect the slice" per my understanding.

> > So I think NEXT_BUDDY has more or less reduced the rb-tree scan.
> >
> > thanks,
> > Chenyu
>
> I'm not completely sure if my understanding is correct, but NEXT_BUDDY can only cache the process
> that has been woken up; it doesn't necessarily correspond to the result returned by pick_eevdf. Furthermore,
> even if it does cache the result returned by pick_eevdf, by the time the next scheduling occurs, due to
> other processes enqueing or dequeuing, it might not be the result picked by pick_eevdf at that moment.
> Hence, it's a 'best effort' approach, and therefore, its impact on scheduling latency may vary depending
> on the use case.
>

That is true, currently the NEXT_BUDDY is set to the wakee if it is eligible, not mean it is the best
candidate in the tree. I think it is 'best effort' to reduce the wakeup latency rather than fairness.

thanks,
Chenyu