2023-04-17 08:12:36

by Chris Mason

Subject: schbench v1.0

Hi everyone,

Since we've been doing a lot of scheduler benchmarking lately, I wanted
to dust off schbench and see if I could make it more accurately model
the results we're seeing from production workloads.

I've reworked a few things and since it's somewhat different now I went
ahead and tagged v1.0:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

I also tossed in a README.md, which documents the arguments.

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/tree/README.md

The original schbench focused almost entirely on wakeup latencies, which
are still included in the output. Instead of spinning for a fixed
amount of wall time, v1.0 now uses a loop of matrix multiplication to
simulate a web request.
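
For anyone who hasn't looked at the code, the per-request CPU work is a
simple multiply over matrices sized from the cache footprint. A
simplified sketch of its shape (not the exact schbench code; the names
here are made up):

#include <stdlib.h>

struct work_matrix {
	unsigned long *data;
	unsigned long size;
};

/*
 * schbench derives the matrix size from the cache footprint (-F),
 * roughly sqrt(footprint_bytes / 3 / sizeof(unsigned long)), so the
 * three matrices together fill the requested footprint.
 */
static struct work_matrix *alloc_matrix(unsigned long size)
{
	struct work_matrix *m = calloc(1, sizeof(*m));

	m->size = size;
	m->data = calloc(size * size, sizeof(unsigned long));
	return m;
}

static void matrix_multiply(struct work_matrix *a, struct work_matrix *b,
			    struct work_matrix *dst)
{
	unsigned long n = a->size;

	for (unsigned long i = 0; i < n; i++) {
		for (unsigned long j = 0; j < n; j++) {
			unsigned long sum = 0;

			for (unsigned long k = 0; k < n; k++)
				sum += a->data[i * n + k] * b->data[k * n + j];
			dst->data[i * n + j] = sum;
		}
	}
}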

David Vernet recently benchmarked EEVDF, CFS, and sched_ext against
production workloads:

https://lore.kernel.org/lkml/20230411020945.GA65214@maniforge/

And what we see in general is that involuntary context switches trigger
a basket of expensive interactions between CPU/memory/disk. This is
pretty difficult to model from a benchmark targeting just the scheduler,
so instead of making a much bigger simulation of the workload, I made
preemption more expensive inside of schbench. In terms of performance he found:

EEVDF < CFS < CFS shared wake queue < sched_ext BPF

My runs with schbench match his percentage differences pretty closely.

The least complicated way I could find to penalize preemption is to use
a per-cpu spinlock around the matrix math. This can be disabled with
(-L/--no-locking). The results map really well to our production
workloads, which don't use spinlocks, but do get hit with major page
faults when they lose the CPU in the middle of a request.
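
The idea is that each worker takes the lock for the CPU it happens to be
running on around the matrix math, so a preemption while the lock is
held turns into stalls for other workers that land on the same CPU. A
rough sketch of the scheme (pthread spinlocks and these names are for
illustration only, not the exact schbench code):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/sysinfo.h>

struct per_cpu_lock {
	pthread_spinlock_t lock;
};

static struct per_cpu_lock *per_cpu_locks;
static int num_cpu_locks;

static void setup_per_cpu_locks(void)
{
	num_cpu_locks = get_nprocs();
	per_cpu_locks = calloc(num_cpu_locks, sizeof(*per_cpu_locks));

	for (int i = 0; i < num_cpu_locks; i++)
		pthread_spin_init(&per_cpu_locks[i].lock, PTHREAD_PROCESS_PRIVATE);
}

/* stand-in for the matrix multiplication */
static void do_matrix_work(void)
{
	volatile unsigned long sum = 0;

	for (unsigned long i = 0; i < 1000000; i++)
		sum += i;
}

static void locked_matrix_work(void)
{
	struct per_cpu_lock *p = &per_cpu_locks[sched_getcpu()];

	/*
	 * Losing the CPU while the lock is held makes every other worker
	 * that wants this CPU's lock wait until we run again, which is
	 * the preemption penalty being modeled.
	 */
	pthread_spin_lock(&p->lock);
	do_matrix_work();
	pthread_spin_unlock(&p->lock);
}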

David has more schbench examples for his presentation at OSPM, but
here's some annotated output:

schbench -F128 -n 10
Wakeup Latencies percentiles (usec) runtime 90 (s) (370488 total samples)
50.0th: 9 (69381 samples)
90.0th: 24 (134753 samples)
* 99.0th: 1266 (32796 samples)
99.9th: 4712 (3322 samples)
min=1, max=12449

This is basically the important part of the original schbench. It's the
time from when a worker thread is woken to when it starts running.

Request Latencies percentiles (usec) runtime 90 (s) (370983 total samples)
50.0th: 11440 (103738 samples)
90.0th: 12496 (120020 samples)
* 99.0th: 22304 (32498 samples)
99.9th: 26336 (3308 samples)
min=5818, max=57747

RPS percentiles (requests) runtime 90 (s) (9 total samples)
20.0th: 4312 (3 samples)
* 50.0th: 4376 (3 samples)
90.0th: 4440 (3 samples)
min=4290, max=4446

Request latency and RPS are both new. The original schbench had
requests, but they were based on wall clock spinning instead of a fixed
amount of CPU work. The new requests include two small usleep() calls and
the matrix math in their timing.
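
Concretely, one request looks roughly like the sketch below: take a
timestamp, sleep briefly, do the (optionally locked) matrix math, sleep
briefly again, and record the elapsed time. The sleep lengths and names
are placeholders rather than the exact schbench values:

#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000;
}

/* stand-in for the (locked) matrix math described above */
static void do_request_work(void)
{
	volatile unsigned long sum = 0;

	for (unsigned long i = 0; i < 1000000; i++)
		sum += i;
}

static uint64_t one_request(void)
{
	uint64_t start = now_usec();

	usleep(100);		/* first small sleep */
	do_request_work();	/* fixed amount of CPU work */
	usleep(100);		/* second small sleep */

	/* this delta is what feeds the request latency percentiles */
	return now_usec() - start;
}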

Generally for production the 99th percentile latencies are most
important. For RPS, I watch the 20th and 50th percentiles more. The README
linked above talks through the command line options and how to pick
good numbers.

I did some runs with different parameters comparing Linus git and EEVDF:

Comparing EEVDF (8c59a975d5ee) With Linus 6.3-rc6ish (a7a55e27ad72)

schbench -F128 -N <val> with and without -L
Single-socket Intel Cooper Lake CPUs, turbo disabled

F128 N1 EEVDF Linus
Wakeup (usec): 99.0th: 355 555
Request (usec): 99.0th: 2,620 1,906
RPS (count): 50.0th: 37,696 41,664

F128 N1 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 295 545
Request (usec): 99.0th: 1,890 1,758
RPS (count): 50.0th: 37,824 41,920

F128 N10 EEVDF Linus
Wakeup (usec): 99.0th: 755 1,266
Request (usec): 99.0th: 25,632 22,304
RPS (count): 50.0th: 4,280 4,376

F128 N10 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 823 1,118
Request (usec): 99.0th: 17,184 14,192
RPS (count): 50.0th: 4,440 4,456

F128 N20 EEVDF Linus
Wakeup (usec): 99.0th: 901 1,806
Request (usec): 99.0th: 51,136 46,016
RPS (count): 50.0th: 2,132 2,196

F128 N20 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 905 1,902
Request (usec): 99.0th: 32,832 30,496
RPS (count): 50.0th: 2,212 2,212

In general this shows us that EEVDF is a huge improvement on wakeup
latency, but we pay for it with preemptions during the request itself.
Diving into the F128 N10 no-locking numbers:

F128 N10 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 823 1,118
Request (usec): 99.0th: 17,184 14,192
RPS (count): 50.0th: 4,440 4,456

EEVDF is very close in terms of RPS. The p99 request latency shows the
preemptions pretty well, but the p50 request latency numbers have EEVDF
winning slightly (11,376 usec eevdf vs 11,408 usec on -linus).

-chris


2023-04-20 15:22:48

by Peter Zijlstra

Subject: Re: schbench v1.0

On Mon, Apr 17, 2023 at 10:10:25AM +0200, Chris Mason wrote:

> F128 N10 EEVDF Linus
> Wakeup (usec): 99.0th: 755 1,266
> Request (usec): 99.0th: 25,632 22,304
> RPS (count): 50.0th: 4,280 4,376
>
> F128 N10 no-locking EEVDF Linus
> Wakeup (usec): 99.0th: 823 1,118
> Request (usec): 99.0th: 17,184 14,192
> RPS (count): 50.0th: 4,440 4,456

With the below fixlet (against queue/sched/eevdf) on my measly IVB-EP
(2*10*2):

./schbench -F128 -n10 -C

Request Latencies percentiles (usec) runtime 30 (s) (153800 total samples)
90.0th: 6376 (35699 samples)
* 99.0th: 6440 (9055 samples)
99.9th: 7048 (1345 samples)

CFS

schbench -m2 -F128 -n10 -r90 OTHER BATCH
Wakeup (usec): 99.0th: 6600 6328
Request (usec): 99.0th: 35904 14640
RPS (count): 50.0th: 5368 6104

EEVDF base_slice = 3000[us] (default)

schbench -m2 -F128 -n10 -r90 OTHER BATCH
Wakeup (usec): 99.0th: 3820 6968
Request (usec): 99.0th: 30496 24608
RPS (count): 50.0th: 3836 5496

EEVDF base_slice = 6440[us] (per the calibrate run)

schbench -m2 -F128 -n10 -r90 OTHER BATCH
Wakeup (usec): 99.0th: 9136 6232
Request (usec): 99.0th: 21984 12944
RPS (count): 50.0th: 4968 6184


With base_slice >= request and BATCH (disables wakeup preemption), the
EEVDF thing should turn into FIFO-queue, which is close to ideal for
your workload.

For giggles:

echo 6440000 > /debug/sched/base_slice_ns
echo NO_PLACE_LAG > /debug/sched/features
chrt -b 0 ./schbench -m2 -F128 -n10 -r90

gets me:

Wakeup Latencies percentiles (usec) runtime 90 (s) (526553 total samples)
50.0th: 2084 (158080 samples)
90.0th: 5320 (210675 samples)
* 99.0th: 6232 (47643 samples)
99.9th: 6648 (4297 samples)
min=1, max=13105
Request Latencies percentiles (usec) runtime 90 (s) (526673 total samples)
50.0th: 7544 (157171 samples)
90.0th: 10992 (210461 samples)
* 99.0th: 12944 (48069 samples)
99.9th: 15088 (3716 samples)
min=3841, max=32882
RPS percentiles (requests) runtime 90 (s) (9 total samples)
20.0th: 6184 (9 samples)
* 50.0th: 6184 (0 samples)
90.0th: 6184 (0 samples)
min=6173, max=6180
average rps: 6195.77

FWIW, your RPS stats are broken, note how all the buckets are over the
max value and the average is too.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 050e98c97ba3..931102b00786 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1071,6 +1071,8 @@ void set_latency_fair(struct sched_entity *se, int prio)
se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
}

+static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
+
/*
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
@@ -1084,6 +1086,14 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* EEVDF: vd_i = ve_i + r_i / w_i
*/
se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+ /*
+ * The task has consumed its request, reschedule.
+ */
+ if (cfs_rq->nr_running > 1) {
+ resched_curr(rq_of(cfs_rq));
+ clear_buddies(cfs_rq, se);
+ }
}

#include "pelt.h"
@@ -3636,6 +3646,13 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
* we need to scale se->vlag when w_i changes.
*/
se->vlag = div_s64(se->vlag * old_weight, weight);
+ } else {
+ /*
+ * When the weight changes the virtual time slope changes and
+ * we should adjust the virtual deadline. For now, punt and
+ * simply reset.
+ */
+ se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
}

#ifdef CONFIG_SMP
@@ -5225,22 +5256,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_idle_cfs_rq_clock_pelt(cfs_rq);
}

-/*
- * Preempt the current task with a newly woken task if needed:
- */
-static void
-check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
- if (pick_eevdf(cfs_rq) != curr) {
- resched_curr(rq_of(cfs_rq));
- /*
- * The current task ran long enough, ensure it doesn't get
- * re-elected due to buddy favours.
- */
- clear_buddies(cfs_rq, curr);
- }
-}
-
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -5353,9 +5384,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
return;
#endif
-
- if (cfs_rq->nr_running > 1)
- check_preempt_tick(cfs_rq, curr);
}


2023-04-20 19:07:20

by Peter Zijlstra

Subject: Re: schbench v1.0

On Thu, Apr 20, 2023 at 05:05:37PM +0200, Peter Zijlstra wrote:

> EEVDF base_slice = 3000[us] (default)
>
> schbench -m2 -F128 -n10 -r90 OTHER BATCH
> Wakeup (usec): 99.0th: 3820 6968
> Request (usec): 99.0th: 30496 24608
> RPS (count): 50.0th: 3836 5496
>
> EEVDF base_slice = 6440[us] (per the calibrate run)
>
> schbench -m2 -F128 -n10 -r90 OTHER BATCH
> Wakeup (usec): 99.0th: 9136 6232
> Request (usec): 99.0th: 21984 12944
> RPS (count): 50.0th: 4968 6184
>
>
> With base_slice >= request and BATCH (disables wakeup preemption), the
> EEVDF thing should turn into FIFO-queue, which is close to ideal for
> your workload.
>
> For giggles:
>
> echo 6440000 > /debug/sched/base_slice_ns
> echo NO_PLACE_LAG > /debug/sched/features
> chrt -b 0 ./schbench -m2 -F128 -n10 -r90

FWIW a similar request size can be achieved by using latency-nice-5

latency-nice-4 gives 3000*1024/526 ~ 5840[us], while
latency-nice-5 gives 3000*1024/423 ~ 7262[us].

Which of course raises the question of whether, instead of latency-nice,
we should expose sched_attr::slice (with some suitable bounds).

The immediate problem of course being that while latency-nice is nice
(harhar, the pun) and vague, sched_attr::slice is fairly well defined.
OTOH as per this example, it might be easier for software to request a
specific slice length (based on prior runs etc.) than it is to guess at
a nice value.

The direct correlation between smaller slice and latency might not be
immediately obvious either, nor might it be a given for any given
scheduling policy.

Also, cgroups :/

2023-04-21 18:21:56

by Chris Mason

Subject: Re: schbench v1.0

On 4/20/23 11:05 AM, Peter Zijlstra wrote:
> On Mon, Apr 17, 2023 at 10:10:25AM +0200, Chris Mason wrote:
>
>> F128 N10 EEVDF Linus
>> Wakeup (usec): 99.0th: 755 1,266
>> Request (usec): 99.0th: 25,632 22,304
>> RPS (count): 50.0th: 4,280 4,376
>>
>> F128 N10 no-locking EEVDF Linus
>> Wakeup (usec): 99.0th: 823 1,118
>> Request (usec): 99.0th: 17,184 14,192
>> RPS (count): 50.0th: 4,440 4,456
>
> With the below fixlet (against queue/sched/eevdf) on my measly IVB-EP
> (2*10*2):
>
> ./schbench -F128 -n10 -C
>
> Request Latencies percentiles (usec) runtime 30 (s) (153800 total samples)
> 90.0th: 6376 (35699 samples)
> * 99.0th: 6440 (9055 samples)
> 99.9th: 7048 (1345 samples)
>
> CFS
>
> schbench -m2 -F128 -n10 -r90 OTHER BATCH
> Wakeup (usec): 99.0th: 6600 6328
> Request (usec): 99.0th: 35904 14640
> RPS (count): 50.0th: 5368 6104
>

Peter and I went back and forth a bit and now schbench git has a few fixes:

- README.md updated

- warmup time defaults to zero (disabling warmup). This was causing the
stats inconsistency Peter noticed below.

- RPS calculated more often. Every second instead of every reporting
interval.

- thread count scaled to CPU count when -m is used. The thread count is
per messenger thread, so when you use -m2 like Peter did in these runs,
he was ending up with 2xNUM_CPUs workers. That's why his wakeup
latencies are so high: he had double the work that I did.

I'll experiment with some of the suggestions he made too.

-chris

2023-08-14 14:43:18

by Chen Yu

Subject: Re: schbench v1.0

Hi Chris,

On 2023-04-21 at 14:14:10 -0400, Chris Mason wrote:
> On 4/20/23 11:05 AM, Peter Zijlstra wrote:
> > On Mon, Apr 17, 2023 at 10:10:25AM +0200, Chris Mason wrote:
> >
> >> F128 N10 EEVDF Linus
> >> Wakeup (usec): 99.0th: 755 1,266
> >> Request (usec): 99.0th: 25,632 22,304
> >> RPS (count): 50.0th: 4,280 4,376
> >>
> >> F128 N10 no-locking EEVDF Linus
> >> Wakeup (usec): 99.0th: 823 1,118
> >> Request (usec): 99.0th: 17,184 14,192
> >> RPS (count): 50.0th: 4,440 4,456
> >
> > With the below fixlet (against queue/sched/eevdf) on my measly IVB-EP
> > (2*10*2):
> >
> > ./schbench -F128 -n10 -C
> >
> > Request Latencies percentiles (usec) runtime 30 (s) (153800 total samples)
> > 90.0th: 6376 (35699 samples)
> > * 99.0th: 6440 (9055 samples)
> > 99.9th: 7048 (1345 samples)
> >
> > CFS
> >
> > schbench -m2 -F128 -n10 -r90 OTHER BATCH
> > Wakeup (usec): 99.0th: 6600 6328
> > Request (usec): 99.0th: 35904 14640
> > RPS (count): 50.0th: 5368 6104
> >
>
> Peter and I went back and forth a bit and now schbench git has a few fixes:
>
> - README.md updated
>
> - warmup time defaults to zero (disabling warmup). This was causing the
> stats inconsistency Peter noticed below.
>
> - RPS calculated more often. Every second instead of every reporting
> interval.
>
> - thread count scaled to CPU count when -m is used. The thread count is
> per messenger thread, so when you use -m2 like Peter did in these runs,
> he was ending up with 2xNUM_CPUs workers. That's why his wakeup
> latencies are so high: he had double the work that I did.
>
> I'll experiment with some of the suggestions he made too.
>

Sorry for popping up: while doing some EEVDF tests we encountered
an issue with the latest schbench and found this thread. It seems that
there is a minor corner case to be dealt with. Could you take a look and
see if the following change makes sense?

thanks,
Chenyu

From e84f7634ab611a560a866c887438a4ebd79935ed Mon Sep 17 00:00:00 2001
From: Chen Yu <[email protected]>
Date: Mon, 14 Aug 2023 05:00:06 -0700
Subject: [PATCH] schbench: fix per-cpu spin lock

On a system with 1 socket offline, the CPU ids might not
be contiguous. The per_cpu_locks array is allocated based on the
number of online CPUs but is not indexed contiguously:

CPU(s): 224
On-line CPU(s) list: 0-55,112-167
Off-line CPU(s) list: 56-111,168-223

per_cpu_locks is allocated with 112 elements, but can be
accessed beyond an index of 112. This can cause an unexpected
deadlock during the test.

Fix this by sizing per_cpu_locks by the number of
possible CPUs, although there could be some waste of space.

Signed-off-by: Chen Yu <[email protected]>
---
schbench.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/schbench.c b/schbench.c
index 937f1f2..3eaf1a4 100644
--- a/schbench.c
+++ b/schbench.c
@@ -1359,7 +1359,7 @@ int main(int ac, char **av)

matrix_size = sqrt(cache_footprint_kb * 1024 / 3 / sizeof(unsigned long));

- num_cpu_locks = get_nprocs();
+ num_cpu_locks = get_nprocs_conf();
per_cpu_locks = calloc(num_cpu_locks, sizeof(struct per_cpu_lock));
if (!per_cpu_locks) {
perror("unable to allocate memory for per cpu locks\n");
--
2.25.1
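
For reference, the glibc distinction the fix leans on: get_nprocs()
counts only the CPUs currently online, while get_nprocs_conf() counts
all the CPUs the system is configured with, so with part of the machine
offline sched_getcpu() can return an id at or above get_nprocs(). A
quick standalone check (illustration only):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
	/*
	 * With CPUs offline, the online count can be smaller than the
	 * highest CPU id, so an array sized by get_nprocs() but indexed
	 * by sched_getcpu() can be overrun.
	 */
	printf("online CPUs:     %d\n", get_nprocs());
	printf("configured CPUs: %d\n", get_nprocs_conf());
	printf("current CPU id:  %d\n", sched_getcpu());

	return 0;
}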

> -chris
>