2023-03-27 05:44:49

by Aaron Lu

Subject: [RFC PATCH] sched/fair: Make tg->load_avg per node

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on
the cpus of one node of a 2-socket/112-core/224-cpu Intel Sapphire
Rapids system:

10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
7.84% 7.43% [kernel.vmlinux] [k] update_load_avg

While the cpus of the other node normally see a lower cycle percentage:

4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
4.02% 3.40% [kernel.vmlinux] [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side.

The reason why only the cpus of one node have the bigger overhead is:
task_group is allocated on demand from a slab, and whichever cpu happens
to do the allocation, the allocated tg will be located on that node.
Accessing tg->load_avg then has a lower cost for cpus on the same node
and a higher cost for cpus of the remote node.

Tim Chen told me that PeterZ once mentioned a way to solve a similar
problem by making the counter per node, so do the same for tg->load_avg.
After this change, the worst numbers I saw during a 5-minute run from
both nodes are:

2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group

Another observation of this workload is: it has a lot of wakeup-time
task migrations, and that is the reason why update_load_avg() and
update_cfs_group() show noticeable cost. Running this workload in an
N-instances setup, where N >= 2 and sysbench's nr_threads is set to
1/N nr_cpu, wakeup-time task migrations are greatly reduced and the
overhead from the two above mentioned functions also drops a lot. It's
not yet clear to me why running multiple instances reduces task
migrations in the wakeup path.

Reported-by: Nitin Tekchandani <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/core.c | 24 +++++++++++++++++-------
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 5 +++--
kernel/sched/sched.h | 32 ++++++++++++++++++++++++--------
4 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a4918a1faa9..531d465038d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9759,9 +9759,6 @@ int in_sched_functions(unsigned long addr)
*/
struct task_group root_task_group;
LIST_HEAD(task_groups);
-
-/* Cacheline aligned slab cache for task_group */
-static struct kmem_cache *task_group_cache __read_mostly;
#endif

void __init sched_init(void)
@@ -9820,8 +9817,6 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
- task_group_cache = KMEM_CACHE(task_group, 0);
-
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
@@ -10219,7 +10214,6 @@ static void sched_free_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
- kmem_cache_free(task_group_cache, tg);
}

static void sched_free_group_rcu(struct rcu_head *rcu)
@@ -10241,11 +10235,27 @@ static void sched_unregister_group(struct task_group *tg)
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(struct task_group *parent)
{
+ size_t size = sizeof(struct task_group);
+ int __maybe_unused i, nodes;
struct task_group *tg;

- tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ nodes = num_possible_nodes();
+ size += nodes * sizeof(void *);
+ tg = kzalloc(size, GFP_KERNEL);
+ if (!tg)
+ return ERR_PTR(-ENOMEM);
+
+ for_each_node(i) {
+ tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
+ if (!tg->node_info[i])
+ return ERR_PTR(-ENOMEM);
+ }
+#else
+ tg = kzalloc(size, GFP_KERNEL);
if (!tg)
return ERR_PTR(-ENOMEM);
+#endif

if (!alloc_fair_sched_group(tg, parent))
goto err;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..2f20728aa093 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -645,7 +645,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
cfs_rq->tg_load_avg_contrib);
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
- atomic_long_read(&cfs_rq->tg->load_avg));
+ tg_load_avg(cfs_rq->tg));
#endif
#endif
#ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f8736991427..68ac015fab6a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3439,7 +3439,7 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)

load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

- tg_weight = atomic_long_read(&tg->load_avg);
+ tg_weight = tg_load_avg(tg);

/* Ensure tg_weight >= load */
tg_weight -= cfs_rq->tg_load_avg_contrib;
@@ -3608,6 +3608,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+ int node = cpu_to_node(cfs_rq->rq->cpu);

/*
* No need to update load_avg for root_task_group as it is not used.
@@ -3616,7 +3617,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
return;

if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
- atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..11a1aed4e8f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -365,6 +365,14 @@ struct cfs_bandwidth {
#endif
};

+struct tg_node_info {
+ /*
+ * load_avg can be heavily contended at clock tick and task
+ * enqueue/dequeue time, so put it in its own cacheline.
+ */
+ atomic_long_t load_avg ____cacheline_aligned;
+};
+
/* Task group related information */
struct task_group {
struct cgroup_subsys_state css;
@@ -379,14 +387,6 @@ struct task_group {
/* A positive value indicates that this is a SCHED_IDLE group. */
int idle;

-#ifdef CONFIG_SMP
- /*
- * load_avg can be heavily contended at clock tick time, so put
- * it in its own cacheline separated from the fields above which
- * will also be accessed at each tick.
- */
- atomic_long_t load_avg ____cacheline_aligned;
-#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
@@ -418,8 +418,24 @@ struct task_group {
struct uclamp_se uclamp[UCLAMP_CNT];
#endif

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ struct tg_node_info *node_info[];
+#endif
};

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+static inline long tg_load_avg(struct task_group *tg)
+{
+ long load_avg = 0;
+ int i;
+
+ for_each_node(i)
+ load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
+
+ return load_avg;
+}
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD


base-commit: c9c3395d5e3dcc6daee66c6908354d47bf98cb0c
--
2.39.2


2023-03-27 14:51:07

by Chen Yu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> [...]
>
Looks interesting. When sysbench is run as 1 instance with nr_threads =
nr_cpu, versus launching more than 1 instance of sysbench with nr_threads
set to 1/N * nr_cpu, do both cases have similar CPU utilization?
Currently, task wakeup inhibits wakeup migration if the system is overloaded.
[...]
> struct task_group *sched_create_group(struct task_group *parent)
> {
> + size_t size = sizeof(struct task_group);
> + int __maybe_unused i, nodes;
> struct task_group *tg;
>
> - tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> + nodes = num_possible_nodes();
> + size += nodes * sizeof(void *);
> + tg = kzalloc(size, GFP_KERNEL);
> + if (!tg)
> + return ERR_PTR(-ENOMEM);
> +
> + for_each_node(i) {
> + tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
> + if (!tg->node_info[i])
> + return ERR_PTR(-ENOMEM);
Do we need to free tg above in case of memory leak?

thanks,
Chenyu

2023-03-28 06:48:58

by Aaron Lu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

Hi Yu,

Thanks for taking a look.

On Mon, Mar 27, 2023 at 10:45:56PM +0800, Chen Yu wrote:
> On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> > [...]
> >
> Looks interesting. When sysbench is run as 1 instance with nr_threads =
> nr_cpu, versus launching more than 1 instance of sysbench with nr_threads
> set to 1/N * nr_cpu, do both cases have similar CPU utilization?
> Currently, task wakeup inhibits wakeup migration if the system is overloaded.

I think this is a good point. I did notice during a run that when CPU
util goes up, the migration number drops. And the 4-instance setup
generally has higher CPU util than the 1-instance setup.

I should also add that in the vanilla kernel, if the tg is allocated on
node 0, then task migrations happening on the remote node are the
deciding factor in the increased cost of update_cfs_group() and
update_load_avg(), because the remote node has a higher cost of
accessing tg->load_avg.

> [...]
> > struct task_group *sched_create_group(struct task_group *parent)
> > {
> > + size_t size = sizeof(struct task_group);
> > + int __maybe_unused i, nodes;
> > struct task_group *tg;
> >
> > - tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> > +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> > + nodes = num_possible_nodes();
> > + size += nodes * sizeof(void *);
> > + tg = kzalloc(size, GFP_KERNEL);
> > + if (!tg)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + for_each_node(i) {
> > + tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
> > + if (!tg->node_info[i])
> > + return ERR_PTR(-ENOMEM);
> Do we need to free tg above in case of memory leak?

Good catch, will fix this in next posting, thanks!

2023-03-28 12:10:13

by Dietmar Eggemann

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 27/03/2023 07:39, Aaron Lu wrote:
> [...]

I'm so far not seeing this issue on my Arm64 server.

$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

sysbench --table-size=100000 --tables=24 --threads=96 ...
/usr/share/sysbench/oltp_read_write.lua run

perf report | grep kernel | head

9.12% sysbench [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
5.26% sysbench [kernel.vmlinux] [k] finish_task_switch
1.56% sysbench [kernel.vmlinux] [k] __do_softirq
1.22% sysbench [kernel.vmlinux] [k] arch_local_irq_restore
1.12% sysbench [kernel.vmlinux] [k] __arch_copy_to_user
1.12% sysbench [kernel.vmlinux] [k] el0_svc_common.constprop.1
0.95% sysbench [kernel.vmlinux] [k] __fget_light
0.94% sysbench [kernel.vmlinux] [k] rwsem_spin_on_owner
0.85% sysbench [kernel.vmlinux] [k] tcp_ack
0.56% sysbench [kernel.vmlinux] [k] do_sys_poll

Is your postgres/sysbench running in a cgroup with cpu controller
attached? Mine isn't.

Maybe I'm doing something else differently?

2023-03-28 13:06:11

by Aaron Lu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

Hi Dietmar,

Thanks for taking a look.

On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> On 27/03/2023 07:39, Aaron Lu wrote:
> > [...]
>
> I'm so far not seeing this issue on my Arm64 server.
>
> $ numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
> 44 45 46 47
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
> 68 69 70 71
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
> 92 93 94 95
> node distances:
> node 0 1 2 3
> 0: 10 12 20 22
> 1: 12 10 22 24
> 2: 20 22 10 12
> 3: 22 24 12 10
>
> sysbench --table-size=100000 --tables=24 --threads=96 ...
> /usr/share/sysbench/oltp_read_write.lua run
>
> perf report | grep kernel | head
>
> 9.12% sysbench [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
> 5.26% sysbench [kernel.vmlinux] [k] finish_task_switch
> 1.56% sysbench [kernel.vmlinux] [k] __do_softirq
> 1.22% sysbench [kernel.vmlinux] [k] arch_local_irq_restore
> 1.12% sysbench [kernel.vmlinux] [k] __arch_copy_to_user
> 1.12% sysbench [kernel.vmlinux] [k] el0_svc_common.constprop.1
> 0.95% sysbench [kernel.vmlinux] [k] __fget_light
> 0.94% sysbench [kernel.vmlinux] [k] rwsem_spin_on_owner
> 0.85% sysbench [kernel.vmlinux] [k] tcp_ack
> 0.56% sysbench [kernel.vmlinux] [k] do_sys_poll

Did you test with a v6.3-rc based kernel?
I encountered another problem on those kernels and had to temporarily use
a v6.2 based kernel, maybe you have to do the same:
https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/

>
> Is your postgres/sysbench running in a cgroup with cpu controller
> attached? Mine isn't.

Yes, I had postgres and sysbench running in the same cgroup with cpu
controller enabled. docker created the cgroup directory under
/sys/fs/cgroup/system.slice/docker-XXX and cgroup.controllers has cpu
there.

>
> Maybe I'm doing something else differently?

Maybe. You didn't mention how you started postgres; if you start it from
the same session as sysbench and autogroup is enabled, then all those
tasks would be in the same autogroup task group, which should have the
same effect as my setup.

Anyway, you can try the following steps to see if you can reproduce this
problem on your Arm64 server:

1 docker pull postgres
2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
3 go inside the container
sudo docker exec -it $the_just_started_container_id bash
4 install sysbench inside container
apt update and apt install sysbench
5 prepare
root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
6 run
root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run

Note that I used 224 threads where this problem is visible. I also tried
96 and update_cfs_group() and update_load_avg() cost about 1% cycles then.

2023-03-29 12:42:29

by Dietmar Eggemann

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 28/03/2023 14:56, Aaron Lu wrote:
> Hi Dietmar,
>
> Thanks for taking a look.
>
> On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
>> On 27/03/2023 07:39, Aaron Lu wrote:

[...]

> Did you test with a v6.3-rc based kernel?
> I encountered another problem on those kernels and had to temporarily use
> a v6.2 based kernel, maybe you have to do the same:
> https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/

No, I'm also on v6.2.

>> Is your postgres/sysbench running in a cgroup with cpu controller
>> attached? Mine isn't.
>
> Yes, I had postgres and sysbench running in the same cgroup with cpu
> controller enabled. docker created the cgroup directory under
> /sys/fs/cgroup/system.slice/docker-XXX and cgroup.controllers has cpu
> there.

I'm running postgresql service directly on the machine. I boot now with
'cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1' so I can add the
cpu controller to:

system.slice/system-postgresql.slice/[email protected]

where the 96 postgres threads run and to

user.slice/user-1005.slice/session-4.scope

where the 96 sysbench threads run.

Checked with systemd-cgls and `cat /sys/kernel/debug/sched/debug` that
those threads are really running there.

Still not seeing `update_load_avg` or `update_cfs_group` in perf report,
only some very low values for `update_blocked_averages`.

Also added CFS BW throttling to both cgroups. No change.

Then I moved session-4.scope's shell into `[email protected]`
so that `postgres` and `sysbench` threads run in the same cgroup.

Didn't change much.

>> Maybe I'm doing something else differently?
>
> Maybe, you didn't mention how you started postgres, if you start it from
> the same session as sysbench and if autogroup is enabled, then all those
> tasks would be in the same autogroup taskgroup then it should have the
> same effect as my setup.

This should be now close to my setup running `postgres` and `sysbench`
in `[email protected]`.

> Anyway, you can try the following steps to see if you can reproduce this
> problem on your Arm64 server:
>
> 1 docker pull postgres
> 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
> 3 go inside the container
> sudo docker exec -it $the_just_started_container_id bash
> 4 install sysbench inside container
> apt update and apt install sysbench
> 5 prepare
> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
> 6 run
> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run

I would have to find time to learn how to set up docker on my machine
... But I use very similar values for the setup and sysbench test.

> Note that I used 224 threads where this problem is visible. I also tried
> 96 and update_cfs_group() and update_load_avg() cost about 1% cycles then.

True, I was hoping to see at least the 1% ;-)

2023-03-29 13:56:28

by Aaron Lu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> On 28/03/2023 14:56, Aaron Lu wrote:
> > Hi Dietmar,
> >
> > Thanks for taking a look.
> >
> > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> >> On 27/03/2023 07:39, Aaron Lu wrote:
>
> [...]
>
> > Did you test with a v6.3-rc based kernel?
> > I encountered another problem on those kernels and had to temporarily use
> > a v6.2 based kernel, maybe you have to do the same:
> > https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>
> No, I'm also on v6.2.
>
> >> Is your postgres/sysbench running in a cgroup with cpu controller
> >> attached? Mine isn't.
> >
> > Yes, I had postgres and sysbench running in the same cgroup with cpu
> > controller enabled. docker created the cgroup directory under
> > /sys/fs/cgroup/system.slice/docker-XXX and cgroup.controllers has cpu
> > there.
>
> I'm running postgresql service directly on the machine. I boot now with
> 'cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1' so I can add the
> cpu controller to:
>
> system.slice/system-postgresql.slice/[email protected]
>
> where the 96 postgres threads run and to
>
> user.slice/user-1005.slice/session-4.scope
>
> where the 96 sysbench threads run.
>
> Checked with systemd-cgls and `cat /sys/kernel/debug/sched/debug` that
> those threads are really running there.
>
> Still not seeing `update_load_avg` or `update_cfs_group` in perf report,
> only some very low values for `update_blocked_averages`.
>
> Also added CFS BW throttling to both cgroups. No change.
>
> Then I moved session-4.scope's shell into `[email protected]`
> so that `postgres` and `sysbench` threads run in the same cgroup.
>
> Didn't change much.
>
> >> Maybe I'm doing something else differently?
> >
> > Maybe, you didn't mention how you started postgres, if you start it from
> > the same session as sysbench and if autogroup is enabled, then all those
> > tasks would be in the same autogroup taskgroup then it should have the
> > same effect as my setup.
>
> This should be now close to my setup running `postgres` and `sysbench`
> in `[email protected]`.

Yes.

>
> > Anyway, you can try the following steps to see if you can reproduce this
> > problem on your Arm64 server:
> >
> > 1 docker pull postgres
> > 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
> > 3 go inside the container
> > sudo docker exec -it $the_just_started_container_id bash
> > 4 install sysbench inside container
> > apt update and apt install sysbench
> > 5 prepare
> > root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
> > 6 run
> > root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
>
> I would have to find time to learn how to set up docker on my machine
> ... But I use very similar values for the setup and sysbench test.

Agree. And docker just made running this workload easier but since you
already grouped all tasks in the same taskgroup, there is no need to
mess with docker.

>
> > Note that I used 224 threads where this problem is visible. I also tried
> > 96 and update_cfs_group() and update_load_avg() cost about 1% cycles then.
>
> True, I was hoping to see at least the 1% ;-)

One more question: when you do 'perf report', did you use
--sort=dso,symbol to aggregate different paths of the same target? Maybe
you have already done this, just want to confirm :-)

And I'm not sure if you did the profile on different nodes? I normally
choose 4 cpus of each node and do 'perf record -C' with them, to get an
idea of how the different nodes behave and also to reduce the record
size. Normally, when the tg is allocated on node 0, node 1's profile
would show higher cycles for update_cfs_group() and update_load_avg().

Another thing worth mentioning about this workload is that it has a lot
of wakeups and migrations during the initial 2 minutes or so, and the
large number of migrations is the reason for the increased cost of
update_cfs_group() and update_load_avg(). On my side, with sysbench's
nr_threads=224, the wakeup and migration numbers during a 5s window
(recorded about 1 minute after the workload is started) are:
@migrations[1]: 1821379
@migrations[0]: 4482989
@wakeups[1]: 3036473
@wakeups[0]: 6504496

The above numbers are derived from the bpftrace script below:

/* fires when select_task_rq_fair() returns the chosen wakeup cpu */
kretfunc:select_task_rq_fair
{
	/* count wakeups per node of the cpu doing the wakeup */
	@wakeups[numaid] = count();

	/* chosen cpu differs from the task's previous cpu: a wakeup migration */
	if (args->p->thread_info.cpu != retval) {
		@migrations[numaid] = count();
	}
}

/* stop after a 5s window */
interval:s:5
{
	exit();
}
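
(The script can be saved to a file, e.g. wakeup_mig.bt -- the name is
arbitrary -- and run with 'bpftrace wakeup_mig.bt'; kfunc/kretfunc
probes need a kernel built with BTF.)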

And during this time window, node1's profile shows update_cfs_group()'s
cycle percent is 12.45% and update_load_avg() is 7.99%.

I guess your setup may have a much lower migration number?

2023-03-29 14:57:17

by Chen Yu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 2023-03-29 at 14:36:44 +0200, Dietmar Eggemann wrote:
> [...]
According to Aaron's description, the relatively high cost of update_load_avg() was
caused by cross-node access. If the task group is allocated on node0, but some tasks
in this task group are load balanced to node1, the issue could be triggered more
easily? Disabling NUMA balancing might make that scenario easier to reproduce:
echo 0 > /sys/kernel/debug/sched/numa_balancing

thanks,
Chenyu

2023-03-30 17:58:18

by Daniel Jordan

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

Hi Aaron,

On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > On 28/03/2023 14:56, Aaron Lu wrote:
> > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > >> On 27/03/2023 07:39, Aaron Lu wrote:
> And not sure if you did the profile on different nodes? I normally chose
> 4 cpus of each node and do 'perf record -C' with them, to get an idea
> of how different node behaves and also to reduce the record size.
> Normally, when tg is allocated on node 0, then node 1's profile would
> show higher cycles for update_cfs_group() and update_load_avg().

Wouldn't the choice of CPUs have a big effect on the data, depending on
where sysbench or postgres tasks run?

> I guess your setup may have a much lower migration number?

I also tried this and sure enough didn't see as many migrations on
either of two systems. I used a container with your steps with a plain
6.2 kernel underneath, and the cpu controller is on (weight only). I
increased connections and buffer size to suit each machine, and took
Chen's suggestion to try without numa balancing.

AMD EPYC 7J13 64-Core Processor
2 sockets * 64 cores * 2 threads = 256 CPUs

sysbench: nr_threads=256

All observability data was taken at one minute in and using one tool at
a time.

@migrations[1]: 1113
@migrations[0]: 6152
@wakeups[1]: 8871744
@wakeups[0]: 9773321

# profiled the whole system for 5 seconds, reported w/ --sort=dso,symbol
0.38% update_load_avg
0.13% update_cfs_group

Using higher (nr_threads=380) and lower (nr_threads=128) load doesn't
change these numbers much.

The topology of my machine is different from yours, but it's the biggest
I have, and I'm assuming cpu count is more important than topology when
reproducing the remote accesses. I also tried on

Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
2 sockets * 32 cores * 2 thread = 128 CPUs

with nr_threads=128 and got similar results.

I'm guessing you've left all sched knobs alone? Maybe sharing those and
the kconfig would help close the gap. Migrations do increase to near
what you were seeing when I disable SIS_UTIL (with SIS_PROP already off)
on the Xeon, and I see 4-5% apiece for the functions you mention when
profiling, but turning SIS_UTIL off is an odd thing to do.
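
(In case it helps to compare: I toggled these via the debugfs sched
features file, roughly 'echo NO_SIS_UTIL > /sys/kernel/debug/sched/features'
and likewise NO_SIS_PROP; the path assumes CONFIG_SCHED_DEBUG.)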

2023-03-30 19:55:07

by Daniel Jordan

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> Hi Aaron,
>
> On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> > On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > > On 28/03/2023 14:56, Aaron Lu wrote:
> > > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > > >> On 27/03/2023 07:39, Aaron Lu wrote:
> > And not sure if you did the profile on different nodes? I normally chose
> > 4 cpus of each node and do 'perf record -C' with them, to get an idea
> > of how different node behaves and also to reduce the record size.
> > Normally, when tg is allocated on node 0, then node 1's profile would
> > show higher cycles for update_cfs_group() and update_load_avg().
>
> Wouldn't the choice of CPUs have a big effect on the data, depending on
> where sysbench or postgres tasks run?

Oh, probably not with NCPU threads though, since the load would be
pretty even, so I think I see where you're coming from.

> [...]

2023-03-31 04:12:19

by Aaron Lu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

Hi Daniel,

Thanks for taking a look.

On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
> On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> > Hi Aaron,
> >
> > On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> > > On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > > > On 28/03/2023 14:56, Aaron Lu wrote:
> > > > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > > > >> On 27/03/2023 07:39, Aaron Lu wrote:
> > > And not sure if you did the profile on different nodes? I normally chose
> > > 4 cpus of each node and do 'perf record -C' with them, to get an idea
> > > of how different node behaves and also to reduce the record size.
> > > Normally, when tg is allocated on node 0, then node 1's profile would
> > > show higher cycles for update_cfs_group() and update_load_avg().
> >
> > Wouldn't the choice of CPUs have a big effect on the data, depending on
> > where sysbench or postgres tasks run?
>
> Oh, probably not with NCPU threads though, since the load would be
> pretty even, so I think I see where you're coming from.

Yes I expect the load to be pretty even within the same node so didn't
do the full cpu record. I used to only record a single cpu on each node
to get a fast report time but settled on using 4 due to being paranoid :-)

>
> > > I guess your setup may have a much lower migration number?
> >
> > I also tried this and sure enough didn't see as many migrations on
> > either of two systems. I used a container with your steps with a plain
> > 6.2 kernel underneath, and the cpu controller is on (weight only). I
> > increased connections and buffer size to suit each machine, and took
> > Chen's suggestion to try without numa balancing.

I also tried disabling numa balancing per Chen's suggestion and I saw
slightly reduced migration on task wake up time for some runs but it
didn't make things dramatically different here.

> >
> > AMD EPYC 7J13 64-Core Processor
> > 2 sockets * 64 cores * 2 threads = 256 CPUs

I have a vague memory that AMD machines have a smaller LLC and that the
number of cpus sharing an LLC is also not large, 8-16?

I tend to think the number of cpus in an LLC plays a role here since
that's the domain where an idle cpu is searched for at task wakeup time.

> >
> > sysbench: nr_threads=256
> >
> > All observability data was taken at one minute in and using one tool at
> > a time.
> >
> > @migrations[1]: 1113
> > @migrations[0]: 6152
> > @wakeups[1]: 8871744
> > @wakeups[0]: 9773321

What a nice migration number!
Of the 10 million wakeups, there are only several thousand migrations,
compared to 4-5 million on my side.

> >
> > # profiled the whole system for 5 seconds, reported w/ --sort=dso,symbol
> > 0.38% update_load_avg
> > 0.13% update_cfs_group

With such a small number of migrations, the above percentages are expected.

> >
> > Using higher (nr_threads=380) and lower (nr_threads=128) load doesn't
> > change these numbers much.
> >
> > The topology of my machine is different from yours, but it's the biggest
> > I have, and I'm assuming cpu count is more important than topology when
> > reproducing the remote accesses. I also tried on
> >
> > Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
> > 2 sockets * 32 cores * 2 thread = 128 CPUs
> >
> > with nr_threads=128 and got similar results.
> >
> > I'm guessing you've left all sched knobs alone? Maybe sharing those and

Yes I've left all knobs alone. The server I have access to has Ubuntu
22.04.1 installed and here are the values of these knobs:
root@a4bf01924c30:/sys/kernel/debug/sched# sysctl -a |grep sched
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_energy_aware = 1
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024

root@a4bf01924c30:/sys/kernel/debug/sched# for i in `ls features *_ns *_ms preempt`; do echo "$i: `cat $i`"; done
features: GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD BASE_SLICE
idle_min_granularity_ns: 750000
latency_ns: 24000000
latency_warn_ms: 100
migration_cost_ns: 500000
min_granularity_ns: 3000000
preempt: none (voluntary) full
wakeup_granularity_ns: 4000000

> > the kconfig would help close the gap. Migrations do increase to near
> > what you were seeing when I disable SIS_UTIL (with SIS_PROP already off)
> > on the Xeon, and I see 4-5% apiece for the functions you mention when
> > profiling, but turning SIS_UTIL off is an odd thing to do.

As you can see from above, I didn't turn off SIS_UTIL.

And I've attached the kconfig; it's basically what the distro provided,
except I had to disable some configs related to module signing or
something like that.


Attachments:
config_spr (280.95 kB)

2023-03-31 15:57:46

by Dietmar Eggemann

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 31/03/2023 06:06, Aaron Lu wrote:
> Hi Daniel,
>
> Thanks for taking a look.
>
> On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
>> On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
>>> Hi Aaron,
>>>
>>> On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
>>>> On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
>>>>> On 28/03/2023 14:56, Aaron Lu wrote:
>>>>>> On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
>>>>>>> On 27/03/2023 07:39, Aaron Lu wrote:

[...]

>>> AMD EPYC 7J13 64-Core Processor
>>> 2 sockets * 64 cores * 2 threads = 256 CPUs
>
> I have a vague memory AMD machine has a smaller LLC and cpus belonging
> to the same LLC is also not many, 8-16?
>
> I tend to think cpu number of LLC play a role here since that's the
> domain where idle cpu is searched on task wake up time.
>
>>>
>>> sysbench: nr_threads=256
>>>
>>> All observability data was taken at one minute in and using one tool at
>>> a time.
>>>
>>> @migrations[1]: 1113
>>> @migrations[0]: 6152
>>> @wakeups[1]: 8871744
>>> @wakeups[0]: 9773321

Just a thought: Could the different behaviour come from different
CPU numbering schemes (consecutive/alternate)?

(1) My Arm server:

numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95


(2) Intel(R) Xeon(R) Silver 4314

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63

[...]

2023-04-03 08:09:16

by Aaron Lu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Fri, Mar 31, 2023 at 05:48:12PM +0200, Dietmar Eggemann wrote:
> On 31/03/2023 06:06, Aaron Lu wrote:
> > Hi Daniel,
> >
> > Thanks for taking a look.
> >
> > On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
> >> On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> >>> Hi Aaron,
> >>>
> >>> On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> >>>> On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> >>>>> On 28/03/2023 14:56, Aaron Lu wrote:
> >>>>>> On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> >>>>>>> On 27/03/2023 07:39, Aaron Lu wrote:
>
> [...]
>
> >>> AMD EPYC 7J13 64-Core Processor
> >>> 2 sockets * 64 cores * 2 threads = 256 CPUs
> >
> > I have a vague memory AMD machine has a smaller LLC and cpus belonging
> > to the same LLC is also not many, 8-16?
> >
> > I tend to think cpu number of LLC play a role here since that's the
> > domain where idle cpu is searched on task wake up time.
> >
> >>>
> >>> sysbench: nr_threads=256
> >>>
> >>> All observability data was taken at one minute in and using one tool at
> >>> a time.
> >>>
> >>> @migrations[1]: 1113
> >>> @migrations[0]: 6152
> >>> @wakeups[1]: 8871744
> >>> @wakeups[0]: 9773321
>
> Just a thought: Could the different behaviour come from different
> CPU numbering schemes (consecutive/alternate)?

Yeah they are indeed different, I also attached mine below. But I didn't
see a relationship between migration frequency and CPU numbering schemes,
maybe I missed something?

>
> (1) My Arm server:
>
> numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
>
>
> (2) Intel(R) Xeon(R) Silver 4314
>
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
>
> [...]

Machine I'm testing on:
Intel (R) Xeon (R) CPU Max 9480

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 0 size: 257686 MB
node 0 free: 251453 MB
node 1 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 1 size: 258009 MB
node 1 free: 247905 MB
node distances:
node 0 1
0: 10 26
1: 26 10

2023-04-04 08:28:41

by Chen Yu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> [...]
>
The same issue was found when running netperf on this platform.
According to the perf profile:

11.90% 11.84% swapper [kernel.kallsyms] [k] update_cfs_group
9.79% 9.43% swapper [kernel.kallsyms] [k] update_load_avg

these two functions took quite some cycles.

1. cpufreq governor set to performance, turbo disabled, C6 disabled
2. launches 224 instances of netperf, and each instance is:
netperf -4 -H 127.0.0.1 -t UDP_RR/TCP_RR -c -C -l 100 &
3. perf record -ag sleep 4

Also the test script could be downloaded via
https://github.com/yu-chen-surf/schedtests.git


thanks,
Chenyu

2023-04-04 13:36:21

by Aaron Lu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Tue, Apr 04, 2023 at 04:25:04PM +0800, Chen Yu wrote:
> On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> > [...]
> >
> The same issue was found when running netperf on this platform.
> According to the perf profile:

Thanks for the info!

>
> 11.90% 11.84% swapper [kernel.kallsyms] [k] update_cfs_group
> 9.79% 9.43% swapper [kernel.kallsyms] [k] update_load_avg
>
> these two functions took quite some cycles.
>
> 1. cpufreq governor set to performance, turbo disabled, C6 disabled

I didn't make any changes to the above and then tried netperf as you
described below, using UDP_RR, and the cycle percent of update_cfs_group
is even worse on my SPR system:

v6.3-rc5:
update_cfs_group()%: 27.39% on node0, 31.18% on node1

wakeups[0]: 5623199
wakeups[1]: 7919937
migrations[0]: 3871773
migrations[1]: 5606894

v6.3-rc5 + this_patch:
update_cfs_group()%: 24.12% on node0, 26.15% on node1
wakeups[0]: 13575203
wakeups[1]: 10749893
migrations[0]: 9153060
migrations[1]: 7508095

This patch helps a little bit, but not much. Will take a closer look.

> 2. launches 224 instances of netperf, and each instance is:
> netperf -4 -H 127.0.0.1 -t UDP_RR/TCP_RR -c -C -l 100 &
> 3. perf record -ag sleep 4
>
> Also the test script could be downloaded via
> https://github.com/yu-chen-surf/schedtests.git

Thanks,
Aaron

2023-04-04 15:29:54

by Aaron Lu

Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
[...]
> Another observation of this workload is: it has a lot of wakeup time
> task migrations and that is the reason why update_load_avg() and
> update_cfs_group() shows noticeable cost. Running this workload in N
> instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> task migrations on wake up time are greatly reduced and the overhead from
> the two above mentioned functions also dropped a lot. It's not clear to
> me why running in multiple instances can reduce task migrations on
> wakeup path yet.

Regarding this observation, I have some findings. The TLDR is: the 1-instance
setup's overall CPU util is lower than that of the N >= 2 instances setups
and, as a result, sis() is more likely to find idle cpus under the 1-instance
setup, which is why the 1-instance setup has more migrations.

More details:

For the 1-instance setup with nr_thread=nr_cpu=224, during a 5s window
there are 10 million calls to select_idle_sibling() and 6.1 million
migrations. Of these migrations, 4.6 million come from select_idle_cpu()
and 1.3 million come from recent_cpu.
mpstat of this time window:
Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 45.15 0.00 18.59 0.00 0.00 17.29 0.00 0.00 0.00 18.98
Average: 0 38.14 0.00 17.29 0.00 0.00 14.77 0.00 0.00 0.00 29.80
Average: 1 52.07 0.00 19.88 0.00 0.00 19.78 0.00 0.00 0.00 8.28


For the 4-instance setup with nr_thread=56, during a 5s window there are 15
million calls to select_idle_sibling() and only 30k migrations.
select_idle_cpu() is called 15 million times but only 23k of them passed
the sd_share->nr_idle_scan != 0 test.
mpstat of this time window:
Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 68.54 0.00 21.54 0.00 0.00 8.35 0.00 0.00 0.00 1.58
Average: 0 70.05 0.00 20.92 0.00 0.00 8.17 0.00 0.00 0.00 0.87
Average: 1 67.03 0.00 22.16 0.00 0.00 8.53 0.00 0.00 0.00 2.29

For the 8-instance setup with nr_thread=28, during a 5s window there are
16 million calls to select_idle_sibling() and 9.6k migrations.
select_idle_cpu() is called 16 million times but none of them passed the
sd_share->nr_idle_scan != 0 test.
mpstat of this time window:
Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 70.29 0.00 20.99 0.00 0.00 8.28 0.00 0.00 0.00 0.43
Average: 0 71.58 0.00 19.98 0.00 0.00 8.04 0.00 0.00 0.00 0.40
Average: 1 69.00 0.00 22.01 0.00 0.00 8.52 0.00 0.00 0.00 0.47

On a side note: when sd_share->nr_idle_scan > 0 and has_idle_core is true,
then sd_share->nr_idle_scan is not actually respected. Is this intended?
It seems to say: if there is an idle core, then let's try hard and ignore
SIS_UTIL to find that idle core, right?
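
Roughly, the gating I'm referring to looks like this (a simplified
paraphrase of select_idle_cpu(), not the exact source):

/* SIS_UTIL: derive a scan budget from recent LLC utilization */
nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
if (nr == 1)
	return -1;	/* LLC looks overloaded, skip the scan */

for_each_cpu_wrap(cpu, cpus, target + 1) {
	if (has_idle_core) {
		/*
		 * The budget is never decremented on this branch, so
		 * the whole LLC may be searched for an idle core.
		 */
		i = select_idle_core(p, cpu, cpus, &idle_cpu);
		if (i >= 0)
			return i;
	} else {
		if (!--nr)	/* scan budget exhausted */
			return -1;
		if (available_idle_cpu(cpu))
			return cpu;
	}
}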

2023-04-04 15:39:48

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 2023-04-04 at 23:15:40 +0800, Aaron Lu wrote:
> On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> [...]
> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() shows noticeable cost. Running this workload in N
> > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations on wake up time are greatly reduced and the overhead from
> > the two above mentioned functions also dropped a lot. It's not clear to
> > me why running in multiple instances can reduce task migrations on
> > wakeup path yet.
>
> Regarding this observation, I've some finding. The TLDR is: 1 instance
> setup's overall CPU util is lower than N >= 2 instances setup and as a
> result, under 1 instance setup, sis() is more likely to find idle cpus
> than N >= 2 instances setup and that is the reason why 1 instance setup
> has more migrations.
>
> More details:
>
> For 1 instance with nr_thread=nr_cpu=224 setup, during a 5s window,
> there are 10 million calls of select_idle_sibling() and 6.1 million
> migrations. Of these migrations, 4.6 million comes from select_idle_cpu(),
> 1.3 million comes from recent_cpu.
> mpstat of this time window:
> Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 45.15 0.00 18.59 0.00 0.00 17.29 0.00 0.00 0.00 18.98
> Average: 0 38.14 0.00 17.29 0.00 0.00 14.77 0.00 0.00 0.00 29.80
> Average: 1 52.07 0.00 19.88 0.00 0.00 19.78 0.00 0.00 0.00 8.28
>
>
> For 4 instance with nr_thread=56 setup, during a 5s window, there are 15
> million calls of select_idle_sibling() and only 30k migrations.
> select_idle_cpu() is called 15 million times but only 23k of them passed
> the sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 68.54 0.00 21.54 0.00 0.00 8.35 0.00 0.00 0.00 1.58
> Average: 0 70.05 0.00 20.92 0.00 0.00 8.17 0.00 0.00 0.00 0.87
> Average: 1 67.03 0.00 22.16 0.00 0.00 8.53 0.00 0.00 0.00 2.29
>
> For 8 instance with nr_thread=28 setup, during a 5s window, there are
> 16 million calls of select_idle_sibling() and 9.6k migrations.
> select_idle_cpu() is called 16 million times but none of them passed the
> sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 70.29 0.00 20.99 0.00 0.00 8.28 0.00 0.00 0.00 0.43
> Average: 0 71.58 0.00 19.98 0.00 0.00 8.04 0.00 0.00 0.00 0.40
> Average: 1 69.00 0.00 22.01 0.00 0.00 8.52 0.00 0.00 0.00 0.47
>
> On a side note: when sd_share->nr_idle_scan > 0 and has_idle_core is true,
> then sd_share->nr_idle_scan is not actually respected. Is this intended?
> It seems to say: if there is idle core, then let's try hard and ignore
> SIS_UTIL to find that idle core, right?
Yes, SIS_UTIL inherits the logic of SIS_PROP, which honors has_idle_core and
scans at any cost. Abel previously proposed a patch to make this more aggressive
by not allowing SIS_UTIL to take effect even when the system is overloaded.
https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu

2023-04-05 21:08:19

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Fri, Mar 31, 2023 at 12:06:09PM +0800, Aaron Lu wrote:
> Hi Daniel,
>
> Thanks for taking a look.
>
> On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
> > On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> > > Hi Aaron,
> > >
> > > On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> > > > On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > > > > On 28/03/2023 14:56, Aaron Lu wrote:
> > > > > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > > > > >> On 27/03/2023 07:39, Aaron Lu wrote:
> > > > And not sure if you did the profile on different nodes? I normally chose
> > > > 4 cpus of each node and do 'perf record -C' with them, to get an idea
> > > > of how different node behaves and also to reduce the record size.
> > > > Normally, when tg is allocated on node 0, then node 1's profile would
> > > > show higher cycles for update_cfs_group() and update_load_avg().
> > >
> > > Wouldn't the choice of CPUs have a big effect on the data, depending on
> > > where sysbench or postgres tasks run?
> >
> > Oh, probably not with NCPU threads though, since the load would be
> > pretty even, so I think I see where you're coming from.
>
> Yes I expect the load to be pretty even within the same node so didn't
> do the full cpu record. I used to only record a single cpu on each node
> to get a fast report time but settled on using 4 due to being paranoid :-)

Mhm :-) My 4-cpu profiles do look about the same as my all-system one.

> I have a vague memory AMD machine has a smaller LLC and cpus belonging
> to the same LLC is also not many, 8-16?

Yep, 16 cpus in every one. It's a 32M LLC.

> I tend to think cpu number of LLC play a role here since that's the
> domain where idle cpu is searched on task wake up time.

That's true, I hadn't thought of that.

> > > I'm guessing you've left all sched knobs alone? Maybe sharing those and
>
> Yes I've left all knobs alone. The server I have access to has Ubuntu
> 22.04.1 installed and here are the values of these knobs:
> root@a4bf01924c30:/sys/kernel/debug/sched# sysctl -a |grep sched
> kernel.sched_autogroup_enabled = 1
> kernel.sched_cfs_bandwidth_slice_us = 5000
> kernel.sched_child_runs_first = 0
> kernel.sched_deadline_period_max_us = 4194304
> kernel.sched_deadline_period_min_us = 100
> kernel.sched_energy_aware = 1
> kernel.sched_rr_timeslice_ms = 100
> kernel.sched_rt_period_us = 1000000
> kernel.sched_rt_runtime_us = 950000
> kernel.sched_schedstats = 0
> kernel.sched_util_clamp_max = 1024
> kernel.sched_util_clamp_min = 1024
> kernel.sched_util_clamp_min_rt_default = 1024
>
> root@a4bf01924c30:/sys/kernel/debug/sched# for i in `ls features *_ns *_ms preempt`; do echo "$i: `cat $i`"; done
> features: GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD BASE_SLICE
> idle_min_granularity_ns: 750000
> latency_ns: 24000000
> latency_warn_ms: 100
> migration_cost_ns: 500000
> min_granularity_ns: 3000000
> preempt: none (voluntary) full
> wakeup_granularity_ns: 4000000

Right, figures, all the same on my machines.

> And attached kconfig, it's basically what the distro provided except I
> had to disable some configs related to module sign or something like
> that.

Thanks for all the info. I got the same low perf percentages using your
kconfig as I got before (<0.50% for both functions), so maybe this just
takes a big machine with big LLCs, which sadly I haven't got.

2023-04-05 21:32:42

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Tue, Apr 04, 2023 at 11:15:40PM +0800, Aaron Lu wrote:
> On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> [...]
> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() shows noticeable cost. Running this workload in N
> > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations on wake up time are greatly reduced and the overhead from
> > the two above mentioned functions also dropped a lot. It's not clear to
> > me why running in multiple instances can reduce task migrations on
> > wakeup path yet.
>
> Regarding this observation, I've some finding. The TLDR is: 1 instance
> setup's overall CPU util is lower than N >= 2 instances setup and as a
> result, under 1 instance setup, sis() is more likely to find idle cpus
> than N >= 2 instances setup and that is the reason why 1 instance setup
> has more migrations.
>
> More details:
>
> For 1 instance with nr_thread=nr_cpu=224 setup, during a 5s window,
> there are 10 million calls of select_idle_sibling() and 6.1 million
> migrations. Of these migrations, 4.6 million comes from select_idle_cpu(),
> 1.3 million comes from recent_cpu.
> mpstat of this time window:
> Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 45.15 0.00 18.59 0.00 0.00 17.29 0.00 0.00 0.00 18.98
> Average: 0 38.14 0.00 17.29 0.00 0.00 14.77 0.00 0.00 0.00 29.80
> Average: 1 52.07 0.00 19.88 0.00 0.00 19.78 0.00 0.00 0.00 8.28

Aha. It takes one instance of nr_thread=(3/4)*nr_cpu to get this
overall utilization on my aforementioned Xeon, but then I see 3-4% on
both functions in the profile. I'll poke at it some more, see how bad
it hurts over more loads, might take a bit though.

> For 4 instance with nr_thread=56 setup, during a 5s window, there are 15
> million calls of select_idle_sibling() and only 30k migrations.
> select_idle_cpu() is called 15 million times but only 23k of them passed
> the sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 68.54 0.00 21.54 0.00 0.00 8.35 0.00 0.00 0.00 1.58
> Average: 0 70.05 0.00 20.92 0.00 0.00 8.17 0.00 0.00 0.00 0.87
> Average: 1 67.03 0.00 22.16 0.00 0.00 8.53 0.00 0.00 0.00 2.29
>
> For 8 instance with nr_thread=28 setup, during a 5s window, there are
> 16 million calls of select_idle_sibling() and 9.6k migrations.
> select_idle_cpu() is called 16 million times but none of them passed the
> sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average: NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 70.29 0.00 20.99 0.00 0.00 8.28 0.00 0.00 0.00 0.43
> Average: 0 71.58 0.00 19.98 0.00 0.00 8.04 0.00 0.00 0.00 0.40
> Average: 1 69.00 0.00 22.01 0.00 0.00 8.52 0.00 0.00 0.00 0.47
>
> On a side note: when sd_share->nr_idle_scan > 0 and has_idle_core is true,
> then sd_share->nr_idle_scan is not actually respected. Is this intended?
> It seems to say: if there is idle core, then let's try hard and ignore
> SIS_UTIL to find that idle core, right?

2023-04-12 12:24:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed there are times
> update_cfs_group() and update_load_avg() shows noticeable overhead on
> cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
>
> 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
>
> While cpus of the other node normally sees a lower cycle percent:
>
> 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
>
> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> with update_load_avg() being the write side and update_cfs_group() being
> the read side.
>
> The reason why only cpus of one node has bigger overhead is: task_group
> is allocated on demand from a slab and whichever cpu happens to do the
> allocation, the allocated tg will be located on that node and accessing
> to tg->load_avg will have a lower cost for cpus on the same node and
> a higer cost for cpus of the remote node.
>
> Tim Chen told me that PeterZ once mentioned a way to solve a similar
> problem by making a counter per node so do the same for tg->load_avg.

Yeah, I sent him a very similar patch (except horrible) some 5 years ago
for testing.

> After this change, the worst number I saw during a 5 minutes run from
> both nodes are:
>
> 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group

Nice!

> Another observation of this workload is: it has a lot of wakeup time
> task migrations and that is the reason why update_load_avg() and
> update_cfs_group() shows noticeable cost. Running this workload in N
> instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> task migrations on wake up time are greatly reduced and the overhead from
> the two above mentioned functions also dropped a lot. It's not clear to
> me why running in multiple instances can reduce task migrations on
> wakeup path yet.

If there is *any* idle time, we're rather aggressive at moving tasks to
idle CPUs in an attempt to avoid said idle time. If you're running at
about the number of CPUs there will be a fair amount of idle time and
hence significant migrations.

When you overload, there will no longer be idle time and hence no more
migrations.

> Reported-by: Nitin Tekchandani <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>

If you want to make things more complicated you can check
num_possible_nodes()==1 on boot and then avoid the indirection, but

2023-04-12 12:24:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Thu, Mar 30, 2023 at 01:45:57PM -0400, Daniel Jordan wrote:

> The topology of my machine is different from yours, but it's the biggest
> I have, and I'm assuming cpu count is more important than topology when
> reproducing the remote accesses. I also tried on

Core count definitely matters some, but the thing that really hurts is
the cross-node (and cross-cache, which for intel happens to be the same
set) atomics.

I suppose the thing to measure is where this cost rises most sharply on
the AMD platforms -- is that cross LLC or cross Node?

I mean, setting up the split at boot time is fairly straightforward and
we could equally well split at LLC.

2023-04-12 14:08:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Wed, Apr 12, 2023 at 01:59:36PM +0200, Peter Zijlstra wrote:
> On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > update_cfs_group() and update_load_avg() shows noticeable overhead on
> > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> >
> > 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> > 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
> >
> > While cpus of the other node normally sees a lower cycle percent:
> >
> > 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> > 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
> >
> > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side.
> >
> > The reason why only cpus of one node has bigger overhead is: task_group
> > is allocated on demand from a slab and whichever cpu happens to do the
> > allocation, the allocated tg will be located on that node and accessing
> > to tg->load_avg will have a lower cost for cpus on the same node and
> > a higer cost for cpus of the remote node.
> >
> > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > problem by making a counter per node so do the same for tg->load_avg.
>
> Yeah, I send him a very similar patch (except horrible) some 5 years ago
> for testing.
>
> > After this change, the worst number I saw during a 5 minutes run from
> > both nodes are:
> >
> > 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> > 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
>
> Nice!
>
> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() shows noticeable cost. Running this workload in N
> > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations on wake up time are greatly reduced and the overhead from
> > the two above mentioned functions also dropped a lot. It's not clear to
> > me why running in multiple instances can reduce task migrations on
> > wakeup path yet.
>
> If there is *any* idle time, we're rather agressive at moving tasks to
> idle CPUs in an attempt to avoid said idle time. If you're running at
> about the number of CPUs there will be a fair amount of idle time and
> hence significant migrations.
>
> When you overload, there will no longer be idle time and hence no more
> migrations.
>
> > Reported-by: Nitin Tekchandani <[email protected]>
> > Signed-off-by: Aaron Lu <[email protected]>
>
> If you want to make things more complicated you can check
> num_possible_nodes()==1 on boot and then avoid the indirection, but

... finishing emails is hard :-)

I think I meant to say we should check if there's measurable overhead on
single-node systems before we go overboard or somesuch.


2023-04-12 14:11:46

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Wed, Apr 12, 2023 at 01:59:36PM +0200, Peter Zijlstra wrote:
> On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > update_cfs_group() and update_load_avg() shows noticeable overhead on
> > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> >
> > 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> > 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
> >
> > While cpus of the other node normally sees a lower cycle percent:
> >
> > 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> > 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
> >
> > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side.
> >
> > The reason why only cpus of one node has bigger overhead is: task_group
> > is allocated on demand from a slab and whichever cpu happens to do the
> > allocation, the allocated tg will be located on that node and accessing
> > to tg->load_avg will have a lower cost for cpus on the same node and
> > a higer cost for cpus of the remote node.
> >
> > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > problem by making a counter per node so do the same for tg->load_avg.
>
> Yeah, I send him a very similar patch (except horrible) some 5 years ago
> for testing.
>
> > After this change, the worst number I saw during a 5 minutes run from
> > both nodes are:
> >
> > 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> > 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
>
> Nice!

:-)

> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() shows noticeable cost. Running this workload in N
> > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations on wake up time are greatly reduced and the overhead from
> > the two above mentioned functions also dropped a lot. It's not clear to
> > me why running in multiple instances can reduce task migrations on
> > wakeup path yet.
>
> If there is *any* idle time, we're rather agressive at moving tasks to
> idle CPUs in an attempt to avoid said idle time. If you're running at
> about the number of CPUs there will be a fair amount of idle time and
> hence significant migrations.

Yes indeed.

> When you overload, there will no longer be idle time and hence no more
> migrations.

True. My later profile showed the multi-instance case has much lower
idle time compared to the 1-instance setup, 0.4%-2% vs ~20%, and thus far
fewer migrations on wakeup, thousands vs millions in a 5s window.

> > Reported-by: Nitin Tekchandani <[email protected]>
> > Signed-off-by: Aaron Lu <[email protected]>
>
> If you want to make things more complicated you can check
> num_possible_nodes()==1 on boot and then avoid the indirection, but

Ah right, will think about how to achieve this.
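
One possible shape for it (just a sketch with made-up names, not from the
posted patch) is a static key flipped at boot when there is more than one
node, with the per-node summation only taken behind it:

/*
 * Sketch only: on single-node machines the key stays false and the read
 * collapses to a plain atomic_long_read() of node 0's counter.
 */
static DEFINE_STATIC_KEY_FALSE(tg_load_avg_multi_node);

static int __init tg_load_avg_multi_node_init(void)
{
	if (num_possible_nodes() > 1)
		static_branch_enable(&tg_load_avg_multi_node);
	return 0;
}
early_initcall(tg_load_avg_multi_node_init);

static inline long tg_load_avg(struct task_group *tg)
{
	long sum;
	int n;

	if (!static_branch_unlikely(&tg_load_avg_multi_node))
		return atomic_long_read(&tg->node_info[0]->load_avg);

	sum = 0;
	for_each_node(n)
		sum += atomic_long_read(&tg->node_info[n]->load_avg);
	return sum;
}

The write side could do the same and always land on node_info[0] when the
key is off, so the only remaining cost on single-node machines would be
the extra pointer chase through node_info.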

Thanks for your comments.

2023-04-12 14:17:18

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Wed, Apr 12, 2023 at 03:58:28PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 01:59:36PM +0200, Peter Zijlstra wrote:
> > On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> > > When using sysbench to benchmark Postgres in a single docker instance
> > > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > > update_cfs_group() and update_load_avg() shows noticeable overhead on
> > > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> > >
> > > 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> > > 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
> > >
> > > While cpus of the other node normally sees a lower cycle percent:
> > >
> > > 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> > > 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
> > >
> > > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > > with update_load_avg() being the write side and update_cfs_group() being
> > > the read side.
> > >
> > > The reason why only cpus of one node has bigger overhead is: task_group
> > > is allocated on demand from a slab and whichever cpu happens to do the
> > > allocation, the allocated tg will be located on that node and accessing
> > > to tg->load_avg will have a lower cost for cpus on the same node and
> > > a higer cost for cpus of the remote node.
> > >
> > > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > > problem by making a counter per node so do the same for tg->load_avg.
> >
> > Yeah, I send him a very similar patch (except horrible) some 5 years ago
> > for testing.
> >
> > > After this change, the worst number I saw during a 5 minutes run from
> > > both nodes are:
> > >
> > > 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> > > 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
> >
> > Nice!
> >
> > > Another observation of this workload is: it has a lot of wakeup time
> > > task migrations and that is the reason why update_load_avg() and
> > > update_cfs_group() shows noticeable cost. Running this workload in N
> > > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > > task migrations on wake up time are greatly reduced and the overhead from
> > > the two above mentioned functions also dropped a lot. It's not clear to
> > > me why running in multiple instances can reduce task migrations on
> > > wakeup path yet.
> >
> > If there is *any* idle time, we're rather agressive at moving tasks to
> > idle CPUs in an attempt to avoid said idle time. If you're running at
> > about the number of CPUs there will be a fair amount of idle time and
> > hence significant migrations.
> >
> > When you overload, there will no longer be idle time and hence no more
> > migrations.
> >
> > > Reported-by: Nitin Tekchandani <[email protected]>
> > > Signed-off-by: Aaron Lu <[email protected]>
> >
> > If you want to make things more complicated you can check
> > num_possible_nodes()==1 on boot and then avoid the indirection, but
>
> ... finishing emails is hard :-)
>
> I think I meant to say we should check if there's measurable overhead on
> single-node systems before we go overboard or somesuch.

Got it, hopefully there is no measurable overhead :-)

2023-04-20 21:04:02

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Wed, Apr 12, 2023 at 02:07:36PM +0200, Peter Zijlstra wrote:
> On Thu, Mar 30, 2023 at 01:45:57PM -0400, Daniel Jordan wrote:
>
> > The topology of my machine is different from yours, but it's the biggest
> > I have, and I'm assuming cpu count is more important than topology when
> > reproducing the remote accesses. I also tried on
>
> Core count definitely matters some, but the thing that really hurts is
> the cross-node (and cross-cache, which for intel happens to be the same
> set) atomics.
>
> I suppose the thing to measure is where this cost rises most sharply on
> the AMD platforms -- is that cross LLC or cross Node?
>
> I mean, setting up the split at boot time is fairly straight forward and
> we could equally well split at LLC.

To check the cross LLC case, I bound all postgres and sysbench tasks to
a node. The two functions aren't free then on either AMD or Intel,
multiple LLCs or not, but the pain is a bit greater in the cross node
(unbound) case.

The read side (update_cfs_group) gets more expensive with per-node tg
load_avg on AMD, especially cross node--those are the biggest diffs.

These are more containerized sysbench runs, just the same as before.
Base is 6.2, test is 6.2 plus this RFC. Each number under base or test
is the average over ten runs of the profile percent of the function
measured for 5 seconds, 60 seconds into the run. I ran the experiment a
second time, and the numbers were fairly similar to what's below.

AMD EPYC 7J13 64-Core Processor (NPS1)
2 sockets * 64 cores * 2 threads = 256 CPUs

update_load_avg profile% update_cfs_group profile%
affinity nr_threads base test diff base test diff
unbound 96 0.7 0.6 -0.1 0.3 0.6 0.4
unbound 128 0.8 0.7 0.0 0.3 0.7 0.4
unbound 160 2.4 1.7 -0.7 1.2 2.3 1.1
unbound 192 2.3 1.7 -0.6 0.9 2.4 1.5
unbound 224 0.9 0.9 0.0 0.3 0.6 0.3
unbound 256 0.4 0.4 0.0 0.1 0.2 0.1
node0 48 0.7 0.6 -0.1 0.3 0.6 0.3
node0 64 0.7 0.7 -0.1 0.3 0.6 0.3
node0 80 1.4 1.3 -0.1 0.3 0.6 0.3
node0 96 1.5 1.4 -0.1 0.3 0.6 0.3
node0 112 0.8 0.8 0.0 0.2 0.4 0.2
node0 128 0.4 0.4 0.0 0.1 0.2 0.1
node1 48 0.7 0.6 -0.1 0.3 0.6 0.3
node1 64 0.7 0.6 -0.1 0.3 0.6 0.3
node1 80 1.4 1.2 -0.1 0.3 0.6 0.3
node1 96 1.4 1.3 -0.2 0.3 0.6 0.3
node1 112 0.8 0.7 -0.1 0.2 0.3 0.2
node1 128 0.4 0.4 0.0 0.1 0.2 0.1

Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
2 sockets * 32 cores * 2 thread = 128 CPUs

update_load_avg profile% update_cfs_group profile%
affinity nr_threads base test diff base test diff
unbound 48 0.4 0.4 0.0 0.4 0.5 0.1
unbound 64 0.5 0.5 0.0 0.5 0.6 0.1
unbound 80 2.0 1.8 -0.2 2.7 2.4 -0.3
unbound 96 3.3 2.8 -0.5 3.6 3.3 -0.3
unbound 112 2.8 2.6 -0.2 4.1 3.3 -0.8
unbound 128 0.4 0.4 0.0 0.4 0.4 0.1
node0 24 0.4 0.4 0.0 0.3 0.5 0.2
node0 32 0.5 0.5 0.0 0.3 0.4 0.2
node0 40 1.0 1.1 0.1 0.7 0.8 0.1
node0 48 1.5 1.6 0.1 0.8 0.9 0.1
node0 56 1.8 1.9 0.1 0.8 0.9 0.1
node0 64 0.4 0.4 0.0 0.2 0.4 0.1
node1 24 0.4 0.5 0.0 0.3 0.5 0.2
node1 32 0.4 0.5 0.0 0.3 0.5 0.2
node1 40 1.0 1.1 0.0 0.7 0.8 0.1
node1 48 1.6 1.6 0.1 0.8 0.9 0.1
node1 56 1.8 1.9 0.1 0.8 0.9 0.1
node1 64 0.4 0.4 0.0 0.2 0.4 0.1

2023-04-21 15:12:22

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Thu, Apr 20, 2023 at 04:52:01PM -0400, Daniel Jordan wrote:
> On Wed, Apr 12, 2023 at 02:07:36PM +0200, Peter Zijlstra wrote:
> > On Thu, Mar 30, 2023 at 01:45:57PM -0400, Daniel Jordan wrote:
> >
> > > The topology of my machine is different from yours, but it's the biggest
> > > I have, and I'm assuming cpu count is more important than topology when
> > > reproducing the remote accesses. I also tried on
> >
> > Core count definitely matters some, but the thing that really hurts is
> > the cross-node (and cross-cache, which for intel happens to be the same
> > set) atomics.
> >
> > I suppose the thing to measure is where this cost rises most sharply on
> > the AMD platforms -- is that cross LLC or cross Node?
> >
> > I mean, setting up the split at boot time is fairly straight forward and
> > we could equally well split at LLC.
>
> To check the cross LLC case, I bound all postgres and sysbench tasks to
> a node. The two functions aren't free then on either AMD or Intel,
> multiple LLCs or not, but the pain is a bit greater in the cross node
> (unbound) case.
>
> The read side (update_cfs_group) gets more expensive with per-node tg
> load_avg on AMD, especially cross node--those are the biggest diffs.
>
> These are more containerized sysbench runs, just the same as before.
> Base is 6.2, test is 6.2 plus this RFC. Each number under base or test
> is the average over ten runs of the profile percent of the function
> measured for 5 seconds, 60 seconds into the run. I ran the experiment a
> second time, and the numbers were fairly similar to what's below.
>
> AMD EPYC 7J13 64-Core Processor (NPS1)
> 2 sockets * 64 cores * 2 threads = 256 CPUs
>
> update_load_avg profile% update_cfs_group profile%
> affinity nr_threads base test diff base test diff
> unbound 96 0.7 0.6 -0.1 0.3 0.6 0.4
> unbound 128 0.8 0.7 0.0 0.3 0.7 0.4
> unbound 160 2.4 1.7 -0.7 1.2 2.3 1.1
> unbound 192 2.3 1.7 -0.6 0.9 2.4 1.5
> unbound 224 0.9 0.9 0.0 0.3 0.6 0.3
> unbound 256 0.4 0.4 0.0 0.1 0.2 0.1

Is it possible to show a per-node profile for the two functions? I wonder
how the per-node profile changes with and without this patch on Milan.
And for the vanilla kernel, it would be good to know on which node the
struct task_group is allocated. I used the below script to fetch this info:
kretfunc:sched_create_group
{
	$root = kaddr("root_task_group");
	if (args->parent == $root) {
		return;
	}

	printf("cpu%d, node%d: tg=0x%lx, parent=%s\n", cpu, numaid,
		retval, str(args->parent->css.cgroup->kn->name));
}
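
(For reference, the snippet can be saved to a file and run with bpftrace
directly, started before the test cgroups are created, so that each
non-root task group's allocating cpu and node are printed as the groups
are made.)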

BTW, is the score (transactions) of the workload stable? If so, how does
the score change when the patch is applied?

> node0 48 0.7 0.6 -0.1 0.3 0.6 0.3
> node0 64 0.7 0.7 -0.1 0.3 0.6 0.3
> node0 80 1.4 1.3 -0.1 0.3 0.6 0.3
> node0 96 1.5 1.4 -0.1 0.3 0.6 0.3
> node0 112 0.8 0.8 0.0 0.2 0.4 0.2
> node0 128 0.4 0.4 0.0 0.1 0.2 0.1
> node1 48 0.7 0.6 -0.1 0.3 0.6 0.3
> node1 64 0.7 0.6 -0.1 0.3 0.6 0.3
> node1 80 1.4 1.2 -0.1 0.3 0.6 0.3
> node1 96 1.4 1.3 -0.2 0.3 0.6 0.3
> node1 112 0.8 0.7 -0.1 0.2 0.3 0.2
> node1 128 0.4 0.4 0.0 0.1 0.2 0.1

I can see why the cost of update_cfs_group() slightly increased: in these
node-bound runs there is no cross-node access to tg->load_avg to begin
with, so the patched kernel doesn't provide any benefit but only incurs
some overhead due to the indirect access to tg->load_avg. But why did
update_load_avg()'s cost drop? I expected it to be roughly the same after
the patch, or slightly increased.

>
> Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
> 2 sockets * 32 cores * 2 thread = 128 CPUs
>
> update_load_avg profile% update_cfs_group profile%
> affinity nr_threads base test diff base test diff
> unbound 48 0.4 0.4 0.0 0.4 0.5 0.1
> unbound 64 0.5 0.5 0.0 0.5 0.6 0.1
> unbound 80 2.0 1.8 -0.2 2.7 2.4 -0.3
> unbound 96 3.3 2.8 -0.5 3.6 3.3 -0.3
> unbound 112 2.8 2.6 -0.2 4.1 3.3 -0.8
> unbound 128 0.4 0.4 0.0 0.4 0.4 0.1

This is in line with my test on SPR, just the cost is much lower on
Icelake.

> node0 24 0.4 0.4 0.0 0.3 0.5 0.2
> node0 32 0.5 0.5 0.0 0.3 0.4 0.2
> node0 40 1.0 1.1 0.1 0.7 0.8 0.1
> node0 48 1.5 1.6 0.1 0.8 0.9 0.1
> node0 56 1.8 1.9 0.1 0.8 0.9 0.1
> node0 64 0.4 0.4 0.0 0.2 0.4 0.1
> node1 24 0.4 0.5 0.0 0.3 0.5 0.2
> node1 32 0.4 0.5 0.0 0.3 0.5 0.2
> node1 40 1.0 1.1 0.0 0.7 0.8 0.1
> node1 48 1.6 1.6 0.1 0.8 0.9 0.1
> node1 56 1.8 1.9 0.1 0.8 0.9 0.1
> node1 64 0.4 0.4 0.0 0.2 0.4 0.1

And the slight increase on both the read side and the write side seems to
suggest it is due to the indirect access introduced in this patch,
especially on the read side where a summation of all nodes' values is
performed; that's probably why the read side's increase is larger: 0.1 - 0.2
vs 0.0 - 0.1.

Thanks for sharing these data.

2023-04-22 04:13:20

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
>
> The reason why only cpus of one node has bigger overhead is: task_group
> is allocated on demand from a slab and whichever cpu happens to do the
> allocation, the allocated tg will be located on that node and accessing
> to tg->load_avg will have a lower cost for cpus on the same node and
> a higer cost for cpus of the remote node.
[...]
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> + int node = cpu_to_node(cfs_rq->rq->cpu);
>
> /*
> * No need to update load_avg for root_task_group as it is not used.
> @@ -3616,7 +3617,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> return;
>
> if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> - atomic_long_add(delta, &cfs_rq->tg->load_avg);
> + atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
When entering enqueue_entity(cfs_rq, se) -> update_tg_load_avg(cfs_rq),
cfs_rq->rq->cpu is not necessarily the current cpu, so the node returned by
cpu_to_node(cfs_rq->rq->cpu) is not necessarily the node of the current
CPU. Would atomic_add still introduce cross-node overhead due to remote
access to cfs_rq->tg->node_info[node]->load_avg in this case, or do I miss
something?

thanks,
Chenyu

2023-04-22 06:20:26

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Sat, Apr 22, 2023 at 12:01:59PM +0800, Chen Yu wrote:
> On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> >
> > The reason why only cpus of one node has bigger overhead is: task_group
> > is allocated on demand from a slab and whichever cpu happens to do the
> > allocation, the allocated tg will be located on that node and accessing
> > to tg->load_avg will have a lower cost for cpus on the same node and
> > a higer cost for cpus of the remote node.
> [...]
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > {
> > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > + int node = cpu_to_node(cfs_rq->rq->cpu);
> >
> > /*
> > * No need to update load_avg for root_task_group as it is not used.
> > @@ -3616,7 +3617,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > return;
> >
> > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > - atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > + atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
> When entered enqueue_entity(cfs_rq, se) -> update_tg_load_avg(cfs_rq)
> the cfs_rq->rq->cpu is not necessary the current cpu, so the node returned by
> cpu_to_node(cfs_rq->rq->cpu) is not necessary the current node as the current
> CPU, would atomic_add still introduce cross-node overhead due to remote access
> to cfs_rq->tg->node_info[node]->load_avg in this case, or do I miss something?

That's a good point.

The chance of the cpu being different is high, but the chance of the node
being different is pretty low. The wakeup path will not do cross-LLC
activation with TTWU_QUEUE, but the load balance path does make it possible
to dequeue a task that is on a remote node, and I think that's the only
path where the two cpus (current vs cfs_rq->rq->cpu) can be on different
nodes.

A quick test shows that during a 5s window of running hackbench on a VM,
the two nodes derived from current and cfs_rq->rq->cpu differed fewer than
100 times while they were equal 992548 times.

I'll switch to using smp_processor_id() in the next posting since it's the
right thing to do, thanks for spotting this!
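
The change would be tiny, roughly like this on top of the RFC (a sketch,
not the exact diff I'll post):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
	/*
	 * Index the per-node counter by the node of the cpu doing the
	 * update; callers hold the rq lock, so smp_processor_id() is
	 * stable here.
	 */
	int node = cpu_to_node(smp_processor_id());

	/* No need to update load_avg for root_task_group as it is not used. */
	if (cfs_rq->tg == &root_task_group)
		return;

	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}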

2023-05-03 19:48:02

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Fri, Apr 21, 2023 at 11:05:59PM +0800, Aaron Lu wrote:
> On Thu, Apr 20, 2023 at 04:52:01PM -0400, Daniel Jordan wrote:
> > AMD EPYC 7J13 64-Core Processor (NPS1)
> > 2 sockets * 64 cores * 2 threads = 256 CPUs
> >
> > update_load_avg profile% update_cfs_group profile%
> > affinity nr_threads base test diff base test diff
> > unbound 96 0.7 0.6 -0.1 0.3 0.6 0.4
> > unbound 128 0.8 0.7 0.0 0.3 0.7 0.4
> > unbound 160 2.4 1.7 -0.7 1.2 2.3 1.1
> > unbound 192 2.3 1.7 -0.6 0.9 2.4 1.5
> > unbound 224 0.9 0.9 0.0 0.3 0.6 0.3
> > unbound 256 0.4 0.4 0.0 0.1 0.2 0.1
>
> Is it possible to show per-node profile for the two functions? I wonder
> how the per-node profile changes with and without this patch on Milan.
> And for vanilla kernel, it would be good to know on which node the struct
> task_group is allocated. I used below script to fetch this info:
> kretfunc:sched_create_group
> {
> $root = kaddr("root_task_group");
> if (args->parent == $root) {
> return;
> }
>
> printf("cpu%d, node%d: tg=0x%lx, parent=%s\n", cpu, numaid,
> retval, str(args->parent->css.cgroup->kn->name));
> }

That's helpful, the nid below comes from this. The node happened to be
different between base and test kernels on both machines, so that's one
less way the experiment is controlled, but for the unbound case, where
tasks are presumably spread fairly evenly, I'm not sure how much it
matters, especially given that the per-node profile numbers are fairly
close to each other.


Data below, same parameters and times as the last mail.

> BTW, is the score(transactions) of the workload stable? If so, how the
> score change when the patch is applied?

Transactions seem to be mostly stable but unfortunately regress overall on both
machines.

FWIW, t-test compares the two sets of ten iterations apiece. The higher the
percentage, the higher the confidence that the difference is significant.


AMD EPYC 7J13 64-Core Processor (NPS1)
2 sockets * 64 cores * 2 threads = 256 CPUs

transactions per second

diff base test
----------------- ------------------ ------------------
tps tps
affinity nr_threads (%diff) (t-test) tps std% nid tps std% nid
unbound 96 -0.8% 100% 128,450 0% 1 127,433 0% 0
unbound 128 -1.0% 100% 138,471 0% 1 137,099 0% 0
unbound 160 -1.2% 100% 136,829 0% 1 135,170 0% 0
unbound 192 0.4% 95% 152,767 0% 1 153,336 0% 0
unbound 224 -0.2% 81% 179,946 0% 1 179,620 0% 0
unbound 256 -0.2% 71% 203,920 0% 1 203,583 0% 0
node0 48 0.1% 46% 69,635 0% 0 69,719 0% 0
node0 64 -0.1% 69% 75,213 0% 0 75,163 0% 0
node0 80 -0.4% 100% 72,520 0% 0 72,217 0% 0
node0 96 -0.2% 89% 81,345 0% 0 81,210 0% 0
node0 112 -0.3% 98% 96,174 0% 0 95,855 0% 0
node0 128 -0.7% 100% 111,813 0% 0 111,045 0% 0
node1 48 0.3% 78% 69,985 1% 1 70,200 1% 1
node1 64 0.6% 100% 75,770 0% 1 76,231 0% 1
node1 80 0.3% 100% 73,329 0% 1 73,567 0% 1
node1 96 0.4% 99% 82,222 0% 1 82,556 0% 1
node1 112 0.1% 62% 96,573 0% 1 96,689 0% 1
node1 128 -0.2% 69% 111,614 0% 1 111,435 0% 1

update_load_avg profile%

all_nodes node0 node1
---------------- ---------------- ----------------
affinity nr_threads base test diff base test diff base test diff
unbound 96 0.7 0.6 -0.1 0.7 0.6 -0.1 0.7 0.6 -0.1
unbound 128 0.8 0.7 -0.1 0.8 0.7 -0.1 0.8 0.7 -0.1
unbound 160 2.3 1.7 -0.7 2.5 1.7 -0.8 2.2 1.6 -0.5
unbound 192 2.2 1.6 -0.6 2.5 1.8 -0.7 2.0 1.4 -0.6
unbound 224 0.9 0.8 -0.1 1.1 0.7 -0.3 0.8 0.8 0.0
unbound 256 0.4 0.4 0.0 0.4 0.4 0.0 0.4 0.4 0.0
node0 48 0.7 0.6 -0.1
node0 64 0.8 0.7 -0.2
node0 80 2.0 1.4 -0.7
node0 96 2.3 1.4 -0.9
node0 112 1.0 0.8 -0.2
node0 128 0.5 0.4 0.0
node1 48 0.7 0.6 -0.1
node1 64 0.8 0.6 -0.1
node1 80 1.4 1.2 -0.2
node1 96 1.5 1.3 -0.2
node1 112 0.8 0.7 -0.1
node1 128 0.4 0.4 -0.1

update_cfs_group profile%

all_nodes node0 node1
---------------- ---------------- ----------------
affinity nr_threads base test diff base test diff base test diff
unbound 96 0.3 0.6 0.3 0.3 0.6 0.3 0.3 0.6 0.3
unbound 128 0.3 0.6 0.3 0.3 0.6 0.3 0.3 0.7 0.4
unbound 160 1.1 2.5 1.4 1.3 2.2 0.9 0.9 2.8 1.9
unbound 192 0.9 2.6 1.7 1.1 2.4 1.3 0.7 2.8 2.1
unbound 224 0.3 0.8 0.5 0.4 0.6 0.3 0.2 0.9 0.6
unbound 256 0.1 0.2 0.1 0.1 0.2 0.1 0.1 0.2 0.1
node0 48 0.4 0.6 0.2
node0 64 0.3 0.6 0.3
node0 80 0.7 0.6 -0.1
node0 96 0.6 0.6 0.0
node0 112 0.3 0.4 0.1
node0 128 0.1 0.2 0.1
node1 48 0.3 0.6 0.3
node1 64 0.3 0.6 0.3
node1 80 0.3 0.6 0.3
node1 96 0.3 0.6 0.3
node1 112 0.2 0.3 0.2
node1 128 0.1 0.2 0.1


Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
2 sockets * 32 cores * 2 thread = 128 CPUs

transactions per second

diff base test
----------------- ------------------ ------------------
tps tps
affinity nr_threads (%diff) (t-test) tps std% nid tps std% nid
unbound 48 -0.9% 100% 75,500 0% 1 74,834 0% 0
unbound 64 -0.4% 100% 81,687 0% 1 81,368 0% 0
unbound 80 -0.4% 100% 78,620 0% 1 78,281 0% 0
unbound 96 -0.5% 74% 78,949 1% 1 78,580 1% 0
unbound 112 -2.9% 87% 94,189 3% 1 91,458 5% 0
unbound 128 -1.4% 100% 117,557 0% 1 115,921 0% 0
node0 24 -0.7% 100% 38,601 0% 0 38,333 0% 0
node0 32 -1.2% 100% 41,539 0% 0 41,038 0% 0
node0 40 -1.6% 100% 42,325 0% 0 41,662 0% 0
node0 48 -1.3% 100% 41,956 0% 0 41,404 0% 0
node0 56 -1.3% 100% 42,115 0% 0 41,569 0% 0
node0 64 -1.0% 100% 62,431 0% 0 61,784 0% 0
node1 24 0.0% 1% 38,752 0% 1 38,752 0% 1
node1 32 0.9% 100% 42,568 0% 1 42,943 0% 1
node1 40 -0.2% 87% 43,452 0% 1 43,358 0% 1
node1 48 -0.5% 100% 43,047 0% 1 42,831 0% 1
node1 56 -0.5% 100% 43,464 0% 1 43,259 0% 1
node1 64 0.5% 100% 64,111 0% 1 64,450 0% 1

update_load_avg profile%

all_nodes node0 node1
---------------- ---------------- ----------------
affinity nr_threads base test diff base test diff base test diff
unbound 48 0.5 0.5 0.0 0.5 0.5 0.0 0.4 0.5 0.0
unbound 64 0.5 0.5 0.0 0.5 0.5 0.0 0.5 0.5 0.0
unbound 80 2.0 1.8 -0.3 2.0 1.7 -0.3 2.0 1.8 -0.2
unbound 96 3.4 2.8 -0.6 3.4 2.8 -0.6 3.4 2.9 -0.5
unbound 112 2.5 2.3 -0.1 4.5 3.8 -0.8 0.5 0.9 0.5
unbound 128 0.4 0.5 0.0 0.4 0.4 0.0 0.5 0.5 0.1
node0 24 0.4 0.5 0.0
node0 32 0.5 0.5 0.0
node0 40 1.0 1.1 0.1
node0 48 1.5 1.6 0.1
node0 56 1.8 1.9 0.1
node0 64 0.4 0.4 0.0
node1 24 0.5 0.4 0.0
node1 32 0.5 0.4 0.0
node1 40 1.0 1.1 0.0
node1 48 1.6 1.6 0.1
node1 56 1.9 1.9 0.0
node1 64 0.4 0.4 -0.1


update_cfs_group profile%

all_nodes node0 node1
---------------- ---------------- ----------------
affinity nr_threads base test diff base test diff base test diff
unbound 48 0.3 0.5 0.2 0.3 0.5 0.2 0.3 0.5 0.2
unbound 64 0.5 0.6 0.1 0.5 0.6 0.1 0.5 0.6 0.1
unbound 80 2.8 2.5 -0.3 2.6 2.4 -0.2 2.9 2.5 -0.5
unbound 96 3.7 3.3 -0.4 3.5 3.3 -0.2 3.9 3.3 -0.6
unbound 112 4.2 3.2 -1.0 4.1 3.3 -0.7 4.4 3.1 -1.2
unbound 128 0.4 0.5 0.1 0.4 0.5 0.1 0.4 0.5 0.1
node0 24 0.3 0.5 0.2
node0 32 0.3 0.4 0.1
node0 40 0.7 0.8 0.1
node0 48 0.8 0.9 0.1
node0 56 0.8 0.9 0.1
node0 64 0.2 0.4 0.1
node1 24 0.3 0.5 0.2
node1 32 0.3 0.5 0.2
node1 40 0.8 0.9 0.1
node1 48 0.8 0.9 0.1
node1 56 0.9 0.9 0.1
node1 64 0.2 0.4 0.1


There doesn't seem to be much of a pattern in the per-node breakdown.
Sometimes there's a bit more overhead on the node remote to the task_group
allocation than the node local to it, like I'd expect, and sometimes it's the
opposite. Generally pretty even.

> > node0 48 0.7 0.6 -0.1 0.3 0.6 0.3
> > node0 64 0.7 0.7 -0.1 0.3 0.6 0.3
> > node0 80 1.4 1.3 -0.1 0.3 0.6 0.3
> > node0 96 1.5 1.4 -0.1 0.3 0.6 0.3
> > node0 112 0.8 0.8 0.0 0.2 0.4 0.2
> > node0 128 0.4 0.4 0.0 0.1 0.2 0.1
> > node1 48 0.7 0.6 -0.1 0.3 0.6 0.3
> > node1 64 0.7 0.6 -0.1 0.3 0.6 0.3
> > node1 80 1.4 1.2 -0.1 0.3 0.6 0.3
> > node1 96 1.4 1.3 -0.2 0.3 0.6 0.3
> > node1 112 0.8 0.7 -0.1 0.2 0.3 0.2
> > node1 128 0.4 0.4 0.0 0.1 0.2 0.1
>
> I can see why the cost of update_cfs_group() slightly increased since
> now there is no cross node access to tg->load_avg and the patched kernel
> doesn't provide any benefit but only incur some overhead due to indirect
> access to tg->load_avg, but why update_load_avg()'s cost dropped? I
> expect it to be roughly the same after patched or slightly increased.

Yeah, that's not immediately obvious, especially when the Intel machine doesn't
do this.

2023-05-04 10:41:05

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

Hi Daniel,

Thanks a lot for collecting these data.

I had hoped to also share some data I collected on other machines after
seeing your last email, but trying to explain why only SPR showed a benefit
has slowed me down. I now have some findings on this, please see below.

On Wed, May 03, 2023 at 03:41:25PM -0400, Daniel Jordan wrote:
> On Fri, Apr 21, 2023 at 11:05:59PM +0800, Aaron Lu wrote:
> > On Thu, Apr 20, 2023 at 04:52:01PM -0400, Daniel Jordan wrote:
> > > AMD EPYC 7J13 64-Core Processor (NPS1)
> > > 2 sockets * 64 cores * 2 threads = 256 CPUs
> > >
> > > update_load_avg profile% update_cfs_group profile%
> > > affinity nr_threads base test diff base test diff
> > > unbound 96 0.7 0.6 -0.1 0.3 0.6 0.4
> > > unbound 128 0.8 0.7 0.0 0.3 0.7 0.4
> > > unbound 160 2.4 1.7 -0.7 1.2 2.3 1.1
> > > unbound 192 2.3 1.7 -0.6 0.9 2.4 1.5
> > > unbound 224 0.9 0.9 0.0 0.3 0.6 0.3
> > > unbound 256 0.4 0.4 0.0 0.1 0.2 0.1
> >
> > Is it possible to show per-node profile for the two functions? I wonder
> > how the per-node profile changes with and without this patch on Milan.
> > And for vanilla kernel, it would be good to know on which node the struct
> > task_group is allocated. I used below script to fetch this info:
> > kretfunc:sched_create_group
> > {
> > $root = kaddr("root_task_group");
> > if (args->parent == $root) {
> > return;
> > }
> >
> > printf("cpu%d, node%d: tg=0x%lx, parent=%s\n", cpu, numaid,
> > retval, str(args->parent->css.cgroup->kn->name));
> > }
>
> That's helpful, nid below comes from this. The node happened to be different
> between base and test kernels on both machines, so that's one less way the
> experiment is controlled but for the unbound case where tasks are presumably
> spread fairly evenly I'm not sure how much it matters, especially given that
> the per-node profile numbers are fairly close to each other.
>
>
> Data below, same parameters and times as the last mail.
>
> > BTW, is the score(transactions) of the workload stable? If so, how the
> > score change when the patch is applied?
>
> Transactions seem to be mostly stable but unfortunately regress overall on both
> machines.

Yeah, I noticed your result is pretty stable in that the stddev% is
mostly zero. Mine are not that stable though. And it looks like there
are some wins in the node1 case :)

> FWIW, t-test compares the two sets of ten iterations apiece. The higher the
> percentage, the higher the confidence that the difference is significant.
>
>
> AMD EPYC 7J13 64-Core Processor (NPS1)
> 2 sockets * 64 cores * 2 threads = 256 CPUs
>
> transactions per second
>
> diff base test
> ----------------- ------------------ ------------------
> tps tps
> affinity nr_threads (%diff) (t-test) tps std% nid tps std% nid
> unbound 96 -0.8% 100% 128,450 0% 1 127,433 0% 0
> unbound 128 -1.0% 100% 138,471 0% 1 137,099 0% 0
> unbound 160 -1.2% 100% 136,829 0% 1 135,170 0% 0
> unbound 192 0.4% 95% 152,767 0% 1 153,336 0% 0
> unbound 224 -0.2% 81% 179,946 0% 1 179,620 0% 0
> unbound 256 -0.2% 71% 203,920 0% 1 203,583 0% 0
> node0 48 0.1% 46% 69,635 0% 0 69,719 0% 0
> node0 64 -0.1% 69% 75,213 0% 0 75,163 0% 0
> node0 80 -0.4% 100% 72,520 0% 0 72,217 0% 0
> node0 96 -0.2% 89% 81,345 0% 0 81,210 0% 0
> node0 112 -0.3% 98% 96,174 0% 0 95,855 0% 0
> node0 128 -0.7% 100% 111,813 0% 0 111,045 0% 0
> node1 48 0.3% 78% 69,985 1% 1 70,200 1% 1
> node1 64 0.6% 100% 75,770 0% 1 76,231 0% 1
> node1 80 0.3% 100% 73,329 0% 1 73,567 0% 1
> node1 96 0.4% 99% 82,222 0% 1 82,556 0% 1
> node1 112 0.1% 62% 96,573 0% 1 96,689 0% 1
> node1 128 -0.2% 69% 111,614 0% 1 111,435 0% 1
>
> update_load_avg profile%
>
> all_nodes node0 node1
> ---------------- ---------------- ----------------
> affinity nr_threads base test diff base test diff base test diff
> unbound 96 0.7 0.6 -0.1 0.7 0.6 -0.1 0.7 0.6 -0.1
> unbound 128 0.8 0.7 -0.1 0.8 0.7 -0.1 0.8 0.7 -0.1
> unbound 160 2.3 1.7 -0.7 2.5 1.7 -0.8 2.2 1.6 -0.5
> unbound 192 2.2 1.6 -0.6 2.5 1.8 -0.7 2.0 1.4 -0.6
> unbound 224 0.9 0.8 -0.1 1.1 0.7 -0.3 0.8 0.8 0.0
> unbound 256 0.4 0.4 0.0 0.4 0.4 0.0 0.4 0.4 0.0
> node0 48 0.7 0.6 -0.1
> node0 64 0.8 0.7 -0.2
> node0 80 2.0 1.4 -0.7
> node0 96 2.3 1.4 -0.9
> node0 112 1.0 0.8 -0.2
> node0 128 0.5 0.4 0.0
> node1 48 0.7 0.6 -0.1
> node1 64 0.8 0.6 -0.1
> node1 80 1.4 1.2 -0.2
> node1 96 1.5 1.3 -0.2
> node1 112 0.8 0.7 -0.1
> node1 128 0.4 0.4 -0.1
>
> update_cfs_group profile%
>
> all_nodes node0 node1
> ---------------- ---------------- ----------------
> affinity nr_threads base test diff base test diff base test diff
> unbound 96 0.3 0.6 0.3 0.3 0.6 0.3 0.3 0.6 0.3
> unbound 128 0.3 0.6 0.3 0.3 0.6 0.3 0.3 0.7 0.4
> unbound 160 1.1 2.5 1.4 1.3 2.2 0.9 0.9 2.8 1.9
> unbound 192 0.9 2.6 1.7 1.1 2.4 1.3 0.7 2.8 2.1
> unbound 224 0.3 0.8 0.5 0.4 0.6 0.3 0.2 0.9 0.6
> unbound 256 0.1 0.2 0.1 0.1 0.2 0.1 0.1 0.2 0.1
> node0 48 0.4 0.6 0.2
> node0 64 0.3 0.6 0.3
> node0 80 0.7 0.6 -0.1
> node0 96 0.6 0.6 0.0
> node0 112 0.3 0.4 0.1
> node0 128 0.1 0.2 0.1
> node1 48 0.3 0.6 0.3
> node1 64 0.3 0.6 0.3
> node1 80 0.3 0.6 0.3
> node1 96 0.3 0.6 0.3
> node1 112 0.2 0.3 0.2
> node1 128 0.1 0.2 0.1
>

update_load_avg()'s cost dropped while update_cfs_group()'s cost
increased. I think this is reasonable since the write side only has to
deal with local data now while the read side has to iterate the per-node
tg->load_avg on all nodes.

>
> Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
> 2 sockets * 32 cores * 2 thread = 128 CPUs
>
> transactions per second
>
> diff base test
> ----------------- ------------------ ------------------
> tps tps
> affinity nr_threads (%diff) (t-test) tps std% nid tps std% nid
> unbound 48 -0.9% 100% 75,500 0% 1 74,834 0% 0
> unbound 64 -0.4% 100% 81,687 0% 1 81,368 0% 0
> unbound 80 -0.4% 100% 78,620 0% 1 78,281 0% 0
> unbound 96 -0.5% 74% 78,949 1% 1 78,580 1% 0
> unbound 112 -2.9% 87% 94,189 3% 1 91,458 5% 0
> unbound 128 -1.4% 100% 117,557 0% 1 115,921 0% 0
> node0 24 -0.7% 100% 38,601 0% 0 38,333 0% 0
> node0 32 -1.2% 100% 41,539 0% 0 41,038 0% 0
> node0 40 -1.6% 100% 42,325 0% 0 41,662 0% 0
> node0 48 -1.3% 100% 41,956 0% 0 41,404 0% 0
> node0 56 -1.3% 100% 42,115 0% 0 41,569 0% 0
> node0 64 -1.0% 100% 62,431 0% 0 61,784 0% 0
> node1 24 0.0% 1% 38,752 0% 1 38,752 0% 1
> node1 32 0.9% 100% 42,568 0% 1 42,943 0% 1
> node1 40 -0.2% 87% 43,452 0% 1 43,358 0% 1
> node1 48 -0.5% 100% 43,047 0% 1 42,831 0% 1
> node1 56 -0.5% 100% 43,464 0% 1 43,259 0% 1
> node1 64 0.5% 100% 64,111 0% 1 64,450 0% 1

This looks like mostly a loss for Icelake.

I also tested on the same Icelake 8358 and my result is not entirely the
same as yours:

nr_thread=128
score(tps) update_cfs_group% update_load_avg%
6.2.0 97418±0.17% 0.50% - 0.74% 0.69% - 0.93%
this_patch 97029±0.32% 0.68% - 0.89% 0.70% - 0.89%

For the above nr_thread=128 unbound case, the score (tps) is in the noise
range, instead of a 1.4% loss as in your run. Profile wise, the write
side's cost slightly dropped while the read side's cost slightly
increased. Overall, no big change for nr_thread=128 on this Icelake.
I think this is also expected since in the nr_thread=128 case there are
very few migrations on wake up because cpu utilization is almost 100%, so
this patch shouldn't make an obvious difference.

nr_thread=96
score(tps) update_cfs_group% update_load_avg%
6.2.0 59183±0.21% 2.81% - 3.57% 3.48% - 3.76%
this_patch 58397±0.35% 2.70% - 3.01% 2.82% - 3.24%

For this case, there are enough task migrations on wakeup for this patch
to make a difference: tps dropped about 1.3%, worse than in your run.
Profile wise, both the write side and the read side dropped, but these
drops do not translate into performance gains. Judging from the profile,
this patch is doing something good; it's just that the tps suggests
otherwise.

On another 2S, 224 cpu Sapphire Rapids:

nr_thread=224
score update_cfs_group% update_load_avg%
6.2.0 93504±4.79% 11.63% - 15.12% 7.00% - 10.31%
this_patch 103040±0.46% 7.08% - 9.08% 4.82% - 6.73%

The above is where this patch helps the most: both profile and score show
improvement. My finding about why only SPR shows a benefit is that I think
this has something to do with SPR's "Ingress Queue overflow" when many
cpus access the same cache line; once that overflow happens, all the
accessing cpus will have their memory operations slowed down. This is
described in section 3.11 of Intel's optimization reference manual.

To confirm the above explanation, I did the nr_thread=96 run on SPR. To
make sure task migration still happens in this case, some cpus are
offlined and only 128 cpus are left. With fewer threads, the chance of
ingress queue overflow is much lower:

nr_thread=96 with cpu offlined to 128c left (32cores/64cpus on each socket)

score update_cfs_group% update_load_avg%
6.2.0 74878±0.58% 3.47% - 4.90% 3.59% - 4.42%
this_patch 75671±0.24% 2.66% - 3.55% 2.71% - 3.44%

Profile wise, the two functions dropped to near the Icelake level, still
higher than Icelake but much better than the nr_thread=224 case. When
comparing the base line with this patch, it follows what I saw on Icelake:
the two functions' cost dropped but that did not translate into a
performance increase (the average is slightly higher but it's in the noise
range).

Based on my current understanding, the summary is:
- Running this workload with nr_thread=224 on SPR, the ingress queue
will overflow and that will slow things down. This patch helps
performance mainly because it transforms the "many cpus accessing the
same cacheline" scenario into "many cpus accessing two cachelines" and
that can reduce the likelihood of ingress queue overflow and thus
helps performance;
- On Icelake with nr_threads high, but not so high as to cause 100% cpu
utilization, the two functions' cost will drop a little but performance
did not improve (it actually regressed a little);
- On SPR when there is no ingress queue overflow, it's similar to
Icelake: the two functions' cost will drop but performance did not
improve.

2023-05-16 07:56:02

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Thu, May 04, 2023 at 06:27:46PM +0800, Aaron Lu wrote:
> Base on my current understanding, the summary is:
> - Running this workload with nr_thread=224 on SPR, the ingress queue
> will overflow and that will slow things down. This patch helps
> performance mainly because it transform the "many cpus accessing the
> same cacheline" scenario to "many cpus accessing two cachelines" and
> that can reduce the likelyhood of ingress queue overflow and thus,
> helps performance;
> - On Icelake with high nr_threads but not too high that would cause
> 100% cpu utilization, the two functions' cost will drop a little but
> performance did not improve(it actually regressed a little);
> - On SPR when there is no ingress queue overflow, it's similar to
> Icelake: the two functions' cost will drop but performance did not
> improve.

More results when running hackbench and netperf on Sapphire Rapids as
well as on 2 sockets Icelake and 2 sockets Cascade Lake.

The summary is:
- on SPR, hackbench time reduced ~8% and netperf(UDP_RR/nr_thread=100%)
performance increased ~50%;
- on Icelake, performance regressed about 1%-2% for postgres_sysbench
and hackbench, netperf has no performance change;
- on Cascade Lake, netperf/UDP_RR/nr_thread=50% sees performance drop
~3%; others have no performance change.

Together with the results kindly collected by Daniel, it looks like
this patch helps most on SPR, while on other machines it is either flat
or regresses 1%-3% for some workloads. With these results, I'm thinking
of an alternative solution to reduce the cost of accessing tg->load_avg.

There are two main reasons to access tg->load_avg. One is driven by
pelt decay, which has a fixed frequency and is not a concern; the other
is enqueue_entity()/dequeue_entity() triggered by task migration. The
number of migrations can be unbounded, so the accesses to tg->load_avg
caused by them can be huge. This frequent task migration is the real
problem for tg->load_avg. One thing I noticed is: on task migration,
the load is carried from the old per-cpu cfs_rq to the new per-cpu
cfs_rq. While each cfs_rq's load_avg and tg_load_avg_contrib should
change accordingly to reflect this, so that its corresponding sched
entity can get a correct weight, the task group's load_avg should stay
unchanged. So instead of the src cfs_rq removing a delta from
tg->load_avg and the target cfs_rq then adding the same delta back, the
two updates to the tg's load_avg could be avoided. With this change,
the updates to tg->load_avg will be greatly reduced, the problem should
be solved, and it is likely to be a win for most machines/workloads.
Not sure if I understand this correctly? I'm going to pursue a solution
based on this; feel free to let me know if you see anything wrong here,
thanks.
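
A very rough sketch of that direction is below, with a hypothetical
"migration" hint that does not exist today; the delta/threshold
handling of the real update path is simplified away and this is only
meant to show the intent, not a tested patch:

static inline void update_tg_load_avg_sketch(struct cfs_rq *cfs_rq, bool migration)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	if (cfs_rq->tg == &root_task_group)
		return;

	if (migration) {
		/*
		 * A migration within the same task group leaves the
		 * group-wide sum unchanged: the source cfs_rq would
		 * subtract roughly what the destination cfs_rq adds
		 * back. Skip both atomics on the shared tg->load_avg;
		 * a later, decay-driven update folds in any residual
		 * delta.
		 */
		return;
	}

	atomic_long_add(delta, &cfs_rq->tg->load_avg);
	cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}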

Below are the test result details of the current patch.
=======================================================================
Details for SPR(2 sockets, 96cores, 192cpus):
- postgres_sysbench score increased 6.5%;
- hackbench(threads, pipe) time reduced to 41s from 45s (less is better);
- netperf(UDP_RR,nr_thread=100%=nr_cpu) throughput increased from 10105
to 15121.

postgres_sysbench:
nr_thread=192
             score          update_cfs_group%   update_load_avg%
6.2.0        92440±2.62%    8.11% - 13.48%      7.07% - 9.54%
this_patch   98425±0.62%    5.73% - 7.56%       4.47% - 5.96%
note: performance increased 6.5% and the two functions' cost also
dropped.

nr_thread=96 with cpu offlined to 128c (2 sockets/64cores)
             score          update_cfs_group%   update_load_avg%
6.2.0        75726±0.12%    3.56% - 4.49%       3.58% - 4.42%
this_patch   76736±0.17%    2.95% - 3.32%       2.80% - 3.29%
note: this test is mainly to see whether the performance increase is
due to ingress queue overflow or not, and the result suggests the
performance increase on SPR is mainly due to ingress queue overflow.

hackbench(threads, pipe, groups=10, fds=20, 400 tasks):
             time           update_cfs_group%   update_load_avg%
6.2.0        45.51±0.36%    12.68% - 20.22%     7.73% - 11.01%
this_patch   41.41±0.43%    7.73% - 13.15%      4.31% - 6.91%
note: there is a clear split between the profiles of node 0 and node 1 -
e.g. on v6.2.0, the cost of update_cfs_group()% is about 13% on node 0 and 20% on node 1;
on the patched kernel, the cost of update_cfs_group()% is about 8% on node 0 and 12% on node 1;
update_load_avg() is similar.

netperf(UDP_RR, nr_thread=100%=192):
             throughput     update_cfs_group%   update_load_avg%
6.2.0        10105±2.91%    26.43% - 27.90%     17.51% - 18.31%
this_patch   15121±3.25%    25.12% - 26.50%     12.47% - 16.02%
note: performance increased a lot, although the two functions' cost didn't
drop much.

=======================================================================
Details for Icelake (2sockets, 64cores, 128cpus)
- postgres_sysbench:
nr_thread=128 does not show any performance change;
nr_thread=96 performance regressed 1.3% after the patch, though the
two update functions' cost reduced a bit;
- hackbench(pipe/threads):
no obvious performance change after the patch; the two update
functions' cost reduced ~2% after the patch;
- netperf(UDP_RR/nr_thread=100%=nr_cpu):
results are within the noise range and very unstable on the vanilla
kernel; the two functions' cost reduced somewhat after the patch.

postgres_sysbench:
nr_thread=128
             score          update_cfs_group%   update_load_avg%
6.2.0        97418±0.17%    0.50% - 0.74%       0.69% - 0.93%
this_patch   97029±0.32%    0.68% - 0.89%       0.70% - 0.89%
note: score in noise

nr_thread=96
             score          update_cfs_group%   update_load_avg%
6.2.0        59183±0.21%    2.81% - 3.57%       3.48% - 3.76%
this_patch   58397±0.35%    2.70% - 3.01%       2.82% - 3.24%
note: score is 1.3% worse when patched.
The two update_XXX() functions' cost dropped but that does not
translate into a performance increase.

hackbench(pipe, threads):
             time           update_cfs_group%   update_load_avg%
6.2.0        41.80±0.65     5.90% - 7.36%       4.37% - 5.28%
this_patch   40.48±1.85     3.36% - 4.34%       2.89% - 3.35%
note: the two update_XXX() functions' cost dropped but that does not
translate into a performance increase.

netperf(UDP_RR, nr_thread=100%=128):
             throughput     update_cfs_group%   update_load_avg%
6.2.0        31146±26%      11% - 33%*          2.30% - 17.7%*
this_patch   24900±2%       14% - 18%           8.67% - 12.03%
note: performance is within noise;
update_cfs_group()% on vanilla can show a big difference between the
two nodes, and also a big difference across runs;
update_load_avg()%: for some runs, one node shows a very low cost like
2.x% while the other node has 10+%. This is probably because one node's
cpu utilization is approaching 100% and that inhibits task migrations;
for other runs, both nodes have 10+%.


=======================================================================
Details for Cascade Lake(2 sockets, 48cores, 96cpus):
- netperf (TCP_STREAM/UDP_RR):
- Most tests have no performance change;
- UDP_RR/nr_thread=50% sees a performance drop of about 3% on the patched kernel;
- UDP_RR/nr_thread=100%: results are unstable for both kernels.
- hackbench(pipe/threads):
- performance is within the noise range after the patch.

netperf/UDP_RR/nr_thread=100%=96
             Throughput     update_cfs_group%   update_load_avg%
v6.2.0       41593±8%       10.94%±20%          10.23%±27%
this_patch   38603±8        9.53%               8.66%
note: performance is within the noise range; profile-wise, the two
functions' cost became stable after the patch.

netperf/UDP_RR/nr_thread=50%=48
             Throughput     update_cfs_group%   update_load_avg%
v6.2.0       70489          0.59±8%             1.60
this_patch   68457 -2.9%    1.39                1.62
note: performance dropped ~3%; update_cfs_group()'s cost rises after
the patch.

netperf/TCP_STREAM/nr_thread=100%=96
             Throughput     update_cfs_group%   update_load_avg%
v6.2.0       12011          0.57%               2.45%
this_patch   11743          1.44%               2.30%
note: performance in noise range.

netperf/TCP_STREAM/nr_thread=50%=48
             Throughput     update_cfs_group%   update_load_avg%
v6.2.0       16409±12%      0.20±2%             0.54±2%
this_patch   19295          0.47±4%             0.54±2%
note: results are unstable for v6.2.0; performance is within the noise range.

hackbench/threads/pipe:
             Throughput     update_cfs_group%   update_load_avg%
v6.2.0       306321±12%     2.80±58%            3.07±38%
this_patch   322967±10%     3.60±36%            3.56±30%

2023-05-16 09:07:03

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On 2023-05-16 at 15:50:11 +0800, Aaron Lu wrote:
> On Thu, May 04, 2023 at 06:27:46PM +0800, Aaron Lu wrote:
> > Base on my current understanding, the summary is:
> > - Running this workload with nr_thread=224 on SPR, the ingress queue
> > will overflow and that will slow things down. This patch helps
> > performance mainly because it transform the "many cpus accessing the
> > same cacheline" scenario to "many cpus accessing two cachelines" and
> > that can reduce the likelyhood of ingress queue overflow and thus,
> > helps performance;
> > - On Icelake with high nr_threads but not too high that would cause
> > 100% cpu utilization, the two functions' cost will drop a little but
> > performance did not improve(it actually regressed a little);
> > - On SPR when there is no ingress queue overflow, it's similar to
> > Icelake: the two functions' cost will drop but performance did not
> > improve.
>
> More results when running hackbench and netperf on Sapphire Rapids as
> well as on 2 sockets Icelake and 2 sockets Cascade Lake.
>
> The summary is:
> - on SPR, hackbench time reduced ~8% and netperf(UDP_RR/nr_thread=100%)
> performance increased ~50%;
> - on Icelake, performance regressed about 1%-2% for postgres_sysbench
> and hackbench, netperf has no performance change;
> - on Cascade Lake, netperf/UDP_RR/nr_thread=50% sees performance drop
> ~3%; others have no performance change.
>
> Together with results kindly collected by Daniel, it looks this patch
> helps most for SPR while for other machines, it either is flat or
> regressed 1%-3% for some workloads. With these results, I'm thinking an
> alternative solution to reduce the cost of accessing tg->load_avg.
>
> There are two main reasons to access tg->load_avg. One is driven by
> pelt decay, which has a fixed frequency and is not a concern; the other
> is by enqueue_entity/dequeue_entity triggered by task migration. The
> number of migrations can be unbound so the access to tg->load_avg can
> be huge due to this. This frequent task migration is the problem for
> tg->load_avg. One thing I noticed is, on task migration, the load is
> carried from the old per-cpu cfs_rq to the new per-cpu cfs_rq. While
> the cfs_rq's load_avg and tg_load_avg_contrib should change accordingly
> to reflect this so that its corresponding sched entity can get a correct
> weight, the task group's load_avg should stay unchanged. So instead of
> removing a delta to tg->load_avg by src cfs_rq and then increasing the
> same delta to tg->load_avg by target cfs_rq, the two updates to tg's
> load_avg could be avoided. With this change, the update to tg->load_avg
> will be greatly reduced and the problem should be solved and it is
> likely to be a win for most machines/workloads. Not sure if I understand
> this correctly? I'm going to persue a solution based on this, feel free
> to let me know if you see anything wrong here, thanks.
Sounds good, but maybe I understand it incorrectly: if the task has
been dequeued for a long time and not enqueued yet, since we do not
update tg->load_avg, will it be out of date? Or do you mean the task
migration is part of a frequent sleep-wakeup sequence?

thanks,
Chenyu

2023-05-16 11:47:15

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Tue, May 16, 2023 at 04:57:52PM +0800, Chen Yu wrote:
> On 2023-05-16 at 15:50:11 +0800, Aaron Lu wrote:
> > On Thu, May 04, 2023 at 06:27:46PM +0800, Aaron Lu wrote:
> > > Base on my current understanding, the summary is:
> > > - Running this workload with nr_thread=224 on SPR, the ingress queue
> > > will overflow and that will slow things down. This patch helps
> > > performance mainly because it transform the "many cpus accessing the
> > > same cacheline" scenario to "many cpus accessing two cachelines" and
> > > that can reduce the likelyhood of ingress queue overflow and thus,
> > > helps performance;
> > > - On Icelake with high nr_threads but not too high that would cause
> > > 100% cpu utilization, the two functions' cost will drop a little but
> > > performance did not improve(it actually regressed a little);
> > > - On SPR when there is no ingress queue overflow, it's similar to
> > > Icelake: the two functions' cost will drop but performance did not
> > > improve.
> >
> > More results when running hackbench and netperf on Sapphire Rapids as
> > well as on 2 sockets Icelake and 2 sockets Cascade Lake.
> >
> > The summary is:
> > - on SPR, hackbench time reduced ~8% and netperf(UDP_RR/nr_thread=100%)
> > performance increased ~50%;
> > - on Icelake, performance regressed about 1%-2% for postgres_sysbench
> > and hackbench, netperf has no performance change;
> > - on Cascade Lake, netperf/UDP_RR/nr_thread=50% sees performance drop
> > ~3%; others have no performance change.
> >
> > Together with results kindly collected by Daniel, it looks this patch
> > helps most for SPR while for other machines, it either is flat or
> > regressed 1%-3% for some workloads. With these results, I'm thinking an
> > alternative solution to reduce the cost of accessing tg->load_avg.
> >
> > There are two main reasons to access tg->load_avg. One is driven by
> > pelt decay, which has a fixed frequency and is not a concern; the other
> > is by enqueue_entity/dequeue_entity triggered by task migration. The
> > number of migrations can be unbound so the access to tg->load_avg can
> > be huge due to this. This frequent task migration is the problem for
> > tg->load_avg. One thing I noticed is, on task migration, the load is
> > carried from the old per-cpu cfs_rq to the new per-cpu cfs_rq. While
> > the cfs_rq's load_avg and tg_load_avg_contrib should change accordingly
> > to reflect this so that its corresponding sched entity can get a correct
> > weight, the task group's load_avg should stay unchanged. So instead of
> > removing a delta to tg->load_avg by src cfs_rq and then increasing the
> > same delta to tg->load_avg by target cfs_rq, the two updates to tg's
> > load_avg could be avoided. With this change, the update to tg->load_avg
> > will be greatly reduced and the problem should be solved and it is
> > likely to be a win for most machines/workloads. Not sure if I understand
> > this correctly? I'm going to persue a solution based on this, feel free
> > to let me know if you see anything wrong here, thanks.
> Sound good, but maybe I understand it incorrectly, if the task has been dequeued
> for a long time, and not enqueued yet, since we do not update
> the tg->load_avg, will it be out-of-date? Or do you mean the task migration
> is a frequent sleep-wakeup sequence?

When a task is dequeued because it's blocked, its load will not be
subtracted from its cfs_rq. That part of the load on the cfs_rq will
decay and tg->load_avg will be updated when needed. Because decay
happens at a fixed frequency, that's not a concern.

When the task finally wakes and is assigned a new cpu, its load will
have to be removed from its original cfs_rq and added to its new
cfs_rq, and that may trigger two updates to tg->load_avg depending on
how large the task's load is and each cfs_rq's current load contrib to
the tg, etc. That is where I'm looking for some optimization: the
migration affects the corresponding cfs_rqs' load_avg but it shouldn't
affect tg->load_avg, so there is no need to subtract the task's
load_avg from tg->load_avg on the original cfs_rq and then add it back
on the new cfs_rq. But I suppose there are some details to sort out.
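
For context, the write in question lives in update_tg_load_avg(); in
v6.2 it looks roughly like the following (paraphrased, details may
differ slightly):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/* root_task_group's load_avg is never read, skip it. */
	if (cfs_rq->tg == &root_task_group)
		return;

	/*
	 * Only write the shared tg->load_avg when this cfs_rq's
	 * contribution changed by more than ~1/64 of what it last
	 * reported; this is why a migration may or may not trigger
	 * updates on the src and dst cfs_rq.
	 */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}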

Thanks,
Aaron