This function will help a thread decide if it wants to do work
that can be delayed, to accumulate more tasks for more efficient
batch processing later.
However, if no other tasks are running on the cpu, it can take
advantage of the available cpu cycles to complete the tasks
for immediate processing to minimize delay; otherwise it will yield.
Signed-off-by: Tim Chen <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7cb07fd..0884250 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -168,6 +168,7 @@ extern int nr_threads;
DECLARE_PER_CPU(unsigned long, process_counts);
extern int nr_processes(void);
extern unsigned long nr_running(void);
+extern unsigned long nr_running_cpu(int cpu);
extern unsigned long nr_iowait(void);
extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9cae286..d5bb8e6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2283,6 +2283,14 @@ unsigned long nr_running(void)
return sum;
}
+unsigned long nr_running_cpu(int cpu)
+{
+	if (cpumask_test_cpu(cpu, cpu_online_mask))
+		return cpu_rq(cpu)->nr_running;
+	else
+		return 0;
+}
+
unsigned long long nr_context_switches(void)
{
int i;
--
1.7.11.7
On 12.07.2014 00:33, Tim Chen wrote:
> This function will help a thread decide if it wants to do work
> that can be delayed, to accumulate more tasks for more efficient
> batch processing later.
>
> However, if no other tasks are running on the cpu, it can take
> advantage of the available cpu cycles to complete the tasks
> for immediate processing to minimize delay; otherwise it will yield.
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 8 ++++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7cb07fd..0884250 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -168,6 +168,7 @@ extern int nr_threads;
> DECLARE_PER_CPU(unsigned long, process_counts);
> extern int nr_processes(void);
> extern unsigned long nr_running(void);
> +extern unsigned long nr_running_cpu(int cpu);
> extern unsigned long nr_iowait(void);
> extern unsigned long nr_iowait_cpu(int cpu);
> extern unsigned long this_cpu_load(void);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9cae286..d5bb8e6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2283,6 +2283,14 @@ unsigned long nr_running(void)
> return sum;
> }
>
> +unsigned long nr_running_cpu(int cpu)
> +{
> +	if (cpumask_test_cpu(cpu, cpu_online_mask))
> +		return cpu_rq(cpu)->nr_running;
> +	else
> +		return 0;
> +}
> +
An offline cpu should have nr_running equal to 0. We park the last
enqueued thread (migration_thread) at the end of take_cpu_down().
So, it's enough to return cpu_rq(cpu)->nr_running.
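I.e., the helper could simply be:

unsigned long nr_running_cpu(int cpu)
{
	return cpu_rq(cpu)->nr_running;
}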
> unsigned long long nr_context_switches(void)
> {
> int i;
>
Thanks,
Kirill
On 07/11/2014 01:33 PM, Tim Chen wrote:
> +unsigned long nr_running_cpu(int cpu)
> +{
> +	if (cpumask_test_cpu(cpu, cpu_online_mask))
> +		return cpu_rq(cpu)->nr_running;
> +	else
> +		return 0;
> +}
> +
EXPORT_SYMBOL?
On Fri, Jul 11, 2014 at 01:33:04PM -0700, Tim Chen wrote:
> This function will help a thread decide if it wants to do work
> that can be delayed, to accumulate more tasks for more efficient
> batch processing later.
>
> However, if no other tasks are running on the cpu, it can take
> advantage of the available cpu cycles to complete the tasks
> for immediate processing to minimize delay; otherwise it will yield.
Ugh.. and ignore topology and everything else.
Yet another scheduler on top of the scheduler.
We have the padata muck, also only ever used by crypto.
We have the workqueue nonsense, used all over the place
And we have btrfs doing their own padata like muck.
And I'm sure there's at least one more out there, just because.
Why do we want yet another thing?
I'm inclined to go NAK and get people to reduce the amount of async
queueing and processing crap.
On Mon, 2014-07-14 at 12:16 +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 01:33:04PM -0700, Tim Chen wrote:
> > This function will help a thread decide if it wants to do work
> > that can be delayed, to accumulate more tasks for more efficient
> > batch processing later.
> >
> > However, if no other tasks are running on the cpu, it can take
> > advantage of the available cpu cycles to complete the tasks
> > for immediate processing to minimize delay; otherwise it will yield.
>
> Ugh.. and ignore topology and everything else.
>
> Yet another scheduler on top of the scheduler.
>
> We have the padata muck, also only ever used by crypto.
> We have the workqueue nonsense, used all over the place
> And we have btrfs doing their own padata like muck.
> And I'm sure there's at least one more out there, just because.
>
> Why do we want yet another thing?
>
> I'm inclined to go NAK and get people to reduce the amount of async
> queueing and processing crap.
The multi-buffer class of crypto algorithms is by nature
asynchronous. The algorithm gathers several crypto jobs and
puts the buffer from each job in a data lane of a SIMD register.
This allows for parallel processing and increases throughput.
The gathering of the crypto jobs is an async process and
queuing is necessary for this class of algorithm.
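As a rough sketch of the data flow (illustrative only -- the structure
and the mb_simd_compute() entry point below are made-up names, not the
actual patchset code):

#define MB_LANES 8	/* e.g. one job buffer per data lane of a wide SIMD register */

struct mb_job {
	const unsigned char *data;
	unsigned int len;
};

/* Hypothetical hand-written assembly entry point that hashes up to
 * MB_LANES buffers in parallel, one buffer per SIMD data lane. */
void mb_simd_compute(struct mb_job **lanes, int n);

/* Gather up to MB_LANES queued jobs and process them in one SIMD pass;
 * returns the number of jobs consumed.  Partially filled lanes work,
 * but waste throughput -- which is why accumulating more jobs helps. */
static int mb_process_batch(struct mb_job **queue, int queued)
{
	struct mb_job *lanes[MB_LANES];
	int i, n = queued < MB_LANES ? queued : MB_LANES;

	for (i = 0; i < n; i++)
		lanes[i] = queue[i];
	mb_simd_compute(lanes, n);
	return n;
}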
Tim
On Mon, Jul 14, 2014 at 09:10:14AM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 12:16 +0200, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 01:33:04PM -0700, Tim Chen wrote:
> > > This function will help a thread decide if it wants to do work
> > > that can be delayed, to accumulate more tasks for more efficient
> > > batch processing later.
> > >
> > > However, if no other tasks are running on the cpu, it can take
> > > advantage of the available cpu cycles to complete the tasks
> > > for immediate processing to minimize delay; otherwise it will yield.
> >
> > Ugh.. and ignore topology and everything else.
> >
> > Yet another scheduler on top of the scheduler.
> >
> > We have the padata muck, also only ever used by crypto.
> > We have the workqueue nonsense, used all over the place
> > And we have btrfs doing their own padata like muck.
> > And I'm sure there's at least one more out there, just because.
> >
> > Why do we want yet another thing?
> >
> > I'm inclined to go NAK and get people to reduce the amount of async
> > queueing and processing crap.
>
> The multi-buffer class of crypto algorithms is by nature
> asynchronous. The algorithm gathers several crypto jobs and
> puts the buffer from each job in a data lane of a SIMD register.
> This allows for parallel processing and increases throughput.
> The gathering of the crypto jobs is an async process and
> queuing is necessary for this class of algorithm.
How is that related to me saying we've got too much of this crap
already?
On Mon, 2014-07-14 at 18:14 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 09:10:14AM -0700, Tim Chen wrote:
> > On Mon, 2014-07-14 at 12:16 +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 11, 2014 at 01:33:04PM -0700, Tim Chen wrote:
> > > > This function will help a thread decide if it wants to do work
> > > > that can be delayed, to accumulate more tasks for more efficient
> > > > batch processing later.
> > > >
> > > > However, if no other tasks are running on the cpu, it can take
> > > > advantage of the available cpu cycles to complete the tasks
> > > > for immediate processing to minimize delay; otherwise it will yield.
> > >
> > > Ugh.. and ignore topology and everything else.
> > >
> > > Yet another scheduler on top of the scheduler.
> > >
> > > We have the padata muck, also only ever used by crypto.
> > > We have the workqueue nonsense, used all over the place
> > > And we have btrfs doing their own padata like muck.
> > > And I'm sure there's at least one more out there, just because.
> > >
> > > Why do we want yet another thing?
> > >
> > > I'm inclined to go NAK and get people to reduce the amount of async
> > > queueing and processing crap.
> >
> > The multi-buffer class of crypto algorithms is by nature
> > asynchronous. The algorithm gathers several crypto jobs and
> > puts the buffer from each job in a data lane of a SIMD register.
> > This allows for parallel processing and increases throughput.
> > The gathering of the crypto jobs is an async process and
> > queuing is necessary for this class of algorithm.
>
> How is that related to me saying we've got too much of this crap
> already?
I was trying to explain why the algorithm is implemented this way
because of its batching nature.
There is a whole class of async algorithms that can provide
substantial speedup by doing batch processing and using workqueues.
The multi-buffer sha1 version has a 2.2x speedup over the existing
AVX2 version, and can have even more speedup when AVX3
comes round. Workqueue is a natural way to implement
this. I don't think a throughput speedup of 2.2x is "crap".
We are not inventing anything new, but are asking for a
very simple helper function to know if there's something else
running on our cpu, to help us make a better decision
about whether we should flush the batched jobs immediately.
Also, the asynchronous crypto interface is already used substantially
in crypto and has a well-established infrastructure.
Thanks.
Tim
On Sat, 2014-07-12 at 13:25 +0400, Kirill Tkhai wrote:
> >
> > +unsigned long nr_running_cpu(int cpu)
> > +{
> > +	if (cpumask_test_cpu(cpu, cpu_online_mask))
> > +		return cpu_rq(cpu)->nr_running;
> > +	else
> > +		return 0;
> > +}
> > +
>
> An offline cpu should have nr_running equal to 0. We park the last
> enqueued thread (migration_thread) at the end of take_cpu_down().
>
> So, it's enough to return cpu_rq(cpu)->nr_running.
Thanks. This seems reasonable.
Tim
On Mon, Jul 14, 2014 at 10:05:34AM -0700, Tim Chen wrote:
> I was trying to explain why the algorithm is implemented this way
> because of its batching nature.
>
> There is a whole class of async algorithms that can provide
> substantial speedup by doing batch processing and using workqueues.
> The multi-buffer sha1 version has a 2.2x speedup over the existing
> AVX2 version, and can have even more speedup when AVX3
> comes round. Workqueue is a natural way to implement
> this. I don't think a throughput speedup of 2.2x is "crap".
>
> We are not inventing anything new, but are asking for a
> very simple helper function to know if there's something else
> running on our cpu, to help us make a better decision
> about whether we should flush the batched jobs immediately.
>
> Also, the asynchronous crypto interface is already used substantially
> in crypto and has a well-established infrastructure.
The crap I was talking about is that there's a metric ton of 'async'
interfaces, all different.
Your multi-buffer thing isn't generic either, it seems limited to sha1.
It does not reuse padata, it does not extend workqueues, it does not
remove the btrfs nonsense, it adds yet another thing.
On Mon, 2014-07-14 at 20:17 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 10:05:34AM -0700, Tim Chen wrote:
> > I was trying to explain why the algorithm is implemented this way
> > because of its batching nature.
> >
> > There is a whole class of async algorithms that can provide
> > substantial speedup by doing batch processing and using workqueues.
> > The multi-buffer sha1 version has a 2.2x speedup over the existing
> > AVX2 version, and can have even more speedup when AVX3
> > comes round. Workqueue is a natural way to implement
> > this. I don't think a throughput speedup of 2.2x is "crap".
> >
> > We are not inventing anything new, but are asking for a
> > very simple helper function to know if there's something else
> > running on our cpu, to help us make a better decision
> > about whether we should flush the batched jobs immediately.
> >
> > Also, the asynchronous crypto interface is already used substantially
> > in crypto and has a well-established infrastructure.
>
> The crap I was talking about is that there's a metric ton of 'async'
> interfaces, all different.
Async interfaces, when used appropriately, actually speed things up
substantially for crypto. We actually have a case where
ecryptfs, by not using the async crypto interface, causes the cpu to stall
and slows things down substantially with AES-NI. And the async interface
with a workqueue speeds things up (30% to 35% on encryption with an SSD).
http://marc.info/?l=ecryptfs-users&m=136520541407248
http://www.spinics.net/lists/ecryptfs/msg00228.html
>
> Your multi-buffer thing isn't generic either, it seems limited to sha1.
We actually have many other multi-buffer crypto algorithms already
published for encryption and other IPSec usages, so
multi-buffer algorithms are not just limited to SHA1.
We hope to port those to the kernel crypto library eventually.
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
> It does not reuse padata,
padata tries to speed things up by parallelizing jobs across *multiple*
cpus, whereas multi-buffer tries to speed things up by using multiple
data lanes of a SIMD register on a *single* cpu.
These two usages are complementary but not the same.
> it does not extend workqueues,
Why do I need to extend workqueues if the existing ones already
meet my needs?
> it does not
> remove the btrfs nonsense,
Not much I can do about btrfs as I don't understand the issues there.
> it adds yet another thing.
Thanks.
Tim
On Mon, Jul 14, 2014 at 12:08:28PM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 20:17 +0200, Peter Zijlstra wrote:
> > Your multi-buffer thing isn't generic either, it seems limited to sha1.
>
> We actually have many other multi-buffer crypto algorithms already
> published for encryption and other IPSec usages, so
> multi-buffer algorithms are not just limited to SHA1.
> We hope to port those to the kernel crypto library eventually.
> http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
That's all nice and such; but the code as I've seen in these patches is
very much sha1 specific. The mb part isn't separated out.
> > It does not reuse padata,
> padata tries to speed things up by parallelizing jobs across *multiple*
> cpus, whereas multi-buffer tries to speed things up by using multiple
> data lanes of a SIMD register on a *single* cpu.
> These two usages are complementary but not the same.
And if it's single cpu, wth do you need that nr_running thing for another
cpu for?
Also, this difference wasn't clear to me.
I still loathe all the async work, because it makes a mockery of
accounting etc.. but that's a story for another day I suppose :-(
On Mon, 2014-07-14 at 21:15 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 12:08:28PM -0700, Tim Chen wrote:
> > On Mon, 2014-07-14 at 20:17 +0200, Peter Zijlstra wrote:
>
> > > Your multi-buffer thing isn't generic either, it seems limited to sha1.
> >
> > We actually have many other multi-buffer crypto algorithms already
> > published for encryption and other IPSec usages, so
> > multi-buffer algorithms are not just limited to SHA1.
> > We hope to port those to the kernel crypto library eventually.
> > http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
>
> That's all nice and such; but the code as I've seen in these patches is
> very much sha1 specific. The mb part isn't separated out.
There is a generic multi-buffer infrastructure portion that manages
pulling and queuing jobs on the crypto workqueue, and it is separated out
in patch 1 of the patchset. The other portions are algorithm specific;
they define algorithm-specific data structures and do the crypto
computation for a particular algorithm, mostly in assembly and C glue
code. The infrastructure code is meant to be reused for other similar
multi-buffer algorithms.
>
> > > It does not reuse padata,
> > padata tries to speed things up by parallelizing jobs across *multiple*
> > cpus, whereas multi-buffer tries to speed things up by using multiple
> > data lanes of a SIMD register on a *single* cpu.
> > These two usages are complementary but not the same.
>
> > And if it's single cpu, wth do you need that nr_running thing for another
> cpu for?
We use nr_running_cpu to check whether there are other tasks running on
the *current* cpu (not another cpu), to decide if we should flush
and compute the accumulated crypto jobs. If there's nobody else running,
we can take advantage of available cpu cycles on the cpu we are running
on to do computation on the existing jobs in a SIMD manner.
Waiting a bit longer may accumulate more jobs to process in parallel
in a single SIMD instruction, but will incur more delay.
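Roughly, the check amounts to this (sketch only; mb_flush_jobs() is a
made-up placeholder for the actual flush routine):

	/*
	 * In the crypto job handler, on the current cpu: if we are the
	 * only runnable task, use the otherwise-idle cycles to flush the
	 * partially filled SIMD lanes now; otherwise yield and keep
	 * batching.
	 */
	if (nr_running_cpu(smp_processor_id()) == 1)
		mb_flush_jobs();	/* compute accumulated jobs now */
	else
		cond_resched();		/* let others run, keep batching */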
>
> Also, this difference wasn't clear to me.
>
> I still loathe all the async work, because it makes a mockery of
> accounting etc.. but that's a story for another day I suppose :-(
>
>
Thanks.
Tim
On Sat, 2014-07-12 at 07:21 -0700, Tadeusz Struk wrote:
> On 07/11/2014 01:33 PM, Tim Chen wrote:
> > +unsigned long nr_running_cpu(int cpu)
> > +{
> > +	if (cpumask_test_cpu(cpu, cpu_online_mask))
> > +		return cpu_rq(cpu)->nr_running;
> > +	else
> > +		return 0;
> > +}
> > +
> EXPORT_SYMBOL?
Yes, thanks.
Tim
On Mon, Jul 14, 2014 at 12:50:50PM -0700, Tim Chen wrote:
> There is a generic multi-buffer infrastructure portion that manages
> pulling and queuing jobs on the crypto workqueue, and it is separated out
> in patch 1 of the patchset.
There's one very weird multi-line comment in that patch.
> The other portions are algorithm specific; they define
> algorithm-specific data structures and do the crypto computation
> for a particular algorithm, mostly in assembly and C glue code.
> The infrastructure code is meant to be reused for other similar
> multi-buffer algorithms.
The flushing part that uses the sched thing is sha1 specific, even
though it strikes me as not being so. Flushing buffers on idle seems
like a 'generic' thing.
> We use nr_running_cpu to check whether there are other tasks running on
> the *current* cpu (not another cpu),
And yet, the function allows you to do exactly that..
> to decide if we should flush
> and compute the accumulated crypto jobs. If there's nobody else running,
> we can take advantage of available cpu cycles on the cpu we are running
> on to do computation on the existing jobs in a SIMD manner.
> Waiting a bit longer may accumulate more jobs to process in parallel
> in a single SIMD instruction, but will incur more delay.
So you already have an idle notifier (which is x86 only, we should fix
that I suppose), and you then double check there really isn't anything
else running.
How much, if anything, does that second check buy you? There's just not
a single word on that.
Also, there is not a word on the latency vs throughput tradeoff you
make. I can imagine that for very short idle durations you lose, not
win with this thing.
So for now I still see no reason for doing this.
Also, I wonder about SMT: the point of this is to make best use of the
SIMD pipelines, so does it still make sense to use siblings at the same
time even though you're running hand-crafted ASM to stuff the pipelines
to the brim? Should this thing be SMT-aware and not gather queues for
both siblings?
On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> So you already have an idle notifier (which is x86 only, we should fix
> that I suppose), and you then double check there really isn't anything
> else running.
Note that we've already paid a large part of the expense of going idle
by the time we call that idle notifier -- specifically, we've
reprogrammed the clock to stop the tick.
It's really wasteful to then generate work again, which means we have to
again reprogram the clock etc.
On Tue, 15 Jul 2014, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> > So you already have an idle notifier (which is x86 only, we should fix
> > that I suppose), and you then double check there really isn't anything
> > else running.
>
> Note that we've already paid a large part of the expense of going idle
> by the time we call that idle notifier -- specifically, we've
> reprogrammed the clock to stop the tick.
>
> It's really wasteful to then generate work again, which means we have to
> again reprogram the clock etc.
Doing anything which is not related to idle itself in the idle
notifier is just plain wrong.
If that stuff wants to utilize idle slots, we really need to come up
with a generic and general solution. Otherwise we'll grow those warts
all over the architecture space, with slightly different ways of
wrecking the world, and then some.
This whole attitude of people thinking that they need their own
specialized scheduling around the real scheduler is a PITA. All this
stuff is just damaging any sensible approach to power saving, load
balancing, etc.
What we really want is infrastructure, which allows the scheduler to
actively query the async work situation and based on the results
actively decide when to process it and where.
Thanks,
tglx
On Mon, Jul 14, 2014 at 09:15:04PM +0200, Peter Zijlstra wrote:
> I still loathe all the async work, because it makes a mockery of
> accounting etc.. but that's a story for another day I suppose :-(
So, just to expand on this, we're already getting 'bug' reports because
worker threads are not cgroup aware. If work gets generated inside some
cgroup, the worker doesn't care and runs the worker thread wherever
(typically the root cgroup).
This means that the 'work' escapes the cgroup confines and creates
resource inversion etc. The same is of course true for nice and RT
priorities.
TJ, are you aware of this and/or given it any thought?
On Tue, 2014-07-15 at 14:59 +0200, Thomas Gleixner wrote:
> On Tue, 15 Jul 2014, Peter Zijlstra wrote:
>
> > On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> > > So you already have an idle notifier (which is x86 only, we should fix
> > > that I suppose), and you then double check there really isn't anything
> > > else running.
> >
> > Note that we've already paid a large part of the expense of going idle
> > by the time we call that idle notifier -- specifically, we've
> > reprogrammed the clock to stop the tick.
> >
> > It's really wasteful to then generate work again, which means we have to
> > again reprogram the clock etc.
>
> Doing anything which is not related to idle itself in the idle
> notifier is just plain wrong.
>
> If that stuff wants to utilize idle slots, we really need to come up
> with a generic and general solution. Otherwise we'll grow those warts
> all over the architecture space, with slightly different ways of
> wrecking the world, and then some.
>
> This whole attitude of people thinking that they need their own
> specialized scheduling around the real scheduler is a PITA. All this
> stuff is just damaging any sensible approach to power saving, load
> balancing, etc.
Not to mention that we're already too rotund...
pipe-test scheduling cross core, ie ~0 work, ~pure full fastpath. All
kernels with same (obese) distro config, with drivers reduced to what my
boxen need. Squint a little, there is some jitter. These kernels are
all adjusted to eliminate various regressions that would otherwise skew
results up to and including _very_ badly. See "virgin", the numbers are
much more useful without that particular skew methinks :)
3.0.101-default 3.753363 usecs/loop -- avg 3.770737 530.4 KHz 1.000
3.1.10-default 3.723843 usecs/loop -- avg 3.716058 538.2 KHz 1.014
3.2.51-default 3.728060 usecs/loop -- avg 3.710372 539.0 KHz 1.016
3.3.8-default 3.906174 usecs/loop -- avg 3.900399 512.8 KHz .966
3.4.97-default 3.864158 usecs/loop -- avg 3.865281 517.4 KHz .975
3.5.7-default 3.967481 usecs/loop -- avg 3.962757 504.7 KHz .951
3.6.11-default 3.851186 usecs/loop -- avg 3.845321 520.1 KHz .980
3.7.10-default 3.777869 usecs/loop -- avg 3.776913 529.5 KHz .998
3.8.13-default 4.049927 usecs/loop -- avg 4.041905 494.8 KHz .932
3.9.11-default 3.973046 usecs/loop -- avg 3.974208 503.2 KHz .948
3.10.27-default 4.189598 usecs/loop -- avg 4.189298 477.4 KHz .900
3.11.10-default 4.293870 usecs/loop -- avg 4.297979 465.3 KHz .877
3.12.24-default 4.321570 usecs/loop -- avg 4.321961 462.8 KHz .872
3.13.11-default 4.137845 usecs/loop -- avg 4.134863 483.7 KHz .911
3.14.10-default 4.145348 usecs/loop -- avg 4.139987 483.1 KHz .910 1.000
3.15.4-default 4.355594 usecs/loop -- avg 4.351961 459.6 KHz .866 .951 1.000
3.16.0-default 4.537279 usecs/loop -- avg 4.543532 440.2 KHz .829 .911 .957
3.16.0-virgin 6.377331 usecs/loop -- avg 6.352794 314.8 KHz 0.sob
my local config, group sched, namespaces etc disabled
3.0.101-smp 3.692377 usecs/loop -- avg 3.690774 541.9 KHz 1.000
3.1.10-smp 3.573832 usecs/loop -- avg 3.563269 561.3 KHz 1.035
3.2.51-smp 3.632690 usecs/loop -- avg 3.628220 551.2 KHz 1.017
3.3.8-smp 3.801838 usecs/loop -- avg 3.803441 525.8 KHz .970
3.4.97-smp 3.836087 usecs/loop -- avg 3.843501 520.4 KHz .960
3.5.7-smp 3.646927 usecs/loop -- avg 3.646288 548.5 KHz 1.012
3.6.11-smp 3.674402 usecs/loop -- avg 3.680929 543.3 KHz 1.002
3.7.10-smp 3.644274 usecs/loop -- avg 3.644566 548.8 KHz 1.012
3.8.13-smp 3.678164 usecs/loop -- avg 3.675524 544.1 KHz 1.004
3.9.11-smp 3.834943 usecs/loop -- avg 3.845852 520.0 KHz .959
3.10.27-smp 3.651881 usecs/loop -- avg 3.634515 550.3 KHz 1.015
3.11.10-smp 3.716159 usecs/loop -- avg 3.720603 537.5 KHz .991
3.12.24-smp 3.862634 usecs/loop -- avg 3.872252 516.5 KHz .953
3.13.11-smp 3.803254 usecs/loop -- avg 3.802553 526.0 KHz .970
3.14.10-smp 4.010009 usecs/loop -- avg 4.009019 498.9 KHz .920
3.15.4-smp 3.882398 usecs/loop -- avg 3.884095 514.9 KHz .950
3.16.0-master 4.061003 usecs/loop -- avg 4.058244 492.8 KHz .909
echo 0 > sched_wakeup_granularity_ns, taskset -c 3 pipe-test 1 (shortest path)
3.0.101-default 3.352267 usecs/loop -- avg 3.352434 596.6 KHz 1.000
3.16.0-default 3.596559 usecs/loop -- avg 3.594023 556.5 KHz .932
3.0.101-smp 3.089251 usecs/loop -- avg 3.089556 647.3 KHz 1.000
3.16.0-master 3.254721 usecs/loop -- avg 3.251534 615.1 KHz .950
sched+idle is becoming more of a not-so-fastpath. Pure sched is not as
bad, but still, we're getting fat.
netperf TCP_RR trans/sec (unbound)
3.0.101-default 91360.56 1.000
3.16.0-default 72523.30 .793
3.0.101-smp 92166.23 1.000
3.16.0-master 81235.30 .881
echo 0 > sched_wakeup_granularity_ns, bound to cpu3
3.0.101-smp 94289.95 1.000
3.16.0-master 81219.02 .861
Leanest meanest kernel ever to run on this box (2.6.22 + cfs-2.6.25 etc)
did that bound TCP_RR at ~114k IIRC. My userspace became too new to
boot that kernel without a squabble, but I think I recall correctly.
-Mike
On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
>
> 3.0.101-default 3.753363 usecs/loop -- avg 3.770737 530.4 KHz 1.000
> 3.14.10-default 4.145348 usecs/loop -- avg 4.139987 483.1 KHz .910 1.000
> 3.15.4-default 4.355594 usecs/loop -- avg 4.351961 459.6 KHz .866 .951 1.000
> 3.16.0-default 4.537279 usecs/loop -- avg 4.543532 440.2 KHz .829 .911 .957
>
> 3.0.101-smp 3.692377 usecs/loop -- avg 3.690774 541.9 KHz 1.000
> 3.14.10-smp 4.010009 usecs/loop -- avg 4.009019 498.9 KHz .920
> 3.15.4-smp 3.882398 usecs/loop -- avg 3.884095 514.9 KHz .950
> 3.16.0-master 4.061003 usecs/loop -- avg 4.058244 492.8 KHz .909
Urgh,.. I need to go fix that :/
On Tue, Jul 15, 2014 at 03:36:27PM +0200, Peter Zijlstra wrote:
> So, just to expand on this, we're already getting 'bug' reports because
> worker threads are not cgroup aware. If work gets generated inside some
> cgroup, the worker doesn't care and runs the worker thread wherever
> (typically the root cgroup).
>
> This means that the 'work' escapes the cgroup confines and creates
> resource inversion etc. The same is of course true for nice and RT
> priorities.
>
> TJ, are you aware of this and/or given it any thought?
Yeap, I'm aware of the issue but haven't read any actual bug reports
yet. Can you point me to the reports?
Given that worker pool management is dynamic, spawning separate pools
for individual cgroups on-demand should be doable. Haven't been able
to decide how much we should be willing to pay in terms of complexity
yet.
Thanks.
--
tejun
On Tue, Jul 15, 2014 at 11:21:49AM -0400, Tejun Heo wrote:
> On Tue, Jul 15, 2014 at 03:36:27PM +0200, Peter Zijlstra wrote:
> > So, just to expand on this, we're already getting 'bug' reports because
> > worker threads are not cgroup aware. If work gets generated inside some
> > cgroup, the worker doesn't care and runs the worker thread wherever
> > (typically the root cgroup).
> >
> > This means that the 'work' escapes the cgroup confines and creates
> > resource inversion etc. The same is of course true for nice and RT
> > priorities.
> >
> > TJ, are you aware of this and/or given it any thought?
>
> Yeap, I'm aware of the issue but haven't read any actual bug reports
> yet. Can you point me to the reports?
lkml.kernel.org/r/[email protected]
The root level workqueue thingies disturb the cgroup level scheduling to
'some' extent.
That whole thread is somewhat confusing and I think there's more than
just this going on, but they're really seeing this as a pain point.
> Given that worker pool management is dynamic, spawning separate pools
> for individual cgroups on-demand should be doable. Haven't been able
> to decide how much we should be willing to pay in terms of complexity
> yet.
Yah, I figured. Back before you ripped up the workqueue I had a
worklet-PI patch in -rt, which basically sorted and ran works in a
RR/FIFO priority order, including boosting the current work when a
higher prio one was pending etc.
I never really figured out a way to make the new concurrent stuff do
something like that, and this 'problem' here is harder still, because
they're not static prios etc.
Ideally we'd run the works _in_ the same task-context (from a scheduler
POV) as the task creating the work. There are some very obvious
implementation problems there, and some less obvious others, so bleh.
Also, there's the whole softirq trainwreck, which has many of the same
problems. Much of the network stack isn't necessarily aware of whom
it's doing work for, so there's no way to propagate.
Case in point: the crypto stuff, I suppose; that's a combination of
the two. God only knows who we should be accounting it to and in what
context things should run.
Ideally a socket has a 'single' (ha! if only) owner, and we'd know
throughout the entire rx/tx paths, but I doubt we actually have that.
(Note that there are people really suffering because of this..)
Same for the 'shiny' block-mq stuff I suppose :-(
On Tue, 2014-07-15 at 16:53 +0200, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
> >
> > 3.0.101-default 3.753363 usecs/loop -- avg 3.770737 530.4 KHz 1.000
> > 3.14.10-default 4.145348 usecs/loop -- avg 4.139987 483.1 KHz .910 1.000
> > 3.15.4-default 4.355594 usecs/loop -- avg 4.351961 459.6 KHz .866 .951 1.000
> > 3.16.0-default 4.537279 usecs/loop -- avg 4.543532 440.2 KHz .829 .911 .957
> >
> > 3.0.101-smp 3.692377 usecs/loop -- avg 3.690774 541.9 KHz 1.000
> > 3.14.10-smp 4.010009 usecs/loop -- avg 4.009019 498.9 KHz .920
> > 3.15.4-smp 3.882398 usecs/loop -- avg 3.884095 514.9 KHz .950
> > 3.16.0-master 4.061003 usecs/loop -- avg 4.058244 492.8 KHz .909
>
> Urgh,.. I need to go fix that :/
I'm poking about. It's not just one thing 'course, just lots of change
adding up to less than wonderful. Idle changes are costing some, for
obese config, avg goop. The select_next_task() reorganization appears
to be costing, but eyeballing, I can see no excuse for that at all.
-Mike
On Tue, 2014-07-15 at 11:50 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 12:50:50PM -0700, Tim Chen wrote:
>
> > There is a generic multi-buffer infrastructure portion that manages
> > pulling and queuing jobs on the crypto workqueue, and it is separated out
> > in patch 1 of the patchset.
>
> There's one very weird multi-line comment in that patch.
>
> > The other portions are algorithm specific; they define
> > algorithm-specific data structures and do the crypto computation
> > for a particular algorithm, mostly in assembly and C glue code.
> > The infrastructure code is meant to be reused for other similar
> > multi-buffer algorithms.
>
> The flushing part that uses the sched thing is sha1 specific, even
> though it strikes me as not being so. Flushing buffers on idle seems
> like a 'generic' thing.
>
> > We use nr_running_cpu to check whether there are other tasks running on
> > the *current* cpu (not another cpu),
>
> And yet, the function allows you to do exactly that..
How about a function "single_task_running()"? Something like
bool single_task_running(void)
{
	return cpu_rq(smp_processor_id())->nr_running == 1;
}
>
> > to decide if we should flush
> > and compute the accumulated crypto jobs. If there's nobody else running,
> > we can take advantage of available cpu cycles on the cpu we are running
> > on to do computation on the existing jobs in a SIMD manner.
> > Waiting a bit longer may accumulate more jobs to process in parallel
> > in a single SIMD instruction, but will incur more delay.
>
> So you already have an idle notifier (which is x86 only, we should fix
> that I suppose), and you then double check there really isn't anything
> else running.
>
> How much, if anything, does that second check buy you? There's just not
> a single word on that.
>
> Also, there is not a word on the latency vs throughput tradeoff you
> make. I can imagine that for very short idle durations you lose, not
> win with this thing.
I am not crazy about the idle_notifier implementation either. Thinking
a bit more, I can probably achieve most of what I need in the code path
that pulls jobs from the work queue. I'll flush the remaining
partially processed jobs if I don't have any more jobs to process and no
other tasks are running (i.e. single_task_running() is true).
I'll try to move this logic into the generic multi-buffer job handler
path and out of the idle_notifier.
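Something like this in the handler path (sketch; mb_more_jobs() and
mb_flush_partial() are hypothetical placeholders):

	/*
	 * After pulling jobs from the crypto workqueue: if the queue is
	 * empty and nothing else is runnable on this cpu, flush the
	 * partially filled lanes now instead of waiting for more jobs
	 * to batch.
	 */
	if (!mb_more_jobs() && single_task_running())
		mb_flush_partial();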
>
> So for now I still see no reason for doing this.
>
> Also, I wonder about SMT: the point of this is to make best use of the
> SIMD pipelines, so does it still make sense to use siblings at the same
> time even though you're running hand-crafted ASM to stuff the pipelines
> to the brim? Should this thing be SMT-aware and not gather queues for
> both siblings?
I am processing the crypto job on the same cpu that the crypto
jobs originate from, to get the best cache utilization.
It will be the scheduler's decision to move the task that's
generating a lot of crypto jobs to another cpu, if the sibling
is also very busy.
Thanks for your comments and reviews.
Tim
On Tue, 2014-07-15 at 14:07 +0200, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> > So you already have an idle notifier (which is x86 only, we should fix
> > that I suppose), and you then double check there really isn't anything
> > else running.
>
> Note that we've already paid a large part of the expense of going idle
> by the time we call that idle notifier -- specifically, we've
> reprogrammed the clock to stop the tick.
>
> It's really wasteful to then generate work again, which means we have to
> again reprogram the clock etc.
Will try another version of the patch without using the idle notifier.
Tim
On Tue, 2014-07-15 at 14:59 +0200, Thomas Gleixner wrote:
> On Tue, 15 Jul 2014, Peter Zijlstra wrote:
>
> > On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> > > So you already have an idle notifier (which is x86 only, we should fix
> > > that I suppose), and you then double check there really isn't anything
> > > else running.
> >
> > Note that we've already paid a large part of the expense of going idle
> > by the time we call that idle notifier -- specifically, we've
> > reprogrammed the clock to stop the tick.
> >
> > It's really wasteful to then generate work again, which means we have to
> > again reprogram the clock etc.
>
> Doing anything which is not related to idle itself in the idle
> notifier is just plain wrong.
I don't like kicking the multi-buffer job flush from the idle_notifier
path either. I'll try another version of the patch by doing this in the
multi-buffer job handler path.
>
> If that stuff wants to utilize idle slots, we really need to come up
> with a generic and general solution. Otherwise we'll grow those warts
> all over the architecture space, with slightly different ways of
> wrecking the world, and then some.
>
> This whole attitude of people thinking that they need their own
> specialized scheduling around the real scheduler is a PITA. All this
> stuff is just damaging any sensible approach to power saving, load
> balancing, etc.
>
> What we really want is infrastructure, which allows the scheduler to
> actively query the async work situation and based on the results
> actively decide when to process it and where.
I agree with you. It would be great if we had such infrastructure.
Thanks.
Tim
On Tue, Jul 15, 2014 at 08:06:55PM +0200, Mike Galbraith wrote:
> On Tue, 2014-07-15 at 16:53 +0200, Peter Zijlstra wrote:
> > On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
> > >
> > > 3.0.101-default 3.753363 usecs/loop -- avg 3.770737 530.4 KHz 1.000
> > > 3.14.10-default 4.145348 usecs/loop -- avg 4.139987 483.1 KHz .910 1.000
> > > 3.15.4-default 4.355594 usecs/loop -- avg 4.351961 459.6 KHz .866 .951 1.000
> > > 3.16.0-default 4.537279 usecs/loop -- avg 4.543532 440.2 KHz .829 .911 .957
> > >
> > > 3.0.101-smp 3.692377 usecs/loop -- avg 3.690774 541.9 KHz 1.000
> > > 3.14.10-smp 4.010009 usecs/loop -- avg 4.009019 498.9 KHz .920
> > > 3.15.4-smp 3.882398 usecs/loop -- avg 3.884095 514.9 KHz .950
> > > 3.16.0-master 4.061003 usecs/loop -- avg 4.058244 492.8 KHz .909
> >
> > Urgh,.. I need to go fix that :/
>
> I'm poking about. It's not just one thing 'course, just lots of change
> adding up to less than wonderful. Idle changes are costing some, for
> obese config, avg goop. The select_next_task() reorganization appears
> to be costing, but eyeballing, I can see no excuse for that at all.
How is the idle stuff costing? A cpu-affine pipe-test should pretty much
peg a cpu at 100%, right? Or did I misunderstand, and are you running a
loose pipe-test?
On Tue, 2014-07-15 at 21:03 +0200, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 08:06:55PM +0200, Mike Galbraith wrote:
> > On Tue, 2014-07-15 at 16:53 +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
> > > >
> > > > 3.0.101-default 3.753363 usecs/loop -- avg 3.770737 530.4 KHz 1.000
> > > > 3.14.10-default 4.145348 usecs/loop -- avg 4.139987 483.1 KHz .910 1.000
> > > > 3.15.4-default 4.355594 usecs/loop -- avg 4.351961 459.6 KHz .866 .951 1.000
> > > > 3.16.0-default 4.537279 usecs/loop -- avg 4.543532 440.2 KHz .829 .911 .957
> > > >
> > > > 3.0.101-smp 3.692377 usecs/loop -- avg 3.690774 541.9 KHz 1.000
> > > > 3.14.10-smp 4.010009 usecs/loop -- avg 4.009019 498.9 KHz .920
> > > > 3.15.4-smp 3.882398 usecs/loop -- avg 3.884095 514.9 KHz .950
> > > > 3.16.0-master 4.061003 usecs/loop -- avg 4.058244 492.8 KHz .909
> > >
> > > Urgh,.. I need to go fix that :/
> >
> > I'm poking about. It's not just one thing 'course, just lots of change
> > adding up to less than wonderful. Idle changes are costing some, for
> > obese config, avg goop. The select_next_task() reorganization appears
> > to be costing, but eyeballing, I can see no excuse for that at all.
>
> How is the idle stuff costing? A cpu-affine pipe-test should pretty much
> peg a cpu at 100%, right? Or did I misunderstand, and are you running a
> loose pipe-test?
Exactly, I'm measuring the cost of popping in and out of idle while
trying to reclaim overlap (which doesn't exist in the pipe-test case).
-Mike
On Tue, 15 Jul 2014, Tim Chen wrote:
> On Tue, 2014-07-15 at 14:59 +0200, Thomas Gleixner wrote:
> > On Tue, 15 Jul 2014, Peter Zijlstra wrote:
> >
> > > On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> > > > So you already have an idle notifier (which is x86 only, we should fix
> > > > that I suppose), and you then double check there really isn't anything
> > > > else running.
> > >
> > > Note that we've already paid a large part of the expense of going idle
> > > by the time we call that idle notifier -- specifically, we've
> > > reprogrammed the clock to stop the tick.
> > >
> > > It's really wasteful to then generate work again, which means we have to
> > > again reprogram the clock etc.
> >
> > Doing anything which is not related to idle itself in the idle
> > notifier is just plain wrong.
>
> I don't like kicking the multi-buffer job flush from the idle_notifier
> path either. I'll try another version of the patch by doing this in the
> multi-buffer job handler path.
>
> >
> > If that stuff wants to utilize idle slots, we really need to come up
> > with a generic and general solution. Otherwise we'll grow those warts
> > all over the architecture space, with slightly different ways of
> > wrecking the world, and then some.
> >
> > This whole attitude of people thinking that they need their own
> > specialized scheduling around the real scheduler is a PITA. All this
> > stuff is just damaging any sensible approach to power saving, load
> > balancing, etc.
> >
> > What we really want is infrastructure, which allows the scheduler to
> > actively query the async work situation and based on the results
> > actively decide when to process it and where.
>
> I agree with you. It would be great if we had such infrastructure.
You are heartily invited to come up with that. :)