2012-10-29 20:50:43

by Steven Rostedt

Subject: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

A while ago Frederic posted a series of patches to get an idea of
how to implement nohz cpusets: you add a task to a cpuset and mark
the set as 'nohz'. When the task runs on a CPU and is the only task
scheduled (nr_running == 1), the tick will stop. The idea is to give
the task as little kernel interference as possible. If the task
doesn't do any system calls (and possibly even if it does), no timer
interrupt will bother it. By using isolcpus and a nohz cpuset, a task
would be able to achieve true CPU isolation.
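
A minimal user-space sketch of the rule described above, for
illustration only. It is not code from the series: the struct, its
field names and can_stop_tick() are hypothetical, and the extra
conditions (RCU, posix cpu timers, printk) are only inferred from the
patch titles listed further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state; names are illustrative, not kernel API. */
struct cpu_state {
    bool in_nohz_cpuset;     /* CPU belongs to a cpuset marked 'nohz' */
    unsigned int nr_running; /* runnable tasks on this CPU */
    bool rcu_needs_tick;     /* RCU still wants the tick on this CPU */
    bool posix_cpu_timer;    /* a posix cpu timer is running */
    bool printk_pending;     /* printk needs the tick */
};

/* Return true if the periodic tick may be stopped on this CPU. */
bool can_stop_tick(const struct cpu_state *cpu)
{
    assert(cpu != NULL);

    if (!cpu->in_nohz_cpuset)
        return false;
    if (cpu->nr_running != 1)   /* only a lone task runs tickless */
        return false;
    if (cpu->rcu_needs_tick || cpu->posix_cpu_timer || cpu->printk_pending)
        return false;
    return true;
}
```

With this model, a CPU in a nohz cpuset running exactly one task and
with nothing else pending is the only case where the tick stops.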

This has long been asked for by those in the RT community. If a task
requires uninterrupted CPU time, this would be able to give it that,
even without the full PREEMPT-RT patch set.

This patch set is not for inclusion. It is just to get the topic
at the forefront again. The design requires more work and more
discussion.

I ported Frederic's work to v3.7-rc3 and I'm posting it here so that
people can comment on it. I just did the minimum to get it to compile
and boot. I haven't done any real tests with it yet. I may have screwed
some things up during the port, but that's OK, because the patch set
will most likely require a rewrite anyway.

Please have a look, and let's get this out the door.

-- Steve


Frederic Weisbecker (31):
nohz: Move nohz load balancer selection into idle logic
cpuset: Set up interface for nohz flag
nohz: Try not to give the timekeeping duty to an adaptive tickless cpu
x86: New cpuset nohz irq vector
nohz: Adaptive tick stop and restart on nohz cpuset
nohz/cpuset: Don't turn off the tick if rcu needs it
nohz/cpuset: Wake up adaptive nohz CPU when a timer gets enqueued
nohz/cpuset: Don't stop the tick if posix cpu timers are running
nohz/cpuset: Restart tick when nohz flag is cleared on cpuset
nohz/cpuset: Restart the tick if printk needs it
rcu: Restart the tick on non-responding adaptive nohz CPUs
rcu: Restart tick if we enqueue a callback in a nohz/cpuset CPU
nohz: Generalize tickless cpu time accounting
nohz/cpuset: Account user and system times in adaptive nohz mode
nohz/cpuset: New API to flush cputimes on nohz cpusets
nohz/cpuset: Flush cputime on threads in nohz cpusets when waiting leader
nohz/cpuset: Flush cputimes on procfs stat file read
nohz/cpuset: Flush cputimes for getrusage() and times() syscalls
x86: Syscall hooks for nohz cpusets
nohz: Don't restart the tick before scheduling to idle
sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz
sched: Update rq clock on nohz CPU before migrating tasks
sched: Update rq clock on nohz CPU before setting fair group shares
sched: Update rq clock on tickless CPUs before calling check_preempt_curr()
sched: Update rq clock earlier in unthrottle_cfs_rq
sched: Update clock of nohz busiest rq before balancing
sched: Update rq clock before idle balancing
sched: Update nohz rq clock before searching busiest group on load balancing
rcu: Switch to extended quiescent state in userspace from nohz cpuset
nohz/cpuset: Disable under some configs
nohz, not for merge: Add tickless tracing

Hakan Akkan (1):
nohz/cpuset: enable addition&removal of cpus while in adaptive nohz mode

----
arch/Kconfig | 3 +
arch/x86/include/asm/entry_arch.h | 3 +
arch/x86/include/asm/hw_irq.h | 7 +
arch/x86/include/asm/irq_vectors.h | 2 +
arch/x86/include/asm/smp.h | 11 +-
arch/x86/kernel/entry_64.S | 4 +
arch/x86/kernel/irqinit.c | 4 +
arch/x86/kernel/ptrace.c | 11 +
arch/x86/kernel/smp.c | 28 +++
fs/proc/array.c | 2 +
include/linux/cpuset.h | 35 ++++
include/linux/kernel_stat.h | 2 +
include/linux/posix-timers.h | 1 +
include/linux/rcupdate.h | 1 +
include/linux/sched.h | 10 +-
include/linux/tick.h | 72 +++++--
init/Kconfig | 8 +
kernel/cpuset.c | 144 ++++++++++++-
kernel/exit.c | 8 +
kernel/posix-cpu-timers.c | 12 ++
kernel/printk.c | 15 +-
kernel/rcutree.c | 28 ++-
kernel/sched/core.c | 82 +++++++-
kernel/sched/cputime.c | 22 ++
kernel/sched/fair.c | 41 +++-
kernel/sched/sched.h | 18 ++
kernel/softirq.c | 6 +-
kernel/sys.c | 6 +
kernel/time/tick-sched.c | 398 ++++++++++++++++++++++++++++++++----
kernel/time/timer_list.c | 3 +-
kernel/timer.c | 2 +-
31 files changed, 912 insertions(+), 77 deletions(-)


2012-10-30 14:02:53

by Gilad Ben-Yossef

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Mon, Oct 29, 2012 at 10:27 PM, Steven Rostedt wrote:
>
> A while ago Frederic posted a series of patches to get an idea on
> how to implement nohz cpusets.
<snip>
> By using
> isocpus and nohz cpuset, a task would be able to achieve true cpu
> isolation.
>
> This has been long asked for by those in the RT community. If a task
> requires uninterruptible CPU time, this would be able to give a task
> that, even without the full PREEMPT-RT patch set.
>
> This patch set is not for inclusion. It is just to get the topic
> at the forefront again. The design requires more work and more
> discussion.
>

Three additional data points that might be of interest to the discussion:

1. AFAIK both Tilera and Cavium carry patch sets with similar
functionality in their respective kernels, so the idea has some real
world users already.

2. I tested a previous version of the same patch set (based on 3.3)
together with some fixes* and got the same latency, in cycles, from a
simple test program and a version of said program running bare metal
with no OS. The same program running without this patch got 3 orders
of magnitude higher latency. So, this certainly shows some great
potential.

3. Even if you don't care about latency at all, on a massively
multi-core (or hyperscale, as I've read some people call it now)
system, assigning a task to a single CPU can make a lot of sense
from a cache utilization perspective etc.; if you do that, this feature
can give a performance boost to anything that is mostly CPU bound and
perhaps to some workloads that are not so CPU bound as well.
Specifically, many high performance computing type of workloads come
to mind. So, this has the potential to be useful to both RT folks and
HPC folks, I think.

[*] A newer version of the patch set:
http://www.spinics.net/lists/linux-mm/msg33860.html and disabling the
part that sends an IPI to update cputime for nohz/cpuset CPUs.

Thanks,
Gilad


--
Gilad Ben-Yossef
Chief Coffee Drinker
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a situation
where the homework eats your dog?"
-- Jean-Baptiste Queru

by Christoph Lameter

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Mon, 29 Oct 2012, Steven Rostedt wrote:

> A while ago Frederic posted a series of patches to get an idea on
> how to implement nohz cpusets. Where you can add a task to a cpuset
> and mark the set to be 'nohz'. When the task runs on a CPU and is
> the only task scheduled (nr_running == 1), the tick will stop.
> The idea is to give the task the least amount of kernel interference
> as possible. If the task doesn't do any system calls (and possibly
> even if it does), no timer interrupt will bother it. By using
> isocpus and nohz cpuset, a task would be able to achieve true cpu
> isolation.

I thought isolcpus was on the way out? If there is no timer interrupt then
there will also be no scheduler activity. Why do we need both?

Also could we have this support without cpusets? There are multiple means
to do system segmentation (f.e. cgroups) and something like hz control is
pretty basic. Control via some cpumask like irq affinities in f.e.

/sys/devices/system/cpu/nohz

or a per cpu flag in

/sys/devices/system/cpu/cpu0/hz

would be easier and not be tied to something like cpusets.
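
To make the cpumask idea concrete, here is a hedged sketch of how a
tool might parse such a mask, assuming the proposed file used the same
hex format as the existing irq smp_affinity files (comma-separated
32-bit groups, most significant first). The nohz file itself is only a
proposal in this thread; parse_hex_cpumask() and cpu_in_mask() are
hypothetical helpers, and the sketch handles at most 64 CPUs.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Parse a hex cpumask such as "f" or "1,00000003" into a 64-bit mask.
 * Returns false on malformed input or when more than two 32-bit
 * groups are given.
 */
bool parse_hex_cpumask(const char *s, uint64_t *mask)
{
    uint64_t m = 0;
    uint32_t group = 0;
    int digits = 0;     /* hex digits seen in the current group */
    int groups = 0;     /* groups completed so far */

    assert(s != NULL && mask != NULL);

    for (;; s++) {
        unsigned char c = (unsigned char)*s;

        if (c == ',' || c == '\0' || c == '\n') {
            if (digits == 0 || groups == 2)
                return false;
            m = (m << 32) | group;  /* append the finished group */
            groups++;
            group = 0;
            digits = 0;
            if (c != ',')
                break;              /* end of string or line */
            continue;
        }
        if (!isxdigit(c) || digits == 8)
            return false;
        group = (group << 4) |
                (uint32_t)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
        digits++;
    }
    *mask = m;
    return true;
}

/* True if bit 'cpu' is set in the mask. */
bool cpu_in_mask(uint64_t mask, unsigned int cpu)
{
    return cpu < 64 && ((mask >> cpu) & 1u);
}
```

For example, a mask of "f" would mark CPUs 0-3, and "1,00000003" would
mark CPUs 0, 1 and 32.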

Also it would be best to sync this conceptually with the processors
enabled for rcu processing.

Maybe have a series of cpumasks in /sys/devices/system/cpu/ ?

> This has been long asked for by those in the RT community. If a task
> requires uninterruptible CPU time, this would be able to give a task
> that, even without the full PREEMPT-RT patch set.

Also, those interested in low latency are very, very interested in this
feature, in particular in support that works without any preemption
support enabled in the kernel.

2012-11-02 14:37:31

by Steven Rostedt

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, 2012-11-02 at 14:23 +0000, Christoph Lameter wrote:
> On Mon, 29 Oct 2012, Steven Rostedt wrote:
>
> > A while ago Frederic posted a series of patches to get an idea on
> > how to implement nohz cpusets. Where you can add a task to a cpuset
> > and mark the set to be 'nohz'. When the task runs on a CPU and is
> > the only task scheduled (nr_running == 1), the tick will stop.
> > The idea is to give the task the least amount of kernel interference
> > as possible. If the task doesn't do any system calls (and possibly
> > even if it does), no timer interrupt will bother it. By using
> > isocpus and nohz cpuset, a task would be able to achieve true cpu
> > isolation.
>
> I thought isolcpus was on the way out? If there is no timer interrupt then
> there will also be no scheduler activity. Why do we need both?

I probably shouldn't have mentioned isolcpus. I was using it as a
general stand-in for getting everything off of a CPU (irq affinity,
for example).

>
> Also could we have this support without cpusets? There are multiple means
> to do system segmentation (f.e. cgroups) and something like hz control is
> pretty basic. Control via some cpumask like irq affinities in f.e.
>
> /sys/devices/system/cpu/nohz
>
> or a per cpu flag in
>
> /sys/devices/system/cpu/cpu0/hz
>
> would be easier and not be tied to something like cpusets.

Frederic will have to answer this. I was just starting with his patches.
Note, we are holding off this work for now until Frederic's other work
is done (the irq_work and printk updates).

>
> also it would be best to sync this conceptually with the processors
> enabled for rcu processing.

Processors can be disabled for rcu processing? Or are you talking about
Paul's new work of offloading rcu callbacks?

>
> Maybe have a series of cpumasks in /sys/devices/system/cpu/ ?
>
> > This has been long asked for by those in the RT community. If a task
> > requires uninterruptible CPU time, this would be able to give a task
> > that, even without the full PREEMPT-RT patch set.
>
> Also those interested in low latency are very very interested in this
> feature in particular in support without any preempt support on in the
> kernel.
>

Yep understood. We really need to get things rolling.

-- Steve

2012-11-02 14:50:46

by David Nyström

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On 11/02/2012 03:37 PM, Steven Rostedt wrote:
> On Fri, 2012-11-02 at 14:23 +0000, Christoph Lameter wrote:
>> On Mon, 29 Oct 2012, Steven Rostedt wrote:
>>
>>> A while ago Frederic posted a series of patches to get an idea on
>>> how to implement nohz cpusets. Where you can add a task to a cpuset
>>> and mark the set to be 'nohz'. When the task runs on a CPU and is
>>> the only task scheduled (nr_running == 1), the tick will stop.
>>> The idea is to give the task the least amount of kernel interference
>>> as possible. If the task doesn't do any system calls (and possibly
>>> even if it does), no timer interrupt will bother it. By using
>>> isocpus and nohz cpuset, a task would be able to achieve true cpu
>>> isolation.
>>

One other aspect that this patch probably needs to address is the cache
localization of irq spinlocks.

At least in 3.6, with !CONFIG_SPARSE_IRQ
--
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
        [0 ... NR_IRQS-1] = {
                .handle_irq = handle_bad_irq,
                .depth      = 1,
                .lock       = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
        }
};
--

You are likely to get a cache miss in the top half of your low latency
CPU anytime some other CPU has taken a spinlock which lies within the
same cache line.

Or is my understanding of the __cacheline_aligned_in_smp declaration wrong?
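
One way to probe that question from user space is sketched below. It
is only an illustration under stated assumptions: a 64-byte cache line,
GCC-style aligned attributes, and two small stand-in structs that are
not the kernel's struct irq_desc. It shows the general difference
between aligning an array variable (only the start of the array is
aligned) and aligning the element type (every element is padded to its
own line). Whether struct irq_desc itself carries a type-level
alignment would need to be checked in include/linux/irqdesc.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; real hardware may differ */

/* Small stand-in descriptor; not the kernel's struct irq_desc. */
struct desc_plain {
    int lock;
    int depth;
};

/* Same fields, but alignment is part of the type, so sizeof() is padded. */
struct desc_padded {
    int lock;
    int depth;
} __attribute__((aligned(CACHE_LINE)));

static_assert(sizeof(struct desc_padded) == CACHE_LINE,
              "type-level alignment pads each element to a full line");

/* Variable-level alignment: array start is aligned, elements stay packed. */
struct desc_plain plain_descs[4] __attribute__((aligned(CACHE_LINE)));

/* Type-level alignment: every element starts on its own line. */
struct desc_padded padded_descs[4];

/* True if both addresses fall into the same CACHE_LINE-sized block. */
bool same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}
```

In this model, the locks of plain_descs[0] and plain_descs[1] share a
line, while those of padded_descs[0] and padded_descs[1] do not.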

Br,
David

by Christoph Lameter

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, 2 Nov 2012, Steven Rostedt wrote:

> > also it would be best to sync this conceptually with the processors
> > enabled for rcu processing.
>
> Processors can be disabled for rcu processing? Or are you talking about
> Paul's new work of offloading rcu callbacks?

Yes. Paul's new work to remove rcu processing from processors. That needs
to be synced configuration wise somehow. It does not make sense to process
rcu callbacks on processors where the timer tick does not work anymore.

2012-11-02 15:14:05

by Steven Rostedt

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, 2012-11-02 at 15:03 +0000, Christoph Lameter wrote:
> On Fri, 2 Nov 2012, Steven Rostedt wrote:
>
> > > also it would be best to sync this conceptually with the processors
> > > enabled for rcu processing.
> >
> > Processors can be disabled for rcu processing? Or are you talking about
> > Paul's new work of offloading rcu callbacks?
>
> Yes. Paul's new work to remove rcu processing from processors. That needs
> to be synced configuration wise somehow. It does not make sense to process
> rcu callbacks on processors where the timer tick does not work anymore.

Don't worry, Paul is working with us too ;-)

-- Steve

2012-11-02 18:35:54

by Paul E. McKenney

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, Nov 02, 2012 at 03:03:01PM +0000, Christoph Lameter wrote:
> On Fri, 2 Nov 2012, Steven Rostedt wrote:
>
> > > also it would be best to sync this conceptually with the processors
> > > enabled for rcu processing.
> >
> > Processors can be disabled for rcu processing? Or are you talking about
> > Paul's new work of offloading rcu callbacks?
>
> Yes. Paul's new work to remove rcu processing from processors. That needs
> to be synced configuration wise somehow. It does not make sense to process
> rcu callbacks on processors where the timer tick does not work anymore.

In kernels built with CONFIG_FAST_NO_HZ=n, if there are callbacks,
then there will be a tick, with or without Frederic's adaptive ticks.
If CONFIG_FAST_NO_HZ=y, if there are callbacks but no tick, RCU will
arrange for a timer to allow RCU processing to proceed as needed, but
much longer than one tick in duration, and only until such time as the
RCU callbacks drain.
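
The two cases above can be summarized in a small decision function.
This is a hedged model of the description only, not RCU's actual
code: the enum, its values and rcu_wakeup_for() are hypothetical
names, and it covers nothing beyond the callback handling described
here.

```c
#include <stdbool.h>

/* What drives RCU callback processing on a CPU, per the text above. */
enum rcu_wakeup {
    RCU_NONE,       /* no callbacks queued, nothing to arrange */
    RCU_TICK,       /* the regular tick keeps RCU processing going */
    RCU_LONG_TIMER  /* a timer much longer than one tick is armed */
};

enum rcu_wakeup rcu_wakeup_for(bool fast_no_hz, bool has_callbacks,
                               bool tick_stopped)
{
    if (!has_callbacks)
        return RCU_NONE;
    if (!fast_no_hz)
        return RCU_TICK;    /* callbacks imply a tick */
    /* FAST_NO_HZ: only fall back to the long timer when the tick is off */
    return tick_stopped ? RCU_LONG_TIMER : RCU_TICK;
}
```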

So, yes, people who need absolutely all jitter to be banished at whatever
cost would want both adaptive ticks and no-CBs CPUs, but not everyone
who wants adaptive ticks would necessarily want the burden of choosing
which CPUs get callbacks offloaded from and where they should be executed.

So I believe that these need to be controlled separately for the immediate
future.

Thanx, Paul

by Christoph Lameter

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, 2 Nov 2012, Paul E. McKenney wrote:

> So I believe that these need to be controlled separately for the immediate
> future.

Yes they do but the configurations are similar and it would be best if
these were cpumasks in standard locations instead of being specified at
boot time or in a cpuset.

Put the cpu masks into

/sys/devices/system/cpu/{nohz_cpus,rcu_cpus}

or so?

2012-11-02 20:42:07

by Paul E. McKenney

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, Nov 02, 2012 at 08:16:58PM +0000, Christoph Lameter wrote:
> On Fri, 2 Nov 2012, Paul E. McKenney wrote:
>
> > So I believe that these need to be controlled separately for the immediate
> > future.
>
> Yes they do but the configurations are similar and it would be best if
> these were cpumasks in standard locations instead of being specified at
> boot time or in a cpuset.
>
> Put the cpu masks into
>
> /sys/devices/system/cpu/{nohz_cpus,rcu_cpus}
>
> or so?

The no-CBs mask would be read-only for some time -- changed only at
boot. Longer term, I hope to allow run-time modification, but...

Thanx, Paul

2012-11-02 20:51:55

by Steven Rostedt

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, 2012-11-02 at 13:41 -0700, Paul E. McKenney wrote:

> The no-CBs mask would be read-only for some time -- changed only at
> boot. Longer term, I hope to allow run-time modification, but...
>

but what? You're not looking to retire already are you? ;-)

-- Steve

2012-11-03 02:08:54

by Paul E. McKenney

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, Nov 02, 2012 at 04:51:50PM -0400, Steven Rostedt wrote:
> On Fri, 2012-11-02 at 13:41 -0700, Paul E. McKenney wrote:
>
> > The no-CBs mask would be read-only for some time -- changed only at
> > boot. Longer term, I hope to allow run-time modification, but...
>
> but what? You're not looking to retire already are you? ;-)

Not for a few decades. ;-)

But let's add the no-CBs mask to sysfs when I add the ability to run-time
modify that mask.

Thanx, Paul

by Christoph Lameter

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

On Fri, 2 Nov 2012, Paul E. McKenney wrote:

> On Fri, Nov 02, 2012 at 04:51:50PM -0400, Steven Rostedt wrote:
> > On Fri, 2012-11-02 at 13:41 -0700, Paul E. McKenney wrote:
> >
> > > The no-CBs mask would be read-only for some time -- changed only at
> > > boot. Longer term, I hope to allow run-time modification, but...
> >
> > but what? You're not looking to retire already are you? ;-)
>
> Not for a few decades. ;-)
>
> But let's add the no-CBs mask to sysfs when I add the ability to run-time
> modify that mast.

Well, we are creating a user ABI with the boot time option. It would be
best to get it right out of the door.

2012-11-05 22:32:23

by Frederic Weisbecker

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

2012/11/2 Christoph Lameter:
> Also could we have this support without cpusets? There are multiple means
> to do system segmentation (f.e. cgroups) and something like hz control is
> pretty basic. Control via some cpumask like irq affinities in f.e.
>
> /sys/devices/system/cpu/nohz
>
> or a per cpu flag in
>
> /sys/devices/system/cpu/cpu0/hz
>
> would be easier and not be tied to something like cpusets.

You really don't want that cpuset interface, do you? ;-)

Yeah, I think I agree with you. This adds a dependency on
cpusets/cgroups, and I wish we could avoid that if possible. Also,
cpusets may be a bit counterintuitive for this use case. What if a cpu
is included in both a nohz cpuset and a non-nohz cpuset? What is the
behaviour to adopt? An OR on the nohz flag, such that as long as the
CPU is in at least one nohz cpuset, it's considered a nohz CPU? Or
only shut down the tick for the tasks attached to the nohz cpusets? Do
we really want that per cgroup granularity and the overhead /
complexity that comes along?
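
To illustrate the ambiguity, here is a small sketch of the OR option,
assuming each cpuset is modelled as a 64-bit CPU mask plus a nohz
flag. The struct and both helpers are hypothetical and exist only for
this example; they are not taken from the patches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cpuset: its CPUs and whether it is 'nohz'. */
struct cpuset_model {
    uint64_t cpus;  /* bit n set means CPU n is in the set */
    bool nohz;
};

/* OR policy: a CPU is nohz if at least one nohz cpuset contains it. */
uint64_t nohz_cpus_or_policy(const struct cpuset_model *sets, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++)
        if (sets[i].nohz)
            mask |= sets[i].cpus;
    return mask;
}

/* CPUs that sit in both a nohz and a non-nohz cpuset. */
uint64_t ambiguous_cpus(const struct cpuset_model *sets, size_t n)
{
    uint64_t nohz = 0, hz = 0;

    for (size_t i = 0; i < n; i++) {
        if (sets[i].nohz)
            nohz |= sets[i].cpus;
        else
            hz |= sets[i].cpus;
    }
    return nohz & hz;
}
```

With a nohz set covering CPUs 0-3 and a non-nohz set covering CPUs
2-5, the OR policy treats CPUs 0-3 as nohz, yet CPUs 2 and 3 are also
claimed by the non-nohz set, which is exactly the overlap in question.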

No, I think we should keep it simple and have a plain per CPU property
for that, without involving cgroups.

So indeed a cpumask in /sys/devices/system/cpu/nohz looks like a
better interface.

>> This has been long asked for by those in the RT community. If a task
>> requires uninterruptible CPU time, this would be able to give a task
>> that, even without the full PREEMPT-RT patch set.
>
> Also those interested in low latency are very very interested in this
> feature in particular in support without any preempt support on in the
> kernel.

Sure, we are trying to make that full dynticks approach as generic as
possible.

2012-11-05 22:41:59

by Frederic Weisbecker

Subject: Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

2012/11/5 Christoph Lameter:
> On Fri, 2 Nov 2012, Paul E. McKenney wrote:
>
>> On Fri, Nov 02, 2012 at 04:51:50PM -0400, Steven Rostedt wrote:
>> > On Fri, 2012-11-02 at 13:41 -0700, Paul E. McKenney wrote:
>> >
>> > > The no-CBs mask would be read-only for some time -- changed only at
>> > > boot. Longer term, I hope to allow run-time modification, but...
>> >
>> > but what? You're not looking to retire already are you? ;-)
>>
>> Not for a few decades. ;-)
>>
>> But let's add the no-CBs mask to sysfs when I add the ability to run-time
>> modify that mast.
>
> Well we are creating a user ABi with the boot time option. It would be
> best to get it right out of the door.

I believe that a static setting through a boot option is a nice first
step already. Runtime tuning may involve dynamic migration and other
headaches. The nocb patch is tricky enough to review ;)