2013-03-09 00:51:10

by Frederic Weisbecker

Subject: [ANNOUNCE] 3.9-rc1-nohz1

Hi,

Several fixes there. This version should also produce far fewer spurious
warnings. Your testing and reviews are much appreciated.

The first 5 patches of the series are pending on a pull request for -tip
(3.10 material).

I'm now considering how I should upstream the rest of the series.
All the pieces that got merged until now were sort of easy because the various
chunks were pretty self-contained and independent (full dynticks cputime
accounting, printk, RCU user mode, dynticks API generalization, etc.).

Now what remains in this series is hard to cut into individual parts.
Everything depends on defining an interface, via a kernel parameter,
to partition the set of full dynticks CPUs.

I think we really need to start using a branch in -tip and move incrementally
from there with the following steps:

1) Set the kernel parameters and config option
2) Handle timers wakeup, timekeeping, posix cpu timers, perf, sched etc...
on top of kernel parameter based CPU partition
3) Once we know _everything_ is handled, bring the final dynticks infrastructure
4) Upstream

This will make everything much easier for everyone: easier piecewise reviews and easier for
other people to contribute.

Because you don't want me to spam you with ~40 commits for 2 more years, right?

Thanks.

This version can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
3.9-rc1-nohz1

---
Changes since 3.8-rc6-nohz4:

* Rebase against 3.9-rc1

* Fixed a few races with exception and preemption handling [1-3/29]

* Dropped commit "sched: Remove broken check for skip clock update"
that was buggy (thanks Steve for pointing that out)

* Ignore noisy stale rq clock detection on boot and other situations
with rq->skip_clock_update [27/29]

* Dropped commit "sched: Update clock of nohz busiest rq before balancing"
that became useless (thanks Li Zhong)

* Don't issue a self IPI on timer enqueue if the CPU didn't stop its
tick [9/29]

* Slightly rename the Kconfig menu after discussion with Borislav [6/29]

* Handle broken full_nohz mask in kernel parameters (thanks Borislav) [6/29]
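
For illustration only, here is a minimal user-space C sketch of what handling a
broken full_nohz= mask could look like. It assumes the parameter takes a CPU
list such as "1-3,5", that a malformed list simply disables the feature, and
that the boot CPU is dropped from the mask so that one CPU keeps the
timekeeping duty (compare "nohz: Assign timekeeping duty to a non-full-nohz
CPU"). parse_cpulist(), setup_full_nohz() and NR_CPUS = 64 are hypothetical
names and values chosen for this sketch, not code from the series.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_CPUS 64	/* assumption: all CPUs fit in one 64-bit mask */

/* Parse a list such as "1-3,5" into *mask. Returns 0, or -1 if malformed. */
static int parse_cpulist(const char *s, unsigned long long *mask)
{
	unsigned long long m = 0;

	if (!s || !*s)
		return -1;

	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (!isdigit((unsigned char)*s))
			return -1;
		lo = hi = strtoul(s, &end, 10);

		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= NR_CPUS)
			return -1;

		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			m |= 1ULL << cpu;

		if (*end == ',') {
			end++;
			if (!*end)		/* trailing comma */
				return -1;
		} else if (*end) {
			return -1;		/* unexpected character */
		}
		s = end;
	}

	*mask = m;
	return 0;
}

/*
 * Hypothetical policy: a malformed list yields an empty mask (feature off),
 * and the boot CPU is removed so that it can keep the tick for timekeeping.
 */
static unsigned long long setup_full_nohz(const char *arg,
					  unsigned int boot_cpu,
					  bool *boot_cpu_cleared)
{
	unsigned long long mask;

	assert(boot_cpu < NR_CPUS);
	*boot_cpu_cleared = false;

	if (parse_cpulist(arg, &mask) < 0)
		return 0;

	if (mask & (1ULL << boot_cpu)) {
		mask &= ~(1ULL << boot_cpu);
		*boot_cpu_cleared = true;
	}
	return mask;
}
```

Under these assumptions, setup_full_nohz("0-3", 0, &cleared) keeps CPUs 1-3 and
sets cleared, while "3-1" or "1,,2" yields an empty mask.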

---
TODO list hasn't changed much:

- Posix CPU timers
- Perf events
- sched_class::task_tick()
- various other scheduler details
- ...

---
Frederic Weisbecker (29):
context_tracking: Move exception handling to generic code
context_tracking: Restore correct previous context state on exception exit
context_tracking: Restore preempted context state after preempt_schedule_irq()
cputime: Dynamically scale cputime for full dynticks accounting
context_tracking: Enable probes by default for selftesting
nohz: Basic full dynticks interface
nohz: Assign timekeeping duty to a non-full-nohz CPU
nohz: Trace timekeeping update
nohz: Wake up full dynticks CPUs when a timer gets enqueued
rcu: Restart the tick on non-responding full dynticks CPUs
sched: Comment on rq->clock correctness in ttwu_do_wakeup() in nohz
sched: Update rq clock on nohz CPU before migrating tasks
sched: Update rq clock on nohz CPU before setting fair group shares
sched: Update rq clock on tickless CPUs before calling check_preempt_curr()
sched: Update rq clock earlier in unthrottle_cfs_rq
sched: Update rq clock before idle balancing
sched: Update nohz rq clock before searching busiest group on load balancing
nohz: Move nohz load balancer selection into idle logic
nohz: Full dynticks mode
nohz: Only stop the tick on RCU nocb CPUs
nohz: Don't turn off the tick if rcu needs it
nohz: Don't stop the tick if posix cpu timers are running
nohz: Add some tracing
rcu: Don't keep the tick for RCU while in userspace
timer: Don't run non-pinned timer to full dynticks CPUs
sched: Use an accessor to read rq clock
sched: Debug nohz rq clock
sched: Update rq clock before rt sched average scale
sched: Disable lb_bias feature for full dynticks

arch/x86/include/asm/context_tracking.h | 21 ----
arch/x86/kernel/kvm.c | 8 +-
arch/x86/kernel/traps.c | 68 +++++++++-----
arch/x86/mm/fault.c | 8 +-
include/linux/context_tracking.h | 24 +++++-
include/linux/posix-timers.h | 1 +
include/linux/rcupdate.h | 8 ++
include/linux/sched.h | 14 ++-
include/linux/tick.h | 9 ++
init/Kconfig | 1 +
kernel/fork.c | 2 +-
kernel/hrtimer.c | 3 +-
kernel/posix-cpu-timers.c | 11 ++
kernel/rcutree.c | 19 +++-
kernel/rcutree.h | 1 -
kernel/rcutree_plugin.h | 13 +--
kernel/sched/core.c | 110 ++++++++++++++++++++--
kernel/sched/cputime.c | 154 ++++++++++++++++---------------
kernel/sched/fair.c | 79 +++++++++++-----
kernel/sched/features.h | 3 +
kernel/sched/rt.c | 8 +-
kernel/sched/sched.h | 61 ++++++++++++
kernel/sched/stats.h | 8 +-
kernel/sched/stop_task.c | 8 +-
kernel/softirq.c | 5 +-
kernel/time/Kconfig | 9 ++
kernel/time/tick-broadcast.c | 3 +-
kernel/time/tick-common.c | 5 +-
kernel/time/tick-sched.c | 134 ++++++++++++++++++++++++---
kernel/timer.c | 5 +-
30 files changed, 587 insertions(+), 216 deletions(-)

--
1.7.5.4


2013-03-09 08:27:13

by Ingo Molnar

Subject: Re: [ANNOUNCE] 3.9-rc1-nohz1


* Frederic Weisbecker <[email protected]> wrote:

> Hi,
>
> Several fixes there. This version should also produce far fewer spurious warnings.
> Your testing and reviews are much appreciated.
>
> The first 5 patches of the series are pending on a pull request for -tip (3.10
> material).
>
> I'm now considering how I should upstream the rest of the series. All the pieces
> that got merged until now were sort of easy because the various chunks were pretty
> self-contained and independent (full dynticks cputime accounting, printk, RCU user
> mode, dynticks API generalization, etc.).
>
> Now what remains in this series is hard to cut into individual parts. Everything
> depends on defining an interface, via a kernel parameter, to partition the set of
> full dynticks CPUs.
>
> I think we really need to start using a branch in -tip and move incrementally from
> there with the following steps:
>
> 1) Set the kernel parameters and config option
> 2) Handle timers wakeup, timekeeping, posix cpu timers, perf, sched etc...
> on top of kernel parameter based CPU partition
> 3) Once we know _everything_ is handled, bring the final dynticks infrastructure
> 4) Upstream
>
> This will make everything much easier for everyone: easier piecewise reviews and
> easier for other people to contribute.
>
> Because you don't want me to spam you with ~40 commits for 2 more years, right?
>
> Thanks.
>
> This version can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 3.9-rc1-nohz1
>
> ---
> Changes since 3.8-rc6-nohz4:
>
> * Rebase against 3.9-rc1
>
> * Fixed a few races with exception and preemption handling [1-3/29]
>
> * Dropped commit "sched: Remove broken check for skip clock update"
> that was buggy (thanks Steve for pointing that out)
>
> * Ignore noisy stale rq clock detection on boot and other situations
> with rq->skip_clock_update [27/29]
>
> * Dropped commit "sched: Update clock of nohz busiest rq before balancing"
> that became useless (thanks Li Zhong)
>
> * Don't issue a self IPI on timer enqueue if the CPU didn't stop its
> tick [9/29]
>
> * Slightly rename the Kconfig menu after discussion with Borislav [6/29]
>
> * Handle broken full_nohz mask in kernel parameters (thanks Borislav) [6/29]
>
> ---
> TODO list hasn't changed much:
>
> - Posix CPU timers
> - Perf events
> - sched_class::task_tick()
> - various other scheduler details
> - ...

We could certainly start tip:sched/dynticks (or tip:timers/dynticks) to accelerate
the upstream merging of it. Nobody expressed deep concerns with the approach, so
what is left is some more hard work.

Two quick requests:

- Mind adding a Documentation/... file with a high level description,
rough design, open problems, etc.?

- Please outline how the current TODO entries affect upstream
mergeability. Does it reduce the 'full'-ness of this dynticks mode?
Outright buggy behavior? Other trade-offs?

Thanks,

Ingo

2013-03-10 23:53:38

by Frederic Weisbecker

Subject: Re: [ANNOUNCE] 3.9-rc1-nohz1

2013/3/9 Ingo Molnar <[email protected]>:
> We could certainly start tip:sched/dynticks (or tip:timers/dynticks) to accelerate
> the upstream merging of it. Nobody expressed deep concerns with the approach, so
> what is left is some more hard work.

Great to see you're ok with that direction! I'm working on that then.

>
> Two quick requests:
>
> - Mind adding a Documentation/... file with a high level description,
> rough design, open problems, etc.?

Sure! We'll maintain that along the way.

>
> - Please outline how the current TODO entries affect upstream
> mergeability. Does it reduce the 'full'-ness of this dynticks mode?
> Outright buggy behavior? Other trade-offs?

Mostly this is about upstream features that won't be working with the
current state of the art: enqueuing a posix cpu timer on a nohz CPU
may result in it being ignored by the target due to the lack of
ticking until expiration, perf events may not be round-robined, etc...
I'll make sure to document all these items.

Thanks.

2013-03-11 07:39:15

by Ingo Molnar

Subject: Re: [ANNOUNCE] 3.9-rc1-nohz1


* Frederic Weisbecker <[email protected]> wrote:

> > - Please outline how the current TODO entries affect upstream
> > mergeability. Does it reduce the 'full'-ness of this dynticks mode?
> > Outright buggy behavior? Other trade-offs?
>
> Mostly this is about upstream features that won't be working with the current
> state of the art: enqueuing a posix cpu timer on a nohz CPU may result in it being
> ignored by the target due to the lack of ticking until expiration, perf events may
> not be round-robined, etc... I'll make sure to document all these items.

So it's "buggy behavior of existing features" it appears?

It would be really useful to add some sort of 'make it safe easily' mechanism:

- if a posix timer is enqueued on a CPU, then the CPU should have a timer ticking

- if perf events are active on a CPU, then it should have a timer ticking

this would make it mergeable, as most of the time systems don't have any of these
facilities active. Plus this dynticks-off mechanism would also allow us to cover any
other (still unknown) facility that regresses. So it would be nice to have that
option.
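
As a rough illustration of the 'make it safe easily' idea above, here is a
minimal user-space C sketch, assuming each CPU carries a bitmask of facilities
that still need the periodic tick and that the tick may only be stopped while
that mask is empty. The names tick_needed_set(), tick_needed_clear(),
can_stop_tick() and the TICK_NEEDED_* bits are hypothetical and are not taken
from the series.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* assumption: small fixed CPU count for the sketch */

/* Hypothetical reasons for keeping the tick alive on a CPU. */
#define TICK_NEEDED_POSIX_TIMER	(1u << 0)	/* posix cpu timer enqueued */
#define TICK_NEEDED_PERF_EVENTS	(1u << 1)	/* perf events active */

/* One dependency mask per CPU; zero means nothing requires the tick. */
static unsigned int tick_needed[NR_CPUS];

/* Record that a facility needs the tick on this CPU. */
static void tick_needed_set(unsigned int cpu, unsigned int reason)
{
	assert(cpu < NR_CPUS);
	tick_needed[cpu] |= reason;
}

/* Drop a dependency once the facility is no longer in use on this CPU. */
static void tick_needed_clear(unsigned int cpu, unsigned int reason)
{
	assert(cpu < NR_CPUS);
	tick_needed[cpu] &= ~reason;
}

/* The tick may be stopped only when no facility depends on it. */
static bool can_stop_tick(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	return tick_needed[cpu] == 0;
}
```

Each remaining limitation then shows up as one named bit, which fits the point
about being able to find them by grepping the source: removing a limitation
amounts to removing the corresponding tick_needed_set() caller.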

Later on we could gradually eliminate these limitations. It would also be apparent
where they are, just from grepping the source.

If that's done, and if it tests fine for a few weeks then this could be v3.10
material IMO.

Thanks,

Ingo

2013-03-11 16:38:45

by Frederic Weisbecker

Subject: Re: [ANNOUNCE] 3.9-rc1-nohz1

2013/3/11 Ingo Molnar <[email protected]>:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
>> > - Please outline how the current TODO entries affect upstream
>> > mergeability. Does it reduce the 'full'-ness of this dynticks mode?
>> > Outright buggy behavior? Other trade-offs?
>>
>> Mostly this is about upstream features that won't be working with the current
>> state of the art: enqueuing a posix cpu timer on a nohz CPU may result in it being
>> ignored by the target due to the lack of ticking until expiration, perf events may
>> not be round-robined, etc... I'll make sure to document all these items.
>
> So it's "buggy behavior of existing features" it appears?

Right.

> It would be really useful to add some sort of 'make it safe easily' mechanism:
>
> - if a posix timer is enqueued on a CPU, then the CPU should have a timer ticking
>
> - if perf events are active on a CPU, then it should have a timer ticking
>
> this would make it mergeable, as most of the time systems don't have any of these
> facilities active. Plus this dynticks-off mechanism would also allow us to cover any
> other (still unknown) facility that regresses. So it would be nice to have that
> option.

Yeah, that's how I intended to solve the issue for these cases. In fact
I don't worry that much about posix cpu timers and perf. These should
not be hard to cope with. I'm more worried about scheduler details in
scheduler_tick().

I covered the rq clock and a part of update_cpu_load_active().

Now we still have to take care of sched_avg_update(),
calc_load_account_active() and sched_class::task_tick() to make sure
we are not leaving anything behind. There is rq->rt_avg, which seems to
be used for load balancing when rt tasks are around. Then there is
calc_load_update. Idle load balancing is concerned as well. I haven't
looked deeply into these places, so I don't know what can be shortcut
there and what cannot.

> Later on we could gradually eliminate these limitations. It would also be apparent
> where they are, just from grepping the source.
>
> If that's done, and if it tests fine for a few weeks then this could be v3.10
> material IMO.

Ok, I'm not that optimistic about the release time, but things are
certainly going to be faster now. I'm going to reshape and send you
what I have now, then we'll have a fresher view of the rest.