From: Nicolai Stange <nicstange@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>, linux-kernel@vger.kernel.org,
        Nicolai Stange <nicstange@gmail.com>
Subject: [RFC v8 00/28] adapt clockevents frequencies to mono clock
Date: Sat, 19 Nov 2016 17:00:27 +0100
Message-Id: <20161119160055.12491-1-nicstange@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 17156
Lines: 400

Goal: avoid programming ced devices too early for large deltas, for
      details, c.f. the description of [23/28].

Previous v7 can be found at [1].

While the runtimes looked Ok for x86, you were concerned about
outliers on ARM:

Thomas Gleixner <tglx@linutronix.de> writes:
> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>
>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>
> So why is the variance that high?
> You have an outlier on that intel as well which might be caused by
> NMI, but it might also be a systematic issue depending on the input
> parameters. 11 us on that ARM worries me.

I did those benchmarks on a defconfig'd linux-next with a Raspbian
running in userspace. This turned out to be the worst choice I could
have made.
- The USB host controller driven by dwc2 fired 8000 interrupts per
  second.
- Missing cpufreq support in upstream linux caused the core to always
  run at its minimum speed.
- The Raspbian userspace was a mostly idle but quite uncontrolled
  workload.

So, for the next round of measurements, I
- disabled almost all drivers, USB in particular,
- told the firmware to always run the core at its nominal max of 900MHz
- and used Busybox for the userspace.

While the adjustment process, i.e. the input parameters, had formerly
been driven by a NTP daemon, I switched to injecting deterministic
offsets via adjtimex(2) from time to time. Due to their exponential
decay, the adjustments span a larger range now.

On top of that, I generated various artificial workloads by means of
stress(1) and it turned out that clockevents_adjust_all_freqs()'
runtime is
- insensitive to input parameters,
- insensitive to CPU load (as it should),
- very sensitive to memory load.

Actual numbers can be found in the description of
[23/28] ("timekeeping: inform clockevents about freq adjustments"):
- CPU stressed system
  Mean: 1733.75+-81.10
  Quantiles:
    0%    25%    50%    75%   100%
  1458   1671   1717   1800   2083 (ns)

- idle system gives numbers very similar to the CPU stressed case

- Memory stressed system
  Mean: 8750.73+-1034.77
  Quantiles:
    0%    25%    50%    75%    100%
  3854   8171   8901   9452   13593 (ns)

Note that the "memory stressed system" is the ultimate worst case
scenario: all TLB, data and instruction caches are presumably cold and
memory is busy.

I did my best to improve on the memory loaded situation, c.f. the new
  [27/28] ("clockevents: optimize struct clock_event_device layout")
and
  [28/28] ("clockevents: move non-adjustable devices to the tail of the list")

The numbers finally became
  Mean: 6505.18+-740.85
  Quantiles:
    0%    25%    50%    75%    100%
  2708   6054   6523   6980   10885 (ns)

These are all >12h runs which accumulated >100000 samples and I think
that I may safely claim these 10.9us being the upper bound under a
worst case workload.

I played around with some other things like
- prefetchw()ing the clock_event_device structs,
- moving the clockevents_lock into .data, namely into the
  clockevent_devices list head's cacheline
with no further improvement.

A remaining possibility would be to always_inline
__clockevents_calc_adjust_freq(): for some reason, gcc emits it into a
page different from clockevents_adjust_all_freqs()' one. However, I
didn't do that for now.

Final note: clockevents_adjust_all_freqs()' runtime is
O(#adjustable clockevent devices). That ARM has 5 of them. Due to lack
of hardware access, I can't tell how the performance might look
like on NUMA or machines with hundreds of cores.

So, I think there are three options to proceed from here:
- leave this series (modulo other review comments) as is,
  i.e. continue to call clockevents_adjust_all_freqs() from
  update_wall_time -> ... -> timekeeping_freqadjust()
  in interrupt context,
- move the call to clockevents_adjust_all_freqs() out of interrupt
  context into some scheduled work
- or abandon this series because it's too intrusive/complex
  when weighed against the improvements it attempts to achieve.


Thanks,

Nicolai

[1] http://lkml.kernel.org/r/20160916200851.9273-1-nicstange@gmail.com


Changes to v7:
 Rebased against next-20161117.

 In order to make some room in struct clock_event_device's first
 cacheline,
  [20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int")
  [21/28] ("clockevents: pack ->state_use_accessors and ->features together")
 have been added.

 This freed space is used by the new
  [27/28] ("clockevents: optimize struct clock_event_device layout")

 The also new
  [26/28] ("clockevents: degrade ->retries to unsigned int")
 allows the just mentioned [27/28] to keep struct clock_event_device
 from crossing an extra multiple of the cacheline size on x86_64.

 The relative order of former
  [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
 and
  [15-20/23], the ->min_delta_ns => ->min_delta_ticks transformation,
 has been reversed because this allows for a more natural evolution
 of struct clock_event_device w.r.t. to its final layout.
 The latter subseries can now be found at [13-18/28] while
 the  "decouple ->max_delta_ns from ->max_delta_ticks" patch is at [22/28].

 [28/28] ("clockevents: move non-adjustable devices to the tail of the list")
 is new.

 Finally, it's worth mentioning that in former
  [15/23] ("clockevents: do comparison of delta against minimum in terms of cycles"),
 now being [13/28], the superfluous
   dev->min_delta_ticks_adjusted = max(dev->min_delta_ticks_adjusted,
                                       dev->min_delta_ticks);
 has been purged: cev_delta2ns() always rounds up and this change avoids
 an access to the cache-cold ->min_delta_ticks from
 clockevents_increase_min_delta().

 Per-patch details:
 [1-10/28]: unchanged

 [11/28] ("clockevents: always initialize ->min_delta_ns and ->max_delta_ns")
   - Skip CLOCK_EVT_FEAT_DUMMY devices in __clockevents_update_bounds().

 [12/28] ("many clockevent drivers: don't set ->min_delta_ns and ->max_delta_ns")
   - Unchanged.

 [13/28] ("clockevents: do comparison of delta against minimum in terms of cycles")
   - Former [15/23].
   - As mentioned above, purge the 
     dev->min_delta_ticks_adjusted = max(dev->min_delta_ticks_adjusted,
                                         dev->min_delta_ticks);
   - Rebase in order to apply before former
     [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
     which is now at [22/28]

 [14/28] ("clockevents: clockevents_program_min_delta(): don't set ->next_event")
   - Former [16/23], otherwise unchanged.

 [15/28] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta")
   - Former [17/23], otherwise unchanged.

 [16/28] ("clockevents: min delta increment: calculate min_delta_ns from ticks")
   - Former [18/23].
   - Rebase in order to apply with the max(...) removal in [13/28].

 [17/28] ("timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically")
   - Former [19/23].
   - Rebase in order to apply before former
     [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
     which is now at [22/28].

 [18/28] ("clockevents: purge ->min_delta_ns")
   - Former [20/23].
   - Rebase in order to apply before former
     [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
     which is now at [22/28].
   - Rebase in order to apply to apply with the max(...) removal in [13/28].

 [19/28] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
   - Former [13/23], otherwise unchanged.

 [20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int")
   - New.

 [21/28] ("clockevents: pack ->state_use_accessors and ->features together")
   - New.

 [22/28] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
   - Former [14/23].
   - Rebase to apply at this position in the series.

 [23/28] ("clockevents: initial support for mono to raw time conversion")
   - Former [21/23].
   - Changes to clockchips.h rebased in order to apply with the new
     layout of struct clock_event_device.
   - Rebase in order to apply with the max(...) removal in [13/28].
   - Rebase in order to apply after the new
     [20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int")
   - Skip CLOCK_EVT_FEAT_DUMMY devices in __clockevents_adjust_freq().

 [24/28] ("clockevents: make setting of ->mult and ->mult_adjusted atomic")
   - Former [22/23], otherwise unchanged.

 [25/28] ("timekeeping: inform clockevents about freq adjustments")
   - Former [23/23].
   - Add benchmark results to description.
   - Skip CLOCK_EVT_FEAT_DUMMY devices in clockevents_adjust_all_freqs().

 [26/28] ("clockevents: degrade ->retries to unsigned int")
   - New.

 [27/28] ("clockevents: optimize struct clock_event_device layout")
   - New.

 [28/28] ("clockevents: move non-adjustable devices to the tail of the list")
   - New.


Changes to v6:
 Rebased against next-20160916.

 [1/23]  ("clocksource: sh_cmt: compute rate before registration again")
   Do not remove braces at if statement.


Changes to v5:
 [21/23] ("clockevents: initial support for mono to raw time conversion")
   Replace the max_t() in
     adj = max_t(u64, adj, mult_ce_raw / 8);
   by min_t(): mult_ce_raw / 8 actually sets an upper bound on the
   mult adjustments.

 [23/23] ("timekeeping: inform clockevents about freq adjustments")
   Move the clockevents_adjust_all_freqs() invocation from
   timekeeping_apply_adjustment() to timekeeping_freqadjust().
   Reason is given in the patch description.


Changes to v4:
 [1-12/23] Unchanged

 [13/23] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
   New.

 [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
   New. Solves the overflow problem the former
   [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles")
   from v4 introduced.

   (Note that the former
    [14/22] ("clockevents: clockevents_program_event(): turn clc into unsigned long")
    from v4 has been purged.)

 [15/23] ("clockevents: do comparison of delta against minimum in terms of cycles")
   This is the former
   [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles"),
   but only for the ->min_delta_* -- the ->max_delta_* are handled by [14/23] now.

 [16/23] ("clockevents: clockevents_program_min_delta(): don't set ->next_event")
   Former [15/22] unchanged.

 [17/23] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta")
   Former [16/22]. Trivially fix compilation error with
   CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n.

 [18/22] ("clockevents: min delta increment: calculate min_delta_ns from ticks")
   Former [17/22] unchanged.

 [19/23] ("timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically")
   Corresponds to former
   [18/22] ("timer_list: print_tickdevice(): calculate ->*_delta_ns dynamically")
   from v4, but only for ->min_delta_ns. The changes required for the display of
   ->max_delta_ns are now being made in [14/23] already.

 [20/23] ("clockevents: purge ->min_delta_ns")
   Corresponds to former
   [19/22] ("clockevents: purge ->min_delta_ns and ->max_delta_ns"),
   but with ->max_delta_ns being kept.

 [21/23] ("clockevents: initial support for mono to raw time conversion")
   Former [20/22] with the following changes:
   - Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set.
   - Don't meld __clockevents_update_bounds() into __clockevents_adjust_freq()
     anymore: the bounds for those devices having CLOCK_EVT_FEAT_NO_ADJUST set
     must have got their bounds set as well.
   - In __clockevents_calc_adjust_freq(), make sure that the adjusted mult
     doesn't exceed the original by more than 12.5%. C.f. [14/23].
   - In timekeeping, define timekeeping_get_mono_mult() only for
     CONFIG_GENERIC_CLOCKEVENTS=y.

 [22/23] ("clockevents: make setting of ->mult and ->mult_adjusted atomic")
   Former [12/22], but with the description updated: previously, it said that
   this patch would introduce a new locking dependency. This is not true.

 [23/23] ("timekeeping: inform clockevents about freq adjustments")
   Former [22/22] with the following changes:
    - Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set.
    - In clockevents_adjust_all_freqs(), reuse the adjusted cached mult only
      if the associated ->shift also matches.
    - Introduce noop clockevents_adjust_all_freqs() in order to fix a
      compilation error with CONFIG_GENERIC_CLOCKEVENTS=n.


Nicolai Stange (28):
  clocksource: sh_cmt: compute rate before registration again
  clocksource: sh_tmu: compute rate before registration again
  clocksource: em_sti: split clock prepare and enable steps
  clocksource: em_sti: compute rate before registration
  clocksource: h8300_timer8: don't reset rate in ->set_state_oneshot()
  clockevents: make clockevents_config() static
  many clockevent drivers: set ->min_delta_ticks and ->max_delta_ticks
  arch/s390/kernel/time: set ->min_delta_ticks and ->max_delta_ticks
  arch/x86/platform/uv/uv_time: set ->min_delta_ticks and
    ->max_delta_ticks
  arch/tile/kernel/time: set ->min_delta_ticks and ->max_delta_ticks
  clockevents: always initialize ->min_delta_ns and ->max_delta_ns
  many clockevent drivers: don't set ->min_delta_ns and ->max_delta_ns
  clockevents: do comparison of delta against minimum in terms of cycles
  clockevents: clockevents_program_min_delta(): don't set ->next_event
  clockevents: use ->min_delta_ticks_adjusted to program minimum delta
  clockevents: min delta increment: calculate min_delta_ns from ticks
  timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically
  clockevents: purge ->min_delta_ns
  clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag
  clockevents: degrade ->min_delta_ticks to unsigned int
  clockevents: pack ->state_use_accessors and ->features together
  clockevents: decouple ->max_delta_ns from ->max_delta_ticks
  clockevents: initial support for mono to raw time conversion
  clockevents: make setting of ->mult and ->mult_adjusted atomic
  timekeeping: inform clockevents about freq adjustments
  clockevents: degrade ->retries to unsigned int
  clockevents: optimize struct clock_event_device layout
  clockevents: move non-adjustable devices to the tail of the list

 arch/avr32/kernel/time.c                          |   4 +-
 arch/blackfin/kernel/time-ts.c                    |   8 +-
 arch/c6x/platforms/timer64.c                      |   4 +-
 arch/hexagon/kernel/time.c                        |   4 +-
 arch/m68k/coldfire/pit.c                          |   6 +-
 arch/microblaze/kernel/timer.c                    |   6 +-
 arch/mips/alchemy/common/time.c                   |   4 +-
 arch/mips/jz4740/time.c                           |   4 +-
 arch/mips/kernel/cevt-bcm1480.c                   |   4 +-
 arch/mips/kernel/cevt-ds1287.c                    |   4 +-
 arch/mips/kernel/cevt-gt641xx.c                   |   4 +-
 arch/mips/kernel/cevt-sb1250.c                    |   4 +-
 arch/mips/kernel/cevt-txx9.c                      |   5 +-
 arch/mips/loongson32/common/time.c                |   4 +-
 arch/mips/loongson64/common/cs5536/cs5536_mfgpt.c |   4 +-
 arch/mips/loongson64/loongson-3/hpet.c            |   4 +-
 arch/mips/ralink/cevt-rt3352.c                    |   4 +-
 arch/mips/sgi-ip27/ip27-timer.c                   |   4 +-
 arch/mn10300/kernel/cevt-mn10300.c                |   4 +-
 arch/powerpc/kernel/time.c                        |   6 +-
 arch/s390/kernel/time.c                           |   4 +-
 arch/score/kernel/time.c                          |   6 +-
 arch/sparc/kernel/time_32.c                       |   4 +-
 arch/sparc/kernel/time_64.c                       |   6 +-
 arch/tile/kernel/time.c                           |   4 +-
 arch/um/kernel/time.c                             |   4 +-
 arch/unicore32/kernel/time.c                      |   6 +-
 arch/x86/kernel/apic/apic.c                       |  12 +-
 arch/x86/lguest/boot.c                            |   4 +-
 arch/x86/platform/uv/uv_time.c                    |   6 +-
 arch/x86/xen/time.c                               |   8 +-
 drivers/clocksource/dw_apb_timer.c                |   5 +-
 drivers/clocksource/em_sti.c                      |  49 +++--
 drivers/clocksource/h8300_timer8.c                |   8 -
 drivers/clocksource/metag_generic.c               |   4 +-
 drivers/clocksource/numachip.c                    |   4 +-
 drivers/clocksource/sh_cmt.c                      |  45 ++--
 drivers/clocksource/sh_tmu.c                      |  26 +--
 drivers/clocksource/timer-atlas7.c                |   4 +-
 include/linux/clockchips.h                        |  49 +++--
 kernel/time/clockevents.c                         | 244 ++++++++++++++++++----
 kernel/time/tick-broadcast-hrtimer.c              |   2 -
 kernel/time/tick-internal.h                       |   6 +
 kernel/time/timekeeping.c                         |  16 ++
 kernel/time/timer_list.c                          |  11 +-
 45 files changed, 409 insertions(+), 219 deletions(-)

-- 
2.10.2