Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753250AbcKSQBZ (ORCPT ); Sat, 19 Nov 2016 11:01:25 -0500 Received: from mail-wm0-f67.google.com ([74.125.82.67]:33185 "EHLO mail-wm0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752985AbcKSQBW (ORCPT ); Sat, 19 Nov 2016 11:01:22 -0500 From: Nicolai Stange To: Thomas Gleixner Cc: John Stultz , linux-kernel@vger.kernel.org, Nicolai Stange Subject: [RFC v8 00/28] adapt clockevents frequencies to mono clock Date: Sat, 19 Nov 2016 17:00:27 +0100 Message-Id: <20161119160055.12491-1-nicstange@gmail.com> X-Mailer: git-send-email 2.10.2 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 17156 Lines: 400 Goal: avoid programming ced devices too early for large deltas, for details, c.f. the description of [23/28]. Previous v7 can be found at [1]. While the runtimes looked Ok for x86, you were concerned about outliers on ARM: Thomas Gleixner writes: > On Wed, 21 Sep 2016, Nicolai Stange wrote: > >> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over >> ~5300 samples is 5.14+/-1.04us with a max of 11.15us. > > So why is the variance that high? > You have an outlier on that intel as well which might be caused by > NMI, but it might also be a systematic issue depending on the input > parameters. 11 us on that ARM worries me. I did those benchmarks on a defconfig'd linux-next with a Raspbian running in userspace. This turned out to be the worst choice I could have made. - The USB host controller driven by dwc2 fired 8000 interrupts per second. - Missing cpufreq support in upstream linux caused the core to always run at its minimum speed. - The Raspbian userspace was a mostly idle but quite uncontrolled workload. So, for the next round of measurements, I - disabled almost all drivers, USB in particular, - told the firmware to always run the core at its nominal max of 900MHz - and used Busybox for the userspace. While the adjustment process, i.e. the input parameters, had formerly been driven by a NTP daemon, I switched to injecting deterministic offsets via adjtimex(2) from time to time. Due to their exponential decay, the adjustments span a larger range now. On top of that, I generated various artificial workloads by means of stress(1) and it turned out that clockevents_adjust_all_freqs()' runtime is - insensitive to input parameters, - insensitive to CPU load (as it should), - very sensitive to memory load. Actual numbers can be found in the description of [23/28] ("timekeeping: inform clockevents about freq adjustments"): - CPU stressed system Mean: 1733.75+-81.10 Quantiles: 0% 25% 50% 75% 100% 1458 1671 1717 1800 2083 (ns) - idle system gives numbers very similar to the CPU stressed case - Memory stressed system Mean: 8750.73+-1034.77 Quantiles: 0% 25% 50% 75% 100% 3854 8171 8901 9452 13593 (ns) Note that the "memory stressed system" is the ultimate worst case scenario: all TLB, data and instruction caches are presumably cold and memory is busy. I did my best to improve on the memory loaded situation, c.f. the new [27/28] ("clockevents: optimize struct clock_event_device layout") and [28/28] ("clockevents: move non-adjustable devices to the tail of the list") The numbers finally became Mean: 6505.18+-740.85 Quantiles: 0% 25% 50% 75% 100% 2708 6054 6523 6980 10885 (ns) These are all >12h runs which accumulated >100000 samples and I think that I may safely claim these 10.9us being the upper bound under a worst case workload. I played around with some other things like - prefetchw()ing the clock_event_device structs, - moving the clockevents_lock into .data, namely into the clockevent_devices list head's cacheline with no further improvement. A remaining possibility would be to always_inline __clockevents_calc_adjust_freq(): for some reason, gcc emits it into a page different from clockevents_adjust_all_freqs()' one. However, I didn't do that for now. Final note: clockevents_adjust_all_freqs()' runtime is O(#adjustable clockevent devices). That ARM has 5 of them. Due to lack of hardware access, I can't tell how the performance might look like on NUMA or machines with hundreds of cores. So, I think there are three options to proceed from here: - leave this series (modulo other review comments) as is, i.e. continue to call clockevents_adjust_all_freqs() from update_wall_time -> ... -> timekeeping_freqadjust() in interrupt context, - move the call to clockevents_adjust_all_freqs() out of interrupt context into some scheduled work - or abandon this series because it's too intrusive/complex when weighed against the improvements it attempts to achieve. Thanks, Nicolai [1] http://lkml.kernel.org/r/20160916200851.9273-1-nicstange@gmail.com Changes to v7: Rebased against next-20161117. In order to make some room in struct clock_event_device's first cacheline, [20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int") [21/28] ("clockevents: pack ->state_use_accessors and ->features together") have been added. This freed space is used by the new [27/28] ("clockevents: optimize struct clock_event_device layout") The also new [26/28] ("clockevents: degrade ->retries to unsigned int") allows the just mentioned [27/28] to keep struct clock_event_device from crossing an extra multiple of the cacheline size on x86_64. The relative order of former [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks") and [15-20/23], the ->min_delta_ns => ->min_delta_ticks transformation, has been reversed because this allows for a more natural evolution of struct clock_event_device w.r.t. to its final layout. The latter subseries can now be found at [13-18/28] while the "decouple ->max_delta_ns from ->max_delta_ticks" patch is at [22/28]. [28/28] ("clockevents: move non-adjustable devices to the tail of the list") is new. Finally, it's worth mentioning that in former [15/23] ("clockevents: do comparison of delta against minimum in terms of cycles"), now being [13/28], the superfluous dev->min_delta_ticks_adjusted = max(dev->min_delta_ticks_adjusted, dev->min_delta_ticks); has been purged: cev_delta2ns() always rounds up and this change avoids an access to the cache-cold ->min_delta_ticks from clockevents_increase_min_delta(). Per-patch details: [1-10/28]: unchanged [11/28] ("clockevents: always initialize ->min_delta_ns and ->max_delta_ns") - Skip CLOCK_EVT_FEAT_DUMMY devices in __clockevents_update_bounds(). [12/28] ("many clockevent drivers: don't set ->min_delta_ns and ->max_delta_ns") - Unchanged. [13/28] ("clockevents: do comparison of delta against minimum in terms of cycles") - Former [15/23]. - As mentioned above, purge the dev->min_delta_ticks_adjusted = max(dev->min_delta_ticks_adjusted, dev->min_delta_ticks); - Rebase in order to apply before former [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks") which is now at [22/28] [14/28] ("clockevents: clockevents_program_min_delta(): don't set ->next_event") - Former [16/23], otherwise unchanged. [15/28] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta") - Former [17/23], otherwise unchanged. [16/28] ("clockevents: min delta increment: calculate min_delta_ns from ticks") - Former [18/23]. - Rebase in order to apply with the max(...) removal in [13/28]. [17/28] ("timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically") - Former [19/23]. - Rebase in order to apply before former [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks") which is now at [22/28]. [18/28] ("clockevents: purge ->min_delta_ns") - Former [20/23]. - Rebase in order to apply before former [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks") which is now at [22/28]. - Rebase in order to apply to apply with the max(...) removal in [13/28]. [19/28] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag") - Former [13/23], otherwise unchanged. [20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int") - New. [21/28] ("clockevents: pack ->state_use_accessors and ->features together") - New. [22/28] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks") - Former [14/23]. - Rebase to apply at this position in the series. [23/28] ("clockevents: initial support for mono to raw time conversion") - Former [21/23]. - Changes to clockchips.h rebased in order to apply with the new layout of struct clock_event_device. - Rebase in order to apply with the max(...) removal in [13/28]. - Rebase in order to apply after the new [20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int") - Skip CLOCK_EVT_FEAT_DUMMY devices in __clockevents_adjust_freq(). [24/28] ("clockevents: make setting of ->mult and ->mult_adjusted atomic") - Former [22/23], otherwise unchanged. [25/28] ("timekeeping: inform clockevents about freq adjustments") - Former [23/23]. - Add benchmark results to description. - Skip CLOCK_EVT_FEAT_DUMMY devices in clockevents_adjust_all_freqs(). [26/28] ("clockevents: degrade ->retries to unsigned int") - New. [27/28] ("clockevents: optimize struct clock_event_device layout") - New. [28/28] ("clockevents: move non-adjustable devices to the tail of the list") - New. Changes to v6: Rebased against next-20160916. [1/23] ("clocksource: sh_cmt: compute rate before registration again") Do not remove braces at if statement. Changes to v5: [21/23] ("clockevents: initial support for mono to raw time conversion") Replace the max_t() in adj = max_t(u64, adj, mult_ce_raw / 8); by min_t(): mult_ce_raw / 8 actually sets an upper bound on the mult adjustments. [23/23] ("timekeeping: inform clockevents about freq adjustments") Move the clockevents_adjust_all_freqs() invocation from timekeeping_apply_adjustment() to timekeeping_freqadjust(). Reason is given in the patch description. Changes to v4: [1-12/23] Unchanged [13/23] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag") New. [14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks") New. Solves the overflow problem the former [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles") from v4 introduced. (Note that the former [14/22] ("clockevents: clockevents_program_event(): turn clc into unsigned long") from v4 has been purged.) [15/23] ("clockevents: do comparison of delta against minimum in terms of cycles") This is the former [13/22] ("clockevents: check a programmed delta's bounds in terms of cycles"), but only for the ->min_delta_* -- the ->max_delta_* are handled by [14/23] now. [16/23] ("clockevents: clockevents_program_min_delta(): don't set ->next_event") Former [15/22] unchanged. [17/23] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta") Former [16/22]. Trivially fix compilation error with CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n. [18/22] ("clockevents: min delta increment: calculate min_delta_ns from ticks") Former [17/22] unchanged. [19/23] ("timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically") Corresponds to former [18/22] ("timer_list: print_tickdevice(): calculate ->*_delta_ns dynamically") from v4, but only for ->min_delta_ns. The changes required for the display of ->max_delta_ns are now being made in [14/23] already. [20/23] ("clockevents: purge ->min_delta_ns") Corresponds to former [19/22] ("clockevents: purge ->min_delta_ns and ->max_delta_ns"), but with ->max_delta_ns being kept. [21/23] ("clockevents: initial support for mono to raw time conversion") Former [20/22] with the following changes: - Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set. - Don't meld __clockevents_update_bounds() into __clockevents_adjust_freq() anymore: the bounds for those devices having CLOCK_EVT_FEAT_NO_ADJUST set must have got their bounds set as well. - In __clockevents_calc_adjust_freq(), make sure that the adjusted mult doesn't exceed the original by more than 12.5%. C.f. [14/23]. - In timekeeping, define timekeeping_get_mono_mult() only for CONFIG_GENERIC_CLOCKEVENTS=y. [22/23] ("clockevents: make setting of ->mult and ->mult_adjusted atomic") Former [12/22], but with the description updated: previously, it said that this patch would introduce a new locking dependency. This is not true. [23/23] ("timekeeping: inform clockevents about freq adjustments") Former [22/22] with the following changes: - Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set. - In clockevents_adjust_all_freqs(), reuse the adjusted cached mult only if the associated ->shift also matches. - Introduce noop clockevents_adjust_all_freqs() in order to fix a compilation error with CONFIG_GENERIC_CLOCKEVENTS=n. Nicolai Stange (28): clocksource: sh_cmt: compute rate before registration again clocksource: sh_tmu: compute rate before registration again clocksource: em_sti: split clock prepare and enable steps clocksource: em_sti: compute rate before registration clocksource: h8300_timer8: don't reset rate in ->set_state_oneshot() clockevents: make clockevents_config() static many clockevent drivers: set ->min_delta_ticks and ->max_delta_ticks arch/s390/kernel/time: set ->min_delta_ticks and ->max_delta_ticks arch/x86/platform/uv/uv_time: set ->min_delta_ticks and ->max_delta_ticks arch/tile/kernel/time: set ->min_delta_ticks and ->max_delta_ticks clockevents: always initialize ->min_delta_ns and ->max_delta_ns many clockevent drivers: don't set ->min_delta_ns and ->max_delta_ns clockevents: do comparison of delta against minimum in terms of cycles clockevents: clockevents_program_min_delta(): don't set ->next_event clockevents: use ->min_delta_ticks_adjusted to program minimum delta clockevents: min delta increment: calculate min_delta_ns from ticks timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically clockevents: purge ->min_delta_ns clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag clockevents: degrade ->min_delta_ticks to unsigned int clockevents: pack ->state_use_accessors and ->features together clockevents: decouple ->max_delta_ns from ->max_delta_ticks clockevents: initial support for mono to raw time conversion clockevents: make setting of ->mult and ->mult_adjusted atomic timekeeping: inform clockevents about freq adjustments clockevents: degrade ->retries to unsigned int clockevents: optimize struct clock_event_device layout clockevents: move non-adjustable devices to the tail of the list arch/avr32/kernel/time.c | 4 +- arch/blackfin/kernel/time-ts.c | 8 +- arch/c6x/platforms/timer64.c | 4 +- arch/hexagon/kernel/time.c | 4 +- arch/m68k/coldfire/pit.c | 6 +- arch/microblaze/kernel/timer.c | 6 +- arch/mips/alchemy/common/time.c | 4 +- arch/mips/jz4740/time.c | 4 +- arch/mips/kernel/cevt-bcm1480.c | 4 +- arch/mips/kernel/cevt-ds1287.c | 4 +- arch/mips/kernel/cevt-gt641xx.c | 4 +- arch/mips/kernel/cevt-sb1250.c | 4 +- arch/mips/kernel/cevt-txx9.c | 5 +- arch/mips/loongson32/common/time.c | 4 +- arch/mips/loongson64/common/cs5536/cs5536_mfgpt.c | 4 +- arch/mips/loongson64/loongson-3/hpet.c | 4 +- arch/mips/ralink/cevt-rt3352.c | 4 +- arch/mips/sgi-ip27/ip27-timer.c | 4 +- arch/mn10300/kernel/cevt-mn10300.c | 4 +- arch/powerpc/kernel/time.c | 6 +- arch/s390/kernel/time.c | 4 +- arch/score/kernel/time.c | 6 +- arch/sparc/kernel/time_32.c | 4 +- arch/sparc/kernel/time_64.c | 6 +- arch/tile/kernel/time.c | 4 +- arch/um/kernel/time.c | 4 +- arch/unicore32/kernel/time.c | 6 +- arch/x86/kernel/apic/apic.c | 12 +- arch/x86/lguest/boot.c | 4 +- arch/x86/platform/uv/uv_time.c | 6 +- arch/x86/xen/time.c | 8 +- drivers/clocksource/dw_apb_timer.c | 5 +- drivers/clocksource/em_sti.c | 49 +++-- drivers/clocksource/h8300_timer8.c | 8 - drivers/clocksource/metag_generic.c | 4 +- drivers/clocksource/numachip.c | 4 +- drivers/clocksource/sh_cmt.c | 45 ++-- drivers/clocksource/sh_tmu.c | 26 +-- drivers/clocksource/timer-atlas7.c | 4 +- include/linux/clockchips.h | 49 +++-- kernel/time/clockevents.c | 244 ++++++++++++++++++---- kernel/time/tick-broadcast-hrtimer.c | 2 - kernel/time/tick-internal.h | 6 + kernel/time/timekeeping.c | 16 ++ kernel/time/timer_list.c | 11 +- 45 files changed, 409 insertions(+), 219 deletions(-) -- 2.10.2