Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754001AbcKSQMg (ORCPT ); Sat, 19 Nov 2016 11:12:36 -0500 Received: from mail-wj0-f193.google.com ([209.85.210.193]:33902 "EHLO mail-wj0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753510AbcKSQMb (ORCPT ); Sat, 19 Nov 2016 11:12:31 -0500 X-Greylist: delayed 655 seconds by postgrey-1.27 at vger.kernel.org; Sat, 19 Nov 2016 11:12:30 EST From: Nicolai Stange To: Thomas Gleixner Cc: John Stultz , linux-kernel@vger.kernel.org, Nicolai Stange Subject: [RFC v8 27/28] clockevents: optimize struct clock_event_device layout Date: Sat, 19 Nov 2016 17:10:35 +0100 Message-Id: <20161119161036.12679-16-nicstange@gmail.com> X-Mailer: git-send-email 2.10.2 In-Reply-To: <20161119160055.12491-1-nicstange@gmail.com> References: <20161119160055.12491-1-nicstange@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6085 Lines: 144 Benchmarks have shown that clockevents_adjust_all_freqs() is memory bound. The members of struct clock_event_device accessed therein, namely ->features, ->mult_adjusted, ->shift, ->mult and ->list are split across two cachelines. In what follows, a cacheline size of 64 bytes will be assumed. I.) sizeof(long) == 4 At least for sizeof(long) == sizeof(void*) == 4, e.g. on ARM, this situation can be improved by reordering the members of struct clock_event_device. This patch exploits the fact that with the current layout, there is some padding inserted within the first cacheline on ARM: right now, the first three members, which are of pointer type and 12 bytes in total, are followed by a ktime_t. Baically being a s64, the latter gets aligned on an 8 byte boundary. Thus, bytes 12 to 16 are occupied by unused padding. Furthermore, for sizeof(unsigned long) == 4, this patch series has achieved another net saving of 4 bytes within the first cacheline so far. Summarizing: through reordering of the first cacheline's members, a total space of 8 bytes can be made available, compared to the state before this series. This allows for the placement of the 12 bytes made up by ->list and ->mult within the first cacheline while moving only one 4 byte member out. This victim will be ->set_state_periodic() which will be moved to its ->set_state_*() relatives within the second cacheline. II.) sizeof(long) == 8 For the case of sizeof(long) == sizeof(void*) == 8, the first cacheline's contents will be unchanged. The ->list and ->mult members are moved from the third to the start of the second cacheline. Compared to the situation before this series, this will move ->broadcast() and ->suspend() from the end of the second cacheline to the start of the third one. A nice side effect is that due to the elimination of padding in the third cacheline, together with the patch ("clockevents: degrade ->retries to unsigned int"), the struct clock_event_device now fits exactly into a multiple of three cachelines. Thus, in struct clock_event_device: - move the ->event_handler(), ->set_next_event() and ->set_next_ktime() in order to avoid padding in front of ->next_event on ARM. - move the ->list and ->mult members right before the ->retries member. Benchmark results: The following measurements have been carried out on a Raspberry Pi 2B (armv7, 4 cores, 900MHz). The adjustment process has been driven by periodically injecting a certain offset via adjtimex(2) approximately every 3.7h. A 'stress --vm 8 --vm-bytes 32M' was running. The runtime of clockevents_adjust_all_freqs() has been measured. - Before this patch: Mean: 8750.73+-1034.77 Quantiles: 0% 25% 50% 75% 100% 3854 8171 8901 9452 13593 (ns) - After this patch: Mean: 6916.90+-782.63 Quantiles: 0% 25% 50% 75% 100% 3072 6412 6927 7430 10989 (ns) Signed-off-by: Nicolai Stange --- include/linux/clockchips.h | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 878a802..17ef940 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -74,10 +74,6 @@ enum clock_event_state { /** * struct clock_event_device - clock event device descriptor - * @event_handler: Assigned by the framework to be called by the low - * level handler of the event source - * @set_next_event: set next event function using a clocksource delta - * @set_next_ktime: set next event function using a direct ktime value * @next_event: local storage for the next event in oneshot mode * @max_delta_ns: maximum delta value in ns * @max_delta_ticks: maximum delta value in ticks @@ -86,6 +82,12 @@ enum clock_event_state { * @shift: nanoseconds to cycles divisor (power of two) * @state_use_accessors:current state of the device, assigned by the core code * @features: features + * @event_handler: Assigned by the framework to be called by the low + * level handler of the event source + * @set_next_event: set next event function using a clocksource delta + * @set_next_ktime: set next event function using a direct ktime value + * @list: list head for the management code + * @mult: ns to cycles multiplier stored for reconfiguration * @retries: number of forced programming retries * @set_state_periodic: switch state to periodic * @set_state_oneshot: switch state to oneshot @@ -95,18 +97,13 @@ enum clock_event_state { * @broadcast: function to broadcast events * @name: ptr to clock event name * @min_delta_ticks: minimum delta value in ticks stored for reconfiguration - * @mult: ns to cycles multiplier stored for reconfiguration * @rating: variable to rate clock event devices * @irq: IRQ number (only for non CPU local devices) * @bound_on: Bound on CPU * @cpumask: cpumask to indicate for which CPUs this device works - * @list: list head for the management code * @owner: module reference */ struct clock_event_device { - void (*event_handler)(struct clock_event_device *); - int (*set_next_event)(unsigned long evt, struct clock_event_device *); - int (*set_next_ktime)(ktime_t expires, struct clock_event_device *); ktime_t next_event; u64 max_delta_ns; unsigned long max_delta_ticks; @@ -115,6 +112,11 @@ struct clock_event_device { u32 shift; enum clock_event_state state_use_accessors:8; unsigned int features:24; + void (*event_handler)(struct clock_event_device *); + int (*set_next_event)(unsigned long evt, struct clock_event_device *); + int (*set_next_ktime)(ktime_t expires, struct clock_event_device *); + struct list_head list; + u32 mult; unsigned int retries; int (*set_state_periodic)(struct clock_event_device *); @@ -129,12 +131,10 @@ struct clock_event_device { const char *name; unsigned int min_delta_ticks; - u32 mult; int rating; int irq; int bound_on; const struct cpumask *cpumask; - struct list_head list; struct module *owner; } ____cacheline_aligned; -- 2.10.2