2015-02-08 12:02:58

by Daniel Thompson

[permalink] [raw]
Subject: [PATCH v4 0/5] sched_clock: Optimize and avoid deadlock during read from NMI

This patchset optimizes the generic sched_clock implementation by
removing branches and significantly reducing the data cache profile. It
also makes it safe to call sched_clock() from NMI (or FIQ on ARM).

The data cache profile of sched_clock() in the original code is
somewhere between 2 and 3 (64-byte) cache lines, depending on alignment
of struct clock_data. After patching, the cache profile for the normal
case should be a single cacheline.

NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
using the perf handler to check that sched_clock() returned monotonic
values. At the same time I forcefully reduced kt_wrap so that
update_sched_clock() is being called at >1000Hz.

Without the patches the above system is grossly unstable, surviving
[9K,115K,25K] perf event cycles during three separate runs. With the
patch I ran for over 9M perf event cycles before getting bored.

v4:
* Optimized sched_clock() to be branchless by introducing a dummy
function to provide clock values while the clock is suspended
(Stephen Boyd).
* Improved commenting, including the kerneldoc comments (Stephen Boyd).
* Removed a redundant notrace from the update logic (Steven Rostedt).

v3:
* Optimized to minimise cache profile, including elimination of
the suspended flag (Thomas Gleixner).
* Replaced the update_bank_begin/end with a single update function
(Thomas Gleixner).
* Split into multiple patches to aid review.

v2:
* Extended the scope of the read lock in sched_clock() so we can bank
all data consumed there (John Stultz)


Daniel Thompson (5):
sched_clock: Match scope of read and write seqcounts
sched_clock: Optimize cache line usage
sched_clock: Remove suspend from clock_read_data
sched_clock: Remove redundant notrace from update function
sched_clock: Avoid deadlock during read from NMI

kernel/time/sched_clock.c | 195 ++++++++++++++++++++++++++++++++--------------
1 file changed, 138 insertions(+), 57 deletions(-)

--
2.1.0


2015-02-08 12:03:04

by Daniel Thompson

[permalink] [raw]
Subject: [PATCH v4 1/5] sched_clock: Match scope of read and write seqcounts

Currently the scope of the raw_write_seqcount_begin/end in
sched_clock_register far exceeds the scope of the read section in
sched_clock. This gives the impression of safety during cursory review
but achieves little.

Note that this is likely to be a latent issue at present because
sched_clock_register() is typically called before we enable interrupts,
however the issue does risk bugs being needlessly introduced as the code
evolves.

This patch fixes the problem by increasing the scope of the read locking
performed by sched_clock() to cover all data modified by
sched_clock_register.

We also improve clarity by moving writes to struct clock_data that do
not impact sched_clock() outside of the critical section.

Signed-off-by: Daniel Thompson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
---
kernel/time/sched_clock.c | 25 +++++++++++--------------
1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 01d2d15aa662..3d21a8719444 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -58,23 +58,21 @@ static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)

unsigned long long notrace sched_clock(void)
{
- u64 epoch_ns;
- u64 epoch_cyc;
- u64 cyc;
+ u64 cyc, res;
unsigned long seq;

- if (cd.suspended)
- return cd.epoch_ns;
-
do {
seq = raw_read_seqcount_begin(&cd.seq);
- epoch_cyc = cd.epoch_cyc;
- epoch_ns = cd.epoch_ns;
+
+ res = cd.epoch_ns;
+ if (!cd.suspended) {
+ cyc = read_sched_clock();
+ cyc = (cyc - cd.epoch_cyc) & sched_clock_mask;
+ res += cyc_to_ns(cyc, cd.mult, cd.shift);
+ }
} while (read_seqcount_retry(&cd.seq, seq));

- cyc = read_sched_clock();
- cyc = (cyc - epoch_cyc) & sched_clock_mask;
- return epoch_ns + cyc_to_ns(cyc, cd.mult, cd.shift);
+ return res;
}

/*
@@ -124,10 +122,11 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
clocks_calc_mult_shift(&new_mult, &new_shift, rate, NSEC_PER_SEC, 3600);

new_mask = CLOCKSOURCE_MASK(bits);
+ cd.rate = rate;

/* calculate how many ns until we wrap */
wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask);
- new_wrap_kt = ns_to_ktime(wrap - (wrap >> 3));
+ cd.wrap_kt = ns_to_ktime(wrap - (wrap >> 3));

/* update epoch for new counter and update epoch_ns from old counter*/
new_epoch = read();
@@ -138,8 +137,6 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
raw_write_seqcount_begin(&cd.seq);
read_sched_clock = read;
sched_clock_mask = new_mask;
- cd.rate = rate;
- cd.wrap_kt = new_wrap_kt;
cd.mult = new_mult;
cd.shift = new_shift;
cd.epoch_cyc = new_epoch;
--
2.1.0

2015-02-08 12:04:32

by Daniel Thompson

[permalink] [raw]
Subject: [PATCH v4 2/5] sched_clock: Optimize cache line usage

Currently sched_clock(), a very hot code path, is not optimized to
minimise its cache profile. In particular:

1. cd is not ____cacheline_aligned,

2. struct clock_data does not distinguish between hotpath and
coldpath data, reducing locality of reference in the hotpath,

3. Some hotpath data is missing from struct clock_data and is marked
__read_mostly (which more or less guarantees it will not share a
cache line with cd).

This patch corrects these problems by extracting all hotpath data
into a separate structure and using ____cacheline_aligned to ensure
the hotpath uses a single (64 byte) cache line.

Signed-off-by: Daniel Thompson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
---
kernel/time/sched_clock.c | 113 +++++++++++++++++++++++++++++++---------------
1 file changed, 77 insertions(+), 36 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 3d21a8719444..695b2ac2e8b4 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -18,28 +18,59 @@
#include <linux/seqlock.h>
#include <linux/bitops.h>

-struct clock_data {
- ktime_t wrap_kt;
+/**
+ * struct clock_read_data - data required to read from sched_clock
+ *
+ * @epoch_ns: sched_clock value at last update
+ * @epoch_cyc: Clock cycle value at last update
+ * @sched_clock_mask: Bitmask for two's complement subtraction of non 64bit
+ * clocks
+ * @read_sched_clock: Current clock source (or dummy source when suspended)
+ * @mult: Multipler for scaled math conversion
+ * @shift: Shift value for scaled math conversion
+ * @suspended: Flag to indicate if the clock is suspended (stopped)
+ *
+ * Care must be taken when updating this structure; it is read by
+ * some very hot code paths. It occupies <=48 bytes and, when combined
+ * with the seqcount used to synchronize access, comfortably fits into
+ * a 64 byte cache line.
+ */
+struct clock_read_data {
u64 epoch_ns;
u64 epoch_cyc;
- seqcount_t seq;
- unsigned long rate;
+ u64 sched_clock_mask;
+ u64 (*read_sched_clock)(void);
u32 mult;
u32 shift;
bool suspended;
};

+/**
+ * struct clock_data - all data needed for sched_clock (including
+ * registration of a new clock source)
+ *
+ * @seq: Sequence counter for protecting updates.
+ * @read_data: Data required to read from sched_clock.
+ * @wrap_kt: Duration for which clock can run before wrapping
+ * @rate: Tick rate of the registered clock
+ * @actual_read_sched_clock: Registered clock read function
+ *
+ * The ordering of this structure has been chosen to optimize cache
+ * performance. In particular seq and read_data (combined) should fit
+ * into a single 64 byte cache line.
+ */
+struct clock_data {
+ seqcount_t seq;
+ struct clock_read_data read_data;
+ ktime_t wrap_kt;
+ unsigned long rate;
+};
+
static struct hrtimer sched_clock_timer;
static int irqtime = -1;

core_param(irqtime, irqtime, int, 0400);

-static struct clock_data cd = {
- .mult = NSEC_PER_SEC / HZ,
-};
-
-static u64 __read_mostly sched_clock_mask;
-
static u64 notrace jiffy_sched_clock_read(void)
{
/*
@@ -49,7 +80,10 @@ static u64 notrace jiffy_sched_clock_read(void)
return (u64)(jiffies - INITIAL_JIFFIES);
}

-static u64 __read_mostly (*read_sched_clock)(void) = jiffy_sched_clock_read;
+static struct clock_data cd ____cacheline_aligned = {
+ .read_data = { .mult = NSEC_PER_SEC / HZ,
+ .read_sched_clock = jiffy_sched_clock_read, },
+};

static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
{
@@ -60,15 +94,16 @@ unsigned long long notrace sched_clock(void)
{
u64 cyc, res;
unsigned long seq;
+ struct clock_read_data *rd = &cd.read_data;

do {
seq = raw_read_seqcount_begin(&cd.seq);

- res = cd.epoch_ns;
- if (!cd.suspended) {
- cyc = read_sched_clock();
- cyc = (cyc - cd.epoch_cyc) & sched_clock_mask;
- res += cyc_to_ns(cyc, cd.mult, cd.shift);
+ res = rd->epoch_ns;
+ if (!rd->suspended) {
+ cyc = rd->read_sched_clock();
+ cyc = (cyc - rd->epoch_cyc) & rd->sched_clock_mask;
+ res += cyc_to_ns(cyc, rd->mult, rd->shift);
}
} while (read_seqcount_retry(&cd.seq, seq));

@@ -83,16 +118,17 @@ static void notrace update_sched_clock(void)
unsigned long flags;
u64 cyc;
u64 ns;
+ struct clock_read_data *rd = &cd.read_data;

- cyc = read_sched_clock();
- ns = cd.epoch_ns +
- cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
- cd.mult, cd.shift);
+ cyc = rd->read_sched_clock();
+ ns = rd->epoch_ns +
+ cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+ rd->mult, rd->shift);

raw_local_irq_save(flags);
raw_write_seqcount_begin(&cd.seq);
- cd.epoch_ns = ns;
- cd.epoch_cyc = cyc;
+ rd->epoch_ns = ns;
+ rd->epoch_cyc = cyc;
raw_write_seqcount_end(&cd.seq);
raw_local_irq_restore(flags);
}
@@ -109,9 +145,9 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
{
u64 res, wrap, new_mask, new_epoch, cyc, ns;
u32 new_mult, new_shift;
- ktime_t new_wrap_kt;
unsigned long r;
char r_unit;
+ struct clock_read_data *rd = &cd.read_data;

if (cd.rate > rate)
return;
@@ -130,17 +166,18 @@ void __init sched_clock_register(u64 (*read)(void), int bits,

/* update epoch for new counter and update epoch_ns from old counter*/
new_epoch = read();
- cyc = read_sched_clock();
- ns = cd.epoch_ns + cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
- cd.mult, cd.shift);
+ cyc = rd->read_sched_clock();
+ ns = rd->epoch_ns +
+ cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+ rd->mult, rd->shift);

raw_write_seqcount_begin(&cd.seq);
- read_sched_clock = read;
- sched_clock_mask = new_mask;
- cd.mult = new_mult;
- cd.shift = new_shift;
- cd.epoch_cyc = new_epoch;
- cd.epoch_ns = ns;
+ rd->read_sched_clock = read;
+ rd->sched_clock_mask = new_mask;
+ rd->mult = new_mult;
+ rd->shift = new_shift;
+ rd->epoch_cyc = new_epoch;
+ rd->epoch_ns = ns;
raw_write_seqcount_end(&cd.seq);

r = rate;
@@ -172,7 +209,7 @@ void __init sched_clock_postinit(void)
* If no sched_clock function has been provided at that point,
* make it the final one one.
*/
- if (read_sched_clock == jiffy_sched_clock_read)
+ if (cd.read_data.read_sched_clock == jiffy_sched_clock_read)
sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);

update_sched_clock();
@@ -188,17 +225,21 @@ void __init sched_clock_postinit(void)

static int sched_clock_suspend(void)
{
+ struct clock_read_data *rd = &cd.read_data;
+
update_sched_clock();
hrtimer_cancel(&sched_clock_timer);
- cd.suspended = true;
+ rd->suspended = true;
return 0;
}

static void sched_clock_resume(void)
{
- cd.epoch_cyc = read_sched_clock();
+ struct clock_read_data *rd = &cd.read_data;
+
+ rd->epoch_cyc = rd->read_sched_clock();
hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
- cd.suspended = false;
+ rd->suspended = false;
}

static struct syscore_ops sched_clock_ops = {
--
2.1.0

2015-02-08 12:03:10

by Daniel Thompson

[permalink] [raw]
Subject: [PATCH v4 3/5] sched_clock: Remove suspend from clock_read_data

Currently cd.read_data.suspended is read by the hotpath function
sched_clock(). This variable need not be accessed on the hotpath. In
fact, once it is removed, we can remove the conditional branches from
sched_clock() and install a dummy read_sched_clock function to suspend
the clock.

The new master copy of the function pointer (actual_read_sched_clock)
is introduced and is used this for all reads of the clock hardware
except those within sched_clock itself.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Daniel Thompson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
---
kernel/time/sched_clock.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 695b2ac2e8b4..5d6407951fb8 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -28,10 +28,9 @@
* @read_sched_clock: Current clock source (or dummy source when suspended)
* @mult: Multipler for scaled math conversion
* @shift: Shift value for scaled math conversion
- * @suspended: Flag to indicate if the clock is suspended (stopped)
*
* Care must be taken when updating this structure; it is read by
- * some very hot code paths. It occupies <=48 bytes and, when combined
+ * some very hot code paths. It occupies <=40 bytes and, when combined
* with the seqcount used to synchronize access, comfortably fits into
* a 64 byte cache line.
*/
@@ -42,7 +41,6 @@ struct clock_read_data {
u64 (*read_sched_clock)(void);
u32 mult;
u32 shift;
- bool suspended;
};

/**
@@ -64,6 +62,7 @@ struct clock_data {
struct clock_read_data read_data;
ktime_t wrap_kt;
unsigned long rate;
+ u64 (*actual_read_sched_clock)(void);
};

static struct hrtimer sched_clock_timer;
@@ -83,6 +82,8 @@ static u64 notrace jiffy_sched_clock_read(void)
static struct clock_data cd ____cacheline_aligned = {
.read_data = { .mult = NSEC_PER_SEC / HZ,
.read_sched_clock = jiffy_sched_clock_read, },
+ .actual_read_sched_clock = jiffy_sched_clock_read,
+
};

static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
@@ -99,12 +100,9 @@ unsigned long long notrace sched_clock(void)
do {
seq = raw_read_seqcount_begin(&cd.seq);

- res = rd->epoch_ns;
- if (!rd->suspended) {
- cyc = rd->read_sched_clock();
- cyc = (cyc - rd->epoch_cyc) & rd->sched_clock_mask;
- res += cyc_to_ns(cyc, rd->mult, rd->shift);
- }
+ cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
+ rd->sched_clock_mask;
+ res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
} while (read_seqcount_retry(&cd.seq, seq));

return res;
@@ -120,7 +118,7 @@ static void notrace update_sched_clock(void)
u64 ns;
struct clock_read_data *rd = &cd.read_data;

- cyc = rd->read_sched_clock();
+ cyc = cd.actual_read_sched_clock();
ns = rd->epoch_ns +
cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
rd->mult, rd->shift);
@@ -166,10 +164,11 @@ void __init sched_clock_register(u64 (*read)(void), int bits,

/* update epoch for new counter and update epoch_ns from old counter*/
new_epoch = read();
- cyc = rd->read_sched_clock();
+ cyc = cd.actual_read_sched_clock();
ns = rd->epoch_ns +
cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
rd->mult, rd->shift);
+ cd.actual_read_sched_clock = read;

raw_write_seqcount_begin(&cd.seq);
rd->read_sched_clock = read;
@@ -209,7 +208,7 @@ void __init sched_clock_postinit(void)
* If no sched_clock function has been provided at that point,
* make it the final one one.
*/
- if (cd.read_data.read_sched_clock == jiffy_sched_clock_read)
+ if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);

update_sched_clock();
@@ -223,13 +222,24 @@ void __init sched_clock_postinit(void)
hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
}

+/*
+ * Clock read function for use when the clock is suspended.
+ *
+ * This function makes it appear to sched_clock() as if the clock
+ * stopped counting at its last update.
+ */
+static u64 notrace suspended_sched_clock_read(void)
+{
+ return cd.read_data.epoch_cyc;
+}
+
static int sched_clock_suspend(void)
{
struct clock_read_data *rd = &cd.read_data;

update_sched_clock();
hrtimer_cancel(&sched_clock_timer);
- rd->suspended = true;
+ rd->read_sched_clock = suspended_sched_clock_read;
return 0;
}

@@ -237,9 +247,9 @@ static void sched_clock_resume(void)
{
struct clock_read_data *rd = &cd.read_data;

- rd->epoch_cyc = rd->read_sched_clock();
+ rd->epoch_cyc = cd.actual_read_sched_clock();
hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
- rd->suspended = false;
+ rd->read_sched_clock = cd.actual_read_sched_clock;
}

static struct syscore_ops sched_clock_ops = {
--
2.1.0

2015-02-08 12:04:10

by Daniel Thompson

[permalink] [raw]
Subject: [PATCH v4 4/5] sched_clock: Remove redundant notrace from update function

Currently update_sched_clock() is marked as notrace but this function
is not called by ftrace. This is trivially fixed by removing the mark
up.

Signed-off-by: Daniel Thompson <[email protected]>
---
kernel/time/sched_clock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 5d6407951fb8..9280327676dc 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -111,7 +111,7 @@ unsigned long long notrace sched_clock(void)
/*
* Atomically update the sched_clock epoch.
*/
-static void notrace update_sched_clock(void)
+static void update_sched_clock(void)
{
unsigned long flags;
u64 cyc;
--
2.1.0

2015-02-08 12:03:16

by Daniel Thompson

[permalink] [raw]
Subject: [PATCH v4 5/5] sched_clock: Avoid deadlock during read from NMI

Currently it is possible for an NMI (or FIQ on ARM) to come in and
read sched_clock() whilst update_sched_clock() has locked the seqcount
for writing. This results in the NMI handler locking up when it calls
raw_read_seqcount_begin().

This patch fixes the NMI safety issues by providing banked clock data.
This is a similar approach to the one used in Thomas Gleixner's
4396e058c52e("timekeeping: Provide fast and NMI safe access to
CLOCK_MONOTONIC").

Suggested-by: Stephen Boyd <[email protected]>
Signed-off-by: Daniel Thompson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
---
kernel/time/sched_clock.c | 103 ++++++++++++++++++++++++++++++----------------
1 file changed, 68 insertions(+), 35 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 9280327676dc..a23d98c33dab 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -47,19 +47,20 @@ struct clock_read_data {
* struct clock_data - all data needed for sched_clock (including
* registration of a new clock source)
*
- * @seq: Sequence counter for protecting updates.
+ * @seq: Sequence counter for protecting updates. The lowest
+ * bit is the index for @read_data.
* @read_data: Data required to read from sched_clock.
* @wrap_kt: Duration for which clock can run before wrapping
* @rate: Tick rate of the registered clock
* @actual_read_sched_clock: Registered clock read function
*
* The ordering of this structure has been chosen to optimize cache
- * performance. In particular seq and read_data (combined) should fit
+ * performance. In particular seq and read_data[0] (combined) should fit
* into a single 64 byte cache line.
*/
struct clock_data {
seqcount_t seq;
- struct clock_read_data read_data;
+ struct clock_read_data read_data[2];
ktime_t wrap_kt;
unsigned long rate;
u64 (*actual_read_sched_clock)(void);
@@ -80,10 +81,9 @@ static u64 notrace jiffy_sched_clock_read(void)
}

static struct clock_data cd ____cacheline_aligned = {
- .read_data = { .mult = NSEC_PER_SEC / HZ,
- .read_sched_clock = jiffy_sched_clock_read, },
+ .read_data[0] = { .mult = NSEC_PER_SEC / HZ,
+ .read_sched_clock = jiffy_sched_clock_read, },
.actual_read_sched_clock = jiffy_sched_clock_read,
-
};

static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
@@ -95,10 +95,11 @@ unsigned long long notrace sched_clock(void)
{
u64 cyc, res;
unsigned long seq;
- struct clock_read_data *rd = &cd.read_data;
+ struct clock_read_data *rd;

do {
- seq = raw_read_seqcount_begin(&cd.seq);
+ seq = raw_read_seqcount(&cd.seq);
+ rd = cd.read_data + (seq & 1);

cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
rd->sched_clock_mask;
@@ -109,26 +110,50 @@ unsigned long long notrace sched_clock(void)
}

/*
+ * Updating the data required to read the clock.
+ *
+ * sched_clock will never observe mis-matched data even if called from
+ * an NMI. We do this by maintaining an odd/even copy of the data and
+ * steering sched_clock to one or the other using a sequence counter.
+ * In order to preserve the data cache profile of sched_clock as much
+ * as possible the system reverts back to the even copy when the update
+ * completes; the odd copy is used *only* during an update.
+ */
+static void update_clock_read_data(struct clock_read_data *rd)
+{
+ /* update the backup (odd) copy with the new data */
+ cd.read_data[1] = *rd;
+
+ /* steer readers towards the odd copy */
+ raw_write_seqcount_latch(&cd.seq);
+
+ /* now its safe for us to update the normal (even) copy */
+ cd.read_data[0] = *rd;
+
+ /* switch readers back to the even copy */
+ raw_write_seqcount_latch(&cd.seq);
+}
+
+/*
* Atomically update the sched_clock epoch.
*/
static void update_sched_clock(void)
{
- unsigned long flags;
u64 cyc;
u64 ns;
- struct clock_read_data *rd = &cd.read_data;
+ struct clock_read_data rd;
+
+ rd = cd.read_data[0];

cyc = cd.actual_read_sched_clock();
- ns = rd->epoch_ns +
- cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
- rd->mult, rd->shift);
-
- raw_local_irq_save(flags);
- raw_write_seqcount_begin(&cd.seq);
- rd->epoch_ns = ns;
- rd->epoch_cyc = cyc;
- raw_write_seqcount_end(&cd.seq);
- raw_local_irq_restore(flags);
+ ns = rd.epoch_ns +
+ cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask,
+ rd.mult, rd.shift);
+
+ rd.epoch_ns = ns;
+ rd.epoch_cyc = cyc;
+
+ update_clock_read_data(&rd);
}

static enum hrtimer_restart sched_clock_poll(struct hrtimer *hrt)
@@ -145,7 +170,7 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
u32 new_mult, new_shift;
unsigned long r;
char r_unit;
- struct clock_read_data *rd = &cd.read_data;
+ struct clock_read_data rd;

if (cd.rate > rate)
return;
@@ -162,22 +187,23 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask);
cd.wrap_kt = ns_to_ktime(wrap - (wrap >> 3));

+ rd = cd.read_data[0];
+
/* update epoch for new counter and update epoch_ns from old counter*/
new_epoch = read();
cyc = cd.actual_read_sched_clock();
- ns = rd->epoch_ns +
- cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
- rd->mult, rd->shift);
+ ns = rd.epoch_ns +
+ cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask,
+ rd.mult, rd.shift);
cd.actual_read_sched_clock = read;

- raw_write_seqcount_begin(&cd.seq);
- rd->read_sched_clock = read;
- rd->sched_clock_mask = new_mask;
- rd->mult = new_mult;
- rd->shift = new_shift;
- rd->epoch_cyc = new_epoch;
- rd->epoch_ns = ns;
- raw_write_seqcount_end(&cd.seq);
+ rd.read_sched_clock = read;
+ rd.sched_clock_mask = new_mask;
+ rd.mult = new_mult;
+ rd.shift = new_shift;
+ rd.epoch_cyc = new_epoch;
+ rd.epoch_ns = ns;
+ update_clock_read_data(&rd);

r = rate;
if (r >= 4000000) {
@@ -227,15 +253,22 @@ void __init sched_clock_postinit(void)
*
* This function makes it appear to sched_clock() as if the clock
* stopped counting at its last update.
+ *
+ * This function must only be called from the critical
+ * section in sched_clock(). It relies on the read_seqcount_retry()
+ * at the end of the critical section to be sure we observe the
+ * correct copy of epoch_cyc.
*/
static u64 notrace suspended_sched_clock_read(void)
{
- return cd.read_data.epoch_cyc;
+ unsigned long seq = raw_read_seqcount(&cd.seq);
+
+ return cd.read_data[seq & 1].epoch_cyc;
}

static int sched_clock_suspend(void)
{
- struct clock_read_data *rd = &cd.read_data;
+ struct clock_read_data *rd = &cd.read_data[0];

update_sched_clock();
hrtimer_cancel(&sched_clock_timer);
@@ -245,7 +278,7 @@ static int sched_clock_suspend(void)

static void sched_clock_resume(void)
{
- struct clock_read_data *rd = &cd.read_data;
+ struct clock_read_data *rd = &cd.read_data[0];

rd->epoch_cyc = cd.actual_read_sched_clock();
hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
--
2.1.0

2015-02-09 01:28:08

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] sched_clock: Optimize cache line usage

On Sun, Feb 08, 2015 at 12:02:37PM +0000, Daniel Thompson wrote:
> Currently sched_clock(), a very hot code path, is not optimized to
> minimise its cache profile. In particular:
>
> 1. cd is not ____cacheline_aligned,
>
> 2. struct clock_data does not distinguish between hotpath and
> coldpath data, reducing locality of reference in the hotpath,
>
> 3. Some hotpath data is missing from struct clock_data and is marked
> __read_mostly (which more or less guarantees it will not share a
> cache line with cd).
>
> This patch corrects these problems by extracting all hotpath data
> into a separate structure and using ____cacheline_aligned to ensure
> the hotpath uses a single (64 byte) cache line.

Have you got any performance figures for this change, or is this just a
theoretical optimisation? It would be interesting to see what effect this
has on systems with 32-byte cachelines and also scenarios where there's
contention on the sequence counter.

Will

2015-02-09 09:47:25

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] sched_clock: Optimize cache line usage


On 09/02/15 09:28, Will Deacon wrote:
> On Sun, Feb 08, 2015 at 12:02:37PM +0000, Daniel Thompson wrote:
>> Currently sched_clock(), a very hot code path, is not optimized to
>> minimise its cache profile. In particular:
>>
>> 1. cd is not ____cacheline_aligned,
>>
>> 2. struct clock_data does not distinguish between hotpath and
>> coldpath data, reducing locality of reference in the hotpath,
>>
>> 3. Some hotpath data is missing from struct clock_data and is marked
>> __read_mostly (which more or less guarantees it will not share a
>> cache line with cd).
>>
>> This patch corrects these problems by extracting all hotpath data
>> into a separate structure and using ____cacheline_aligned to ensure
>> the hotpath uses a single (64 byte) cache line.
>
> Have you got any performance figures for this change, or is this just a
> theoretical optimisation? It would be interesting to see what effect this
> has on systems with 32-byte cachelines and also scenarios where there's
> contention on the sequence counter.

Most of my testing has focused on proving the NMI safety parts of the
patch work as advertised so its mostly theoretical.

However there are some numbers from simple tight loop calls to
sched_clock (Stephen Boyd's results are more interesting than mine
because I observe pretty wild quantization effects that render the
results hard to trust):
http://thread.gmane.org/gmane.linux.kernel/1871157/focus=1879265

Not sure what useful figures would be useful for a contended sequence
counter. Firstly the counter is taken for write at 7/8 wrap time of the
times so even for the fastest timers the interval is likely to be >3s
and is very short duration. Additionally, the NMI safety changes make it
possible to read the timer whilst it is being updated so it is only
during the very short struct-copy/write/struct-copy/write update
sequence that we will observe the extra cache line used for a read.
Benchmarks that show the effect of update are therefore non-trivial to
construct.

2015-02-10 02:37:41

by Stephen Boyd

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] sched_clock: Optimize cache line usage

On 02/09/15 01:47, Daniel Thompson wrote:
> On 09/02/15 09:28, Will Deacon wrote:
>> On Sun, Feb 08, 2015 at 12:02:37PM +0000, Daniel Thompson wrote:
>>> Currently sched_clock(), a very hot code path, is not optimized to
>>> minimise its cache profile. In particular:
>>>
>>> 1. cd is not ____cacheline_aligned,
>>>
>>> 2. struct clock_data does not distinguish between hotpath and
>>> coldpath data, reducing locality of reference in the hotpath,
>>>
>>> 3. Some hotpath data is missing from struct clock_data and is marked
>>> __read_mostly (which more or less guarantees it will not share a
>>> cache line with cd).
>>>
>>> This patch corrects these problems by extracting all hotpath data
>>> into a separate structure and using ____cacheline_aligned to ensure
>>> the hotpath uses a single (64 byte) cache line.
>> Have you got any performance figures for this change, or is this just a
>> theoretical optimisation? It would be interesting to see what effect this
>> has on systems with 32-byte cachelines and also scenarios where there's
>> contention on the sequence counter.
> Most of my testing has focused on proving the NMI safety parts of the
> patch work as advertised so its mostly theoretical.
>
> However there are some numbers from simple tight loop calls to
> sched_clock (Stephen Boyd's results are more interesting than mine
> because I observe pretty wild quantization effects that render the
> results hard to trust):
> http://thread.gmane.org/gmane.linux.kernel/1871157/focus=1879265
>
> Not sure what useful figures would be useful for a contended sequence
> counter. Firstly the counter is taken for write at 7/8 wrap time of the
> times so even for the fastest timers the interval is likely to be >3s
> and is very short duration. Additionally, the NMI safety changes make it
> possible to read the timer whilst it is being updated so it is only
> during the very short struct-copy/write/struct-copy/write update
> sequence that we will observe the extra cache line used for a read.
> Benchmarks that show the effect of update are therefore non-trivial to
> construct.
>

Here's the raw numbers for the tight loop. I noticed that if I don't use
perf I get a larger number of calls per 10s, most likely because we
aren't doing anything else. These are very lightly loaded systems, i.e.
busybox ramdisk with nothing going on. Kernel version is v3.19-rc4. The
CPU is Krait on msm8960 and msm8974, except on msm8974 it has the arm
architected timers backing sched_clock() vs. our own custom timer IP on
msm8960. The cache line size is 64 bytes. I also ran it on msm8660 which
is a Scorpion CPU with the same timer as msm8960 (custom timer IP) and a
cache line size of 32 bytes. Unfortunately nobody has ported Scorpion
over to perf events, so we don't hardware events.

msm8960 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14528449 calls in 10000000290 ns
Made 14528925 calls in 10000000142 ns
Made 14524549 calls in 10000000587 ns
Made 14528164 calls in 10000000734 ns
Made 14524468 calls in 10000000290 ns
Made 14527198 calls in 10000000438 ns
Made 14523508 calls in 10000000734 ns
Made 14527894 calls in 10000000290 ns
Made 14529609 calls in 10000000734 ns
Made 14523114 calls in 10000000142 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10009.635016 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 16.16% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003806350 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
921924235 instructions # 0.23 insns per cycle ( +- 0.01% )
0 branches # 0.000 K/sec
58521151 branch-misses # 5.846 M/sec ( +- 0.01% )

10.011767657 seconds time elapsed ( +- 0.00% )

msm8960 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 19626366 calls in 10000000587 ns
Made 19623708 calls in 10000000142 ns
Made 19623282 calls in 10000000290 ns
Made 19625304 calls in 10000000290 ns
Made 19625151 calls in 10000000291 ns
Made 19624906 calls in 10000000290 ns
Made 19625383 calls in 10000000143 ns
Made 19625235 calls in 10000000290 ns
Made 19624969 calls in 10000000290 ns
Made 19625209 calls in 10000000438 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10009.883401 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 15.88% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003901511 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1164635790 instructions # 0.29 insns per cycle ( +- 0.00% )
0 branches # 0.000 K/sec
20039814 branch-misses # 2.002 M/sec ( +- 0.00% )

10.012092383 seconds time elapsed ( +- 0.00% )

msm8974 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289694 calls in 10000000083 ns
Made 21289072 calls in 10000000082 ns
Made 21289550 calls in 10000000395 ns
Made 21288892 calls in 10000000291 ns
Made 21288987 calls in 10000000135 ns
Made 21289140 calls in 10000000395 ns
Made 21289161 calls in 10000000395 ns
Made 21288911 calls in 10000000239 ns
Made 21289204 calls in 10000000135 ns
Made 21288738 calls in 10000000135 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10003.839348 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
4 context-switches # 0.000 K/sec ( +- 3.70% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6146323757 cycles # 0.614 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1155527762 instructions # 0.19 insns per cycle ( +- 0.00% )
107186099 branches # 10.714 M/sec ( +- 0.00% )
35548359 branch-misses # 33.17% of all branches ( +- 0.00% )

10.004769053 seconds time elapsed ( +- 0.00% )

msm8974 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289357 calls in 10000000239 ns
Made 21384961 calls in 10000000396 ns
Made 22105925 calls in 10000000238 ns
Made 27384126 calls in 10000000239 ns
Made 22107737 calls in 10000000134 ns
Made 21368867 calls in 10000000239 ns
Made 22106065 calls in 10000000395 ns
Made 27384196 calls in 10000000083 ns
Made 22107334 calls in 10000000291 ns
Made 21365426 calls in 10000000291 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10003.753333 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 18.18% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6837664600 cycles # 0.684 GHz ( +- 6.74% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1148993903 instructions # 0.17 insns per cycle ( +- 3.32% )
115049358 branches # 11.501 M/sec ( +- 3.31% )
42520513 branch-misses # 36.96% of all branches ( +- 5.00% )

10.004769533 seconds time elapsed ( +- 0.00% )

msm8660 (before patch)
----------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14099029 calls in 10000000586 ns
Made 14099227 calls in 10000000735 ns
Made 14098763 calls in 10000000439 ns
Made 14099042 calls in 10000000291 ns
Made 14099273 calls in 10000000290 ns
Made 14100377 calls in 10000000586 ns
Made 14100183 calls in 10000000586 ns
Made 14099220 calls in 10000000586 ns
Made 14098853 calls in 10000000587 ns
Made 14099368 calls in 10000000142 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10006.700528 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
11 context-switches # 0.001 K/sec ( +- 10.38% )
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec

10.008796161 seconds time elapsed ( +- 0.00% )

msm8660 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 20555901 calls in 10000000438 ns
Made 15510019 calls in 10000000142 ns
Made 15510371 calls in 10000000587 ns
Made 15509184 calls in 10000000439 ns
Made 15509068 calls in 10000000291 ns
Made 15510719 calls in 10000000439 ns
Made 15508899 calls in 10000000291 ns
Made 15509206 calls in 10000000587 ns
Made 15509057 calls in 10000000290 ns
Made 15509178 calls in 10000000735 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10009.491416 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
13 context-switches # 0.001 K/sec ( +- 10.82% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec

10.011834087 seconds time elapsed ( +- 0.00% )

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2015-02-13 03:49:42

by Stephen Boyd

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] sched_clock: Optimize and avoid deadlock during read from NMI

On 02/08/15 04:02, Daniel Thompson wrote:
> This patchset optimizes the generic sched_clock implementation by
> removing branches and significantly reducing the data cache profile. It
> also makes it safe to call sched_clock() from NMI (or FIQ on ARM).
>
> The data cache profile of sched_clock() in the original code is
> somewhere between 2 and 3 (64-byte) cache lines, depending on alignment
> of struct clock_data. After patching, the cache profile for the normal
> case should be a single cacheline.
>
> NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
> using the perf handler to check that sched_clock() returned monotonic
> values. At the same time I forcefully reduced kt_wrap so that
> update_sched_clock() is being called at >1000Hz.
>
> Without the patches the above system is grossly unstable, surviving
> [9K,115K,25K] perf event cycles during three separate runs. With the
> patch I ran for over 9M perf event cycles before getting bored.
>

Looks good to me. For the series:

Reviewed-by: Stephen Boyd <[email protected]>

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project