2023-01-06 06:37:54

by Guo Hui

[permalink] [raw]
Subject: [PATCH] timekeeping:add padding in timekeeper for Unixbench pipe

When the LLC cache line size is 128 bytes, such as Kunpeng 920,
the seq attribute and xtime attribute in the structure tk_core
are completely in the same LLC cache line,
and xtime_sec is the data protected by the seq lock
in the function ktime_get_coarse_real_ts64,
so seq and xtime_sec are in the same LLC cache line
causing the false sharing problem.

Adding padding before xtime_sec in the structure timekeeper
is based on the comment of the structure tk_read_base: "This
struct has size 56 byte on 64 bit. Together with a seqcount
it occupies a single 64byte cache line." Therefore,
seq and the structure tk_read_base
should be placed in the same 64-byte cacheline.

The performance data of Unixbench pipe on Kunpeng 920 is as follows:

Enable the LSE instruction:
seq and xtime are in the same LLC cache line:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe Throughput 12440.0 14800574.4 11897.6
Pipe-based Context Switching 4000.0 4357419.0 10893.5
========
System Benchmarks Index Score (Partial Only) 11384.5

seq and xtime are not in the same LLC cache line:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe Throughput 12440.0 16546306.6 13300.9
Pipe-based Context Switching 4000.0 5654281.8 14135.7
========
System Benchmarks Index Score (Partial Only) 13711.9

When the LSE instruction is enabled,
Pipe Throughput increases by 11.79%,
and Pipe-based Context Switching increases by 29.76%.

Close the LSE instruction:
seq and xtime are in the same LLC cache line:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe Throughput 12440.0 36375286.5 29240.6
Pipe-based Context Switching 4000.0 11994739.7 29986.8
========
System Benchmarks Index Score (Partial Only) 29611.4

seq and xtime are not in the same LLC cache line:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe Throughput 12440.0 44887148.8 36082.9
Pipe-based Context Switching 4000.0 13666392.0 34166.0
========
System Benchmarks Index Score (Partial Only) 35111.4

When the LSE instruction is disabled,
Pipe Throughput increases by 23.40%,
and Pipe-based Context Switching increases by 13.94%.

Signed-off-by: Guo Hui <[email protected]>
---
include/linux/timekeeper_internal.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 84ff2844d..d363cd1f3 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -92,6 +92,7 @@ struct tk_read_base {
struct timekeeper {
struct tk_read_base tkr_mono;
struct tk_read_base tkr_raw;
+ u64 padding;
u64 xtime_sec;
unsigned long ktime_sec;
struct timespec64 wall_to_monotonic;
--
2.20.1


2023-01-09 15:33:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] timekeeping:add padding in timekeeper for Unixbench pipe

Guo!

On Fri, Jan 06 2023 at 14:29, Guo Hui wrote:
> When the LLC cache line size is 128 bytes, such as Kunpeng 920,
> the seq attribute and xtime attribute in the structure tk_core
> are completely in the same LLC cache line,
> and xtime_sec is the data protected by the seq lock
> in the function ktime_get_coarse_real_ts64,
> so seq and xtime_sec are in the same LLC cache line
> causing the false sharing problem.

What exactly causes the alleged false sharing problem?

> Adding padding before xtime_sec in the structure timekeeper
> is based on the comment of the structure tk_read_base: "This
> struct has size 56 byte on 64 bit. Together with a seqcount
> it occupies a single 64byte cache line." Therefore,
> seq and the structure tk_read_base
> should be placed in the same 64-byte cacheline.

How is that relevant? They _are_ in the same cacheline independent of
your padding, no?
Offset Size
seqcount_raw_spinlock_t seq; /* 0 4 */
struct timekeeper timekeeper; /* 8 280 */
struct tk_read_base tkr_mono; /* 8 56 */

8 + 56 = 64 which is also the case with your padding.

If your false sharing thing exists then all other timekeeper read
functions which only use

tk_core.seq and tk_core.timekeeper.tkr_mono

have the very same false sharing problem because seq and data are in the
same cache line, no?

Aside of that, for architectures with 64 byte cache line size, your
change is fundamentally bad. Why?

It moves xtime_sec to offset 128 which means seq and xtime_sec are not
longer in consecutive cache lines.

If you change core data structures for the benefit of your platform,
then you have to provide proof that it does not cause any issues on
other architectures and platforms.

Thanks,

tglx