2018-10-11 00:35:41

by Eric Dumazet

Subject: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

While looking at native_sched_clock() disassembly I had
the surprise to see the compiler (gcc 7.3 here) had
optimized out the loop, meaning the code is broken.

Using the documented and approved API not only fixes the bug,
it also makes the code more readable.

Replacing five this_cpu_read() by one this_cpu_ptr() makes
the generated code smaller.

Fixes: 59eaef78bfea ("x86/tsc: Remodel cyc2ns to use seqcount_latch()")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/x86/kernel/tsc.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b52bd2b6cdb443ba0c89d78aaa52b02b82a10b6e..b039ae0f358a4f78cc830c069ad90e4fb108489e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -60,19 +60,16 @@ static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
 
 void cyc2ns_read_begin(struct cyc2ns_data *data)
 {
-	int seq, idx;
+	const struct cyc2ns *c2n;
+	int seq;
 
 	preempt_disable_notrace();
 
+	c2n = this_cpu_ptr(&cyc2ns);
 	do {
-		seq = this_cpu_read(cyc2ns.seq.sequence);
-		idx = seq & 1;
-
-		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
-		data->cyc2ns_mul = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
-		data->cyc2ns_shift = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
-
-	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
+		seq = raw_read_seqcount_latch(&c2n->seq);
+		*data = c2n->data[seq & 1];
+	} while (read_seqcount_retry(&c2n->seq, seq));
 }
 
 void cyc2ns_read_end(void)
--
2.19.0.605.g01d371f741-goog



2018-10-11 08:13:51

by Peter Zijlstra

Subject: Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> While looking at native_sched_clock() disassembly I had
> the surprise to see the compiler (gcc 7.3 here) had
> optimized out the loop, meaning the code is broken.
>
> Using the documented and approved API not only fixes the bug,
> it also makes the code more readable.
>
> Replacing five this_cpu_read() by one this_cpu_ptr() makes
> the generated code smaller.

Does not for me, that is, the resulting asm is actually larger

You're quite right the loop went missing; no idea wth that compiler is
smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
will return the same value. But this_cpu_read() is an asm() statement,
it _should_ not assume such.

We assume that this_cpu_read() implies READ_ONCE() in a number of
locations, this really should not happen.
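A minimal stand-alone sketch of that failure mode (user space, not the kernel code): an asm() with outputs but no volatile, fed the same "m" input twice with nothing writing it in between, is fair game for CSE, and once the two reads are merged the retry loop folds away:

static unsigned int seq;
static unsigned long payload;

static inline unsigned int read_seq(void)
{
	unsigned int ret;

	/* like this_cpu_read(): asm() with an output, no volatile */
	asm("movl %1, %0" : "=r" (ret) : "m" (seq));
	return ret;
}

unsigned long snapshot(void)
{
	unsigned int s1, s2;
	unsigned long data;

	do {
		s1 = read_seq();
		data = payload;		/* stands in for copying cyc2ns_data */
		s2 = read_seq();	/* identical asm, same input: may be CSE'd into s1 */
	} while (s1 != s2);		/* ...after which the whole loop folds away */

	return data;
}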

The reason it was written using this_cpu_read() is so that it can use
%gs: prefixed instructions and avoid ever loading that percpu offset and
doing manual address computation.
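Schematically (a hypothetical sketch with made-up names, not the tsc.c code), the difference between the two access styles is:

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned int, demo_seq);	/* hypothetical per-cpu variable */

/* this_cpu_read(): one %gs:-prefixed load, no address computation. */
static unsigned int read_direct(void)
{
	return this_cpu_read(demo_seq);
}

/*
 * this_cpu_ptr(): materialize the per-cpu address (a this_cpu_off load
 * plus an add), then use ordinary loads; caller keeps preemption off.
 */
static unsigned int read_via_ptr(void)
{
	unsigned int *p = this_cpu_ptr(&demo_seq);

	return *p;
}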

Let me prod at this with a sharp stick.

2018-10-11 08:48:33

by Peter Zijlstra

Subject: Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

On Thu, Oct 11, 2018 at 09:31:33AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at native_sched_clock() disassembly I had
> > the surprise to see the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> >
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> >
> > Replacing five this_cpu_read() by one this_cpu_ptr() makes
> > the generated code smaller.
>
> Does not for me, that is, the resulting asm is actually larger
>
> You're quite right the loop went missing; no idea wth that compiler is
> smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
> think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
> will return the same value. But this_cpu_read() is an asm() statement,
> it _should_ not assume such.
>
> We assume that this_cpu_read() implies READ_ONCE() in a number of
> locations, this really should not happen.
>
> The reason it was written using this_cpu_read() is so that it can use
> %gs: prefixed instructions and avoid ever loading that percpu offset and
> doing manual address computation.
>
> Let me prod at this with a sharp stick.

OK, so apart from the inlining being crap, which is fixed by:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6490f618e096..638491062fea 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -57,7 +57,7 @@ struct cyc2ns {
 
 static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
 
-void cyc2ns_read_begin(struct cyc2ns_data *data)
+void __always_inline cyc2ns_read_begin(struct cyc2ns_data *data)
 {
 	int seq, idx;
 
@@ -74,7 +74,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
 	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }
 
-void cyc2ns_read_end(void)
+void __always_inline cyc2ns_read_end(void)
 {
 	preempt_enable_notrace();
 }
@@ -103,7 +103,7 @@ void cyc2ns_read_end(void)
 * [email protected] "math is hard, lets go shopping!"
 */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
 	struct cyc2ns_data data;
 	unsigned long long ns;


That gets us:

native_sched_clock:
pushq %rbp #
movq %rsp, %rbp #,

... jump label ...

rdtsc
salq $32, %rdx #, tmp110
orq %rax, %rdx # low, tmp110
movl %gs:cyc2ns+32(%rip),%ecx # cyc2ns.seq.sequence, pfo_ret__
andl $1, %ecx #, idx
salq $4, %rcx #, tmp116
movl %gs:cyc2ns(%rcx),%esi # cyc2ns.data[idx_14].cyc2ns_mul, pfo_ret__
movl %esi, %esi # pfo_ret__, pfo_ret__
movq %rsi, %rax # pfo_ret__, tmp133
mulq %rdx # _23
movq %gs:cyc2ns+8(%rcx),%rdi # cyc2ns.data[idx_14].cyc2ns_offset, pfo_ret__
addq $cyc2ns, %rcx #, tmp117
movl %gs:4(%rcx),%ecx # cyc2ns.data[idx_14].cyc2ns_shift, pfo_ret__
shrdq %rdx, %rax # pfo_ret__,, tmp134
shrq %cl, %rdx # pfo_ret__,
testb $64, %cl #, pfo_ret__
cmovne %rdx, %rax #,, tmp134
addq %rdi, %rax # pfo_ret__, <retval>
popq %rbp #
ret


If we then fix the percpu mess, with:

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index e9202a0de8f0..1a19d11cfbbd 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -185,22 +185,22 @@ do { \
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b "__percpu_arg(1)",%0"		\
+		asm volatile(op "b "__percpu_arg(1)",%0"\
 		    : "=q" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 2:						\
-		asm(op "w "__percpu_arg(1)",%0"		\
+		asm volatile(op "w "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 4:						\
-		asm(op "l "__percpu_arg(1)",%0"		\
+		asm volatile(op "l "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 8:						\
-		asm(op "q "__percpu_arg(1)",%0"		\
+		asm volatile(op "q "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\


That turns into:

native_sched_clock:
pushq %rbp #
movq %rsp, %rbp #,

... jump label ...

rdtsc
salq $32, %rdx #, tmp110
orq %rax, %rdx # low, tmp110
movq %rdx, %r10 # tmp110, _23
movq $cyc2ns, %r9 #, tmp136
.L235:
movl %gs:cyc2ns+32(%rip),%eax # cyc2ns.seq.sequence, pfo_ret__
movl %eax, %ecx # pfo_ret__, idx
andl $1, %ecx #, idx
salq $4, %rcx #, tmp116
addq %r9, %rcx # tmp136, tmp117
movq %gs:8(%rcx),%rdi # cyc2ns.data[idx_14].cyc2ns_offset, pfo_ret__
movl %gs:(%rcx),%esi # cyc2ns.data[idx_14].cyc2ns_mul, pfo_ret__
movl %gs:4(%rcx),%ecx # cyc2ns.data[idx_14].cyc2ns_shift, pfo_ret__
movl %gs:cyc2ns+32(%rip),%r8d # cyc2ns.seq.sequence, pfo_ret__
cmpl %r8d, %eax # pfo_ret__, pfo_ret__
jne .L235 #,
movl %esi, %esi # pfo_ret__, pfo_ret__
movq %rsi, %rax # pfo_ret__, tmp133
mulq %r10 # _23
shrdq %rdx, %rax # pfo_ret__,, tmp134
shrq %cl, %rdx # pfo_ret__,
testb $64, %cl #, pfo_ret__
cmovne %rdx, %rax #,, tmp134
addq %rdi, %rax # pfo_ret__, <retval>
popq %rbp #
ret

which is exactly right. Except perhaps for the mess that
mul_u64_u32_shr() turns into.
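(For reference, mul_u64_u32_shr() computes (a * mul) >> shift; a rough sketch of the generic CONFIG_ARCH_SUPPORTS_INT128 variant in include/linux/math64.h is below. The shrd/shr/test/cmovne tail above is just the compiler open-coding a variable-count 128-bit right shift.)

static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
{
	/* 64x32 -> up to 96-bit product, then a runtime-count right shift */
	return (u64)(((unsigned __int128)a * mul) >> shift);
}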

2018-10-11 15:20:43

by Eric Dumazet

Subject: Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

On Thu, Oct 11, 2018 at 12:31 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at native_sched_clock() disassembly I had
> > the surprise to see the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> >
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> >
> > Replacing five this_cpu_read() by one this_cpu_ptr() makes
> > the generated code smaller.
>
> Does not for me, that is, the resulting asm is actually larger

[resent in non-HTML]

I counted the number of bytes of text, and the code size after my
patch really is smaller.

%gs prefixes are one byte, but the 32-bit offsets add up.
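For example, in the dumps below the %gs-relative rip-based load of the
sequence counter encodes in 7 bytes, while the register-relative load in
the new version takes 4:

65 8b 05 00 00 00 00	mov %gs:0x0(%rip),%eax	# gs prefix + opcode + modrm + disp32
41 8b 40 20		mov 0x20(%r8),%eax	# REX prefix + opcode + modrm + disp8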

Prior version (with the loop reinstated, thanks to one smp_rmb()):

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b52bd2b6cdb443ba0c89d78aaa52b02b82a10b6e..d4ad30bbaa0cc8bf81da0272829da26f229734bf 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -71,7 +71,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
 		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
 		data->cyc2ns_mul = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
 		data->cyc2ns_shift = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
-
+		smp_rmb();
 	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }


Total length : 0xA7 bytes

00000000000002a0 <native_sched_clock>:
2a0: 4c 8d 54 24 08 lea 0x8(%rsp),%r10
2a5: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
2a9: 41 ff 72 f8 pushq -0x8(%r10)
2ad: 55 push %rbp
2ae: 48 89 e5 mov %rsp,%rbp
2b1: 41 52 push %r10
2b3: e9 60 00 00 00 jmpq 318 <native_sched_clock+0x78>
2b8: 0f 31 rdtsc
2ba: 48 c1 e2 20 shl $0x20,%rdx
2be: 48 09 c2 or %rax,%rdx
2c1: 49 89 d2 mov %rdx,%r10
2c4: 49 c7 c1 00 00 00 00 mov $0x0,%r9
2c7: R_X86_64_32S .data..percpu..shared_aligned
2cb: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 2d2 <native_sched_clock+0x32>
2ce: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
2d2: 89 c1 mov %eax,%ecx
2d4: 83 e1 01 and $0x1,%ecx
2d7: 48 c1 e1 04 shl $0x4,%rcx
2db: 4c 01 c9 add %r9,%rcx
2de: 65 48 8b 79 08 mov %gs:0x8(%rcx),%rdi
2e3: 65 8b 31 mov %gs:(%rcx),%esi
2e6: 65 8b 49 04 mov %gs:0x4(%rcx),%ecx
2ea: 65 44 8b 05 00 00 00 mov %gs:0x0(%rip),%r8d # 2f2 <native_sched_clock+0x52>
2f1: 00
2ee: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
2f2: 44 39 c0 cmp %r8d,%eax
2f5: 75 d4 jne 2cb <native_sched_clock+0x2b>
2f7: 89 f6 mov %esi,%esi
2f9: 48 89 f0 mov %rsi,%rax
2fc: 49 f7 e2 mul %r10
2ff: 48 0f ad d0 shrd %cl,%rdx,%rax
303: 48 d3 ea shr %cl,%rdx
306: f6 c1 40 test $0x40,%cl
309: 48 0f 45 c2 cmovne %rdx,%rax
30d: 48 01 f8 add %rdi,%rax
310: 41 5a pop %r10
312: 5d pop %rbp
313: 49 8d 62 f8 lea -0x8(%r10),%rsp
317: c3 retq
318: 48 69 05 00 00 00 00 imul $0x3d0900,0x0(%rip),%rax # 323 <native_sched_clock+0x83>
31f: 00 09 3d 00
31b: R_X86_64_PC32 jiffies_64-0x8
323: 41 5a pop %r10
325: 48 ba 00 b8 64 d9 45 movabs $0xffc2f745d964b800,%rdx
32c: f7 c2 ff
32f: 5d pop %rbp
330: 49 8d 62 f8 lea -0x8(%r10),%rsp
334: 48 01 d0 add %rdx,%rax
337: c3 retq

New version (my cleanup patch)

Total length = 0x91 bytes

00000000000002a0 <native_sched_clock>:
2a0: 4c 8d 54 24 08 lea 0x8(%rsp),%r10
2a5: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
2a9: 41 ff 72 f8 pushq -0x8(%r10)
2ad: 55 push %rbp
2ae: 48 89 e5 mov %rsp,%rbp
2b1: 41 52 push %r10
2b3: e9 59 00 00 00 jmpq 311 <native_sched_clock+0x71>
2b8: 0f 31 rdtsc
2ba: 48 c1 e2 20 shl $0x20,%rdx
2be: 48 09 c2 or %rax,%rdx
2c1: 49 89 d1 mov %rdx,%r9
2c4: 49 c7 c0 00 00 00 00 mov $0x0,%r8
2c7: R_X86_64_32S .data..percpu..shared_aligned
2cb: 65 4c 03 05 00 00 00 add %gs:0x0(%rip),%r8 # 2d3 <native_sched_clock+0x33>
2d2: 00
2cf: R_X86_64_PC32 this_cpu_off-0x4
2d3: 41 8b 40 20 mov 0x20(%r8),%eax
2d7: 89 c6 mov %eax,%esi
2d9: 83 e6 01 and $0x1,%esi
2dc: 48 c1 e6 04 shl $0x4,%rsi
2e0: 4c 01 c6 add %r8,%rsi
2e3: 8b 3e mov (%rsi),%edi
2e5: 8b 4e 04 mov 0x4(%rsi),%ecx
2e8: 48 8b 76 08 mov 0x8(%rsi),%rsi
2ec: 41 3b 40 20 cmp 0x20(%r8),%eax
2f0: 75 e1 jne 2d3 <native_sched_clock+0x33>
2f2: 48 89 f8 mov %rdi,%rax
2f5: 49 f7 e1 mul %r9
2f8: 48 0f ad d0 shrd %cl,%rdx,%rax
2fc: 48 d3 ea shr %cl,%rdx
2ff: f6 c1 40 test $0x40,%cl
302: 48 0f 45 c2 cmovne %rdx,%rax
306: 48 01 f0 add %rsi,%rax
309: 41 5a pop %r10
30b: 5d pop %rbp
30c: 49 8d 62 f8 lea -0x8(%r10),%rsp
310: c3 retq
311: 48 69 05 00 00 00 00 imul $0x3d0900,0x0(%rip),%rax # 31c <native_sched_clock+0x7c>
318: 00 09 3d 00
314: R_X86_64_PC32 jiffies_64-0x8
31c: 41 5a pop %r10
31e: 48 ba 00 b8 64 d9 45 movabs $0xffc2f745d964b800,%rdx
325: f7 c2 ff
328: 5d pop %rbp
329: 49 8d 62 f8 lea -0x8(%r10),%rsp
32d: 48 01 d0 add %rdx,%rax
330: c3 retq

2018-10-11 15:26:16

by Peter Zijlstra

Subject: Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

On Thu, Oct 11, 2018 at 08:00:42AM -0700, Eric Dumazet wrote:
> Yes, but the code size is bigger (I have looked at the disassembly)
>
> All these %gs plus offset add up

> Total length : 0xA7 bytes

effective length: 0x78 bytes

> 00000000000002a0 <native_sched_clock>:
> 2a0: 4c 8d 54 24 08 lea 0x8(%rsp),%r10
> 2a5: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
> 2a9: 41 ff 72 f8 pushq -0x8(%r10)
> 2ad: 55 push %rbp
> 2ae: 48 89 e5 mov %rsp,%rbp
> 2b1: 41 52 push %r10

2b3: 66 66 66 66 90 k8_nop5_atomic

> 2b8: 0f 31 rdtsc
> 2ba: 48 c1 e2 20 shl $0x20,%rdx
> 2be: 48 09 c2 or %rax,%rdx
> 2c1: 49 89 d2 mov %rdx,%r10
> 2c4: 49 c7 c1 00 00 00 00 mov $0x0,%r9
> 2c7: R_X86_64_32S .data..percpu..shared_aligned
> 2cb: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 2d2 <native_sched_clock+0x32>
> 2ce: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
> 2d2: 89 c1 mov %eax,%ecx
> 2d4: 83 e1 01 and $0x1,%ecx
> 2d7: 48 c1 e1 04 shl $0x4,%rcx
> 2db: 4c 01 c9 add %r9,%rcx
> 2de: 65 48 8b 79 08 mov %gs:0x8(%rcx),%rdi
> 2e3: 65 8b 31 mov %gs:(%rcx),%esi
> 2e6: 65 8b 49 04 mov %gs:0x4(%rcx),%ecx
> 2ea: 65 44 8b 05 00 00 00 mov %gs:0x0(%rip),%r8d # 2f2 <native_sched_clock+0x52>
> 2f1: 00
> 2ee: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
> 2f2: 44 39 c0 cmp %r8d,%eax
> 2f5: 75 d4 jne 2cb <native_sched_clock+0x2b>
> 2f7: 89 f6 mov %esi,%esi
> 2f9: 48 89 f0 mov %rsi,%rax
> 2fc: 49 f7 e2 mul %r10
> 2ff: 48 0f ad d0 shrd %cl,%rdx,%rax
> 303: 48 d3 ea shr %cl,%rdx
> 306: f6 c1 40 test $0x40,%cl
> 309: 48 0f 45 c2 cmovne %rdx,%rax
> 30d: 48 01 f8 add %rdi,%rax
> 310: 41 5a pop %r10
> 312: 5d pop %rbp
> 313: 49 8d 62 f8 lea -0x8(%r10),%rsp
> 317: c3 retq
>
> New version :
>
> Total length = 0x91 bytes

effective: 0x71

>
> 00000000000002a0 <native_sched_clock>:
> 2a0: 4c 8d 54 24 08 lea 0x8(%rsp),%r10
> 2a5: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
> 2a9: 41 ff 72 f8 pushq -0x8(%r10)
> 2ad: 55 push %rbp
> 2ae: 48 89 e5 mov %rsp,%rbp
> 2b1: 41 52 push %r10

2b3: 66 66 66 66 90 k8_nop5_atomic

> 2b8: 0f 31 rdtsc
> 2ba: 48 c1 e2 20 shl $0x20,%rdx
> 2be: 48 09 c2 or %rax,%rdx
> 2c1: 49 89 d1 mov %rdx,%r9
> 2c4: 49 c7 c0 00 00 00 00 mov $0x0,%r8
> 2c7: R_X86_64_32S .data..percpu..shared_aligned
> 2cb: 65 4c 03 05 00 00 00 add %gs:0x0(%rip),%r8 # 2d3 <native_sched_clock+0x33>
> 2d2: 00
> 2cf: R_X86_64_PC32 this_cpu_off-0x4
> 2d3: 41 8b 40 20 mov 0x20(%r8),%eax
> 2d7: 89 c6 mov %eax,%esi
> 2d9: 83 e6 01 and $0x1,%esi
> 2dc: 48 c1 e6 04 shl $0x4,%rsi
> 2e0: 4c 01 c6 add %r8,%rsi
> 2e3: 8b 3e mov (%rsi),%edi
> 2e5: 8b 4e 04 mov 0x4(%rsi),%ecx
> 2e8: 48 8b 76 08 mov 0x8(%rsi),%rsi
> 2ec: 41 3b 40 20 cmp 0x20(%r8),%eax
> 2f0: 75 e1 jne 2d3 <native_sched_clock+0x33>
> 2f2: 48 89 f8 mov %rdi,%rax
> 2f5: 49 f7 e1 mul %r9
> 2f8: 48 0f ad d0 shrd %cl,%rdx,%rax
> 2fc: 48 d3 ea shr %cl,%rdx
> 2ff: f6 c1 40 test $0x40,%cl
> 302: 48 0f 45 c2 cmovne %rdx,%rax
> 306: 48 01 f0 add %rsi,%rax
> 309: 41 5a pop %r10
> 30b: 5d pop %rbp
> 30c: 49 8d 62 f8 lea -0x8(%r10),%rsp
> 310: c3 retq

Ah, right you are. But my version only touches the one cacheline,
whereas yours will do that extra cpu offset load, which might or might
not be hot.

Difficult..

You have some weird stack setup though.. mine doesn't have that:

$ objdump -dr defconfig-build/arch/x86/kernel/tsc.o | awk '/<native_sched_clock>:$/ {p=1} /^$/ {p=0} {if (p) print $0}'
0000000000000b00 <native_sched_clock>:
b00: 55 push %rbp
b01: 48 89 e5 mov %rsp,%rbp
b04: 66 66 66 66 90 k8_nop5_atomic
b09: 0f 31 rdtsc
b0b: 48 c1 e2 20 shl $0x20,%rdx
b0f: 48 09 c2 or %rax,%rdx
b12: 49 89 d2 mov %rdx,%r10
b15: 49 c7 c1 00 00 00 00 mov $0x0,%r9
b18: R_X86_64_32S .data..percpu..shared_aligned
b1c: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # b23 <native_sched_clock+0x23>
b1f: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
b23: 89 c1 mov %eax,%ecx
b25: 83 e1 01 and $0x1,%ecx
b28: 48 c1 e1 04 shl $0x4,%rcx
b2c: 4c 01 c9 add %r9,%rcx
b2f: 65 48 8b 79 08 mov %gs:0x8(%rcx),%rdi
b34: 65 8b 31 mov %gs:(%rcx),%esi
b37: 65 8b 49 04 mov %gs:0x4(%rcx),%ecx
b3b: 65 44 8b 05 00 00 00 mov %gs:0x0(%rip),%r8d # b43 <native_sched_clock+0x43>
b42: 00
b3f: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
b43: 44 39 c0 cmp %r8d,%eax
b46: 75 d4 jne b1c <native_sched_clock+0x1c>
b48: 89 f6 mov %esi,%esi
b4a: 48 89 f0 mov %rsi,%rax
b4d: 49 f7 e2 mul %r10
b50: 48 0f ad d0 shrd %cl,%rdx,%rax
b54: 48 d3 ea shr %cl,%rdx
b57: f6 c1 40 test $0x40,%cl
b5a: 48 0f 45 c2 cmovne %rdx,%rax
b5e: 48 01 f8 add %rdi,%rax
b61: 5d pop %rbp
b62: c3 retq

Which gets me to 0x62 effective bytes.




2018-10-11 23:28:55

by kernel test robot

Subject: Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on v4.19-rc7 next-20181011]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Eric-Dumazet/x86-tsc-use-real-seqcount_latch-in-cyc2ns_read_begin/20181012-065625
config: x86_64-randconfig-x000-201840 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All warnings (new ones prefixed by >>):

arch/x86/kernel/tsc.c: In function 'cyc2ns_read_begin':
>> arch/x86/kernel/tsc.c:69:33: warning: passing argument 1 of 'raw_read_seqcount_latch' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
seq = raw_read_seqcount_latch(&c2n->seq);
^
In file included from include/linux/time.h:6:0,
from include/linux/ktime.h:24,
from include/linux/timer.h:6,
from include/linux/workqueue.h:9,
from include/linux/rhashtable-types.h:15,
from include/linux/ipc.h:7,
from include/uapi/linux/sem.h:5,
from include/linux/sem.h:5,
from include/linux/sched.h:15,
from arch/x86/kernel/tsc.c:4:
include/linux/seqlock.h:279:19: note: expected 'seqcount_t * {aka struct seqcount *}' but argument is of type 'const seqcount_t * {aka const struct seqcount *}'
static inline int raw_read_seqcount_latch(seqcount_t *s)
^~~~~~~~~~~~~~~~~~~~~~~

vim +69 arch/x86/kernel/tsc.c

59
60 void cyc2ns_read_begin(struct cyc2ns_data *data)
61 {
62 const struct cyc2ns *c2n;
63 int seq;
64
65 preempt_disable_notrace();
66
67 c2n = this_cpu_ptr(&cyc2ns);
68 do {
> 69 seq = raw_read_seqcount_latch(&c2n->seq);
70 *data = c2n->data[seq & 1];
71 } while (read_seqcount_retry(&c2n->seq, seq));
72 }
73

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation

