2020-10-30 15:24:34

by Peter Zijlstra

Subject: [PATCH 4/6] perf: Optimize get_recursion_context()

"Look ma, no branches!"

Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/internal.h | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)

--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -205,16 +205,15 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

 static inline int get_recursion_context(int *recursion)
 {
-        int rctx;
+        unsigned int pc = preempt_count();
+        unsigned int rctx = 0;

-        if (unlikely(in_nmi()))
-                rctx = 3;
-        else if (in_irq())
-                rctx = 2;
-        else if (in_serving_softirq())
-                rctx = 1;
-        else
-                rctx = 0;
+        if (pc & (NMI_MASK))
+                rctx++;
+        if (pc & (NMI_MASK | HARDIRQ_MASK))
+                rctx++;
+        if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
+                rctx++;

         if (recursion[rctx])
                 return -1;


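A standalone, compilable sketch of the trick: each test uses a strictly
wider mask, so a context that passes a narrow test also passes every
wider one, and the increments sum to the context level (NMI 3, hardirq
2, softirq 1, task 0). The mask values below are illustrative,
mirroring the usual linux/preempt.h layout (softirq-serving at bit 8,
hardirq at bits 16-19, NMI at bits 20-23); the kernel's real constants
are config-dependent.

#include <assert.h>

/* Illustrative stand-ins for the linux/preempt.h constants. */
#define SOFTIRQ_OFFSET  0x00000100u     /* set while serving a softirq */
#define HARDIRQ_MASK    0x000f0000u
#define NMI_MASK        0x00f00000u

/* Branchless context level: cumulative masks turn three boolean
 * tests into a sum in the range 0-3. */
static unsigned int context_index(unsigned int pc)
{
        unsigned int rctx = 0;

        rctx += !!(pc & (NMI_MASK));
        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

        return rctx;
}

int main(void)
{
        assert(context_index(0) == 0);                  /* task context */
        assert(context_index(SOFTIRQ_OFFSET) == 1);     /* softirq */
        assert(context_index(0x00010000) == 2);         /* hardirq */
        assert(context_index(0x00110000) == 3);         /* NMI over hardirq */
        return 0;
}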

2020-10-30 17:13:46

by Jesper Dangaard Brouer

Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()

On Fri, 30 Oct 2020 16:13:49 +0100
Peter Zijlstra <[email protected]> wrote:

> "Look ma, no branches!"
>
> Cc: Jesper Dangaard Brouer <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---

Cool trick! :-)

Acked-by: Jesper Dangaard Brouer <[email protected]>

> kernel/events/internal.h | 17 ++++++++---------
> 1 file changed, 8 insertions(+), 9 deletions(-)
>
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -205,16 +205,15 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
>
>  static inline int get_recursion_context(int *recursion)
>  {
> -        int rctx;
> +        unsigned int pc = preempt_count();
> +        unsigned int rctx = 0;
>
> -        if (unlikely(in_nmi()))
> -                rctx = 3;
> -        else if (in_irq())
> -                rctx = 2;
> -        else if (in_serving_softirq())
> -                rctx = 1;
> -        else
> -                rctx = 0;
> +        if (pc & (NMI_MASK))
> +                rctx++;
> +        if (pc & (NMI_MASK | HARDIRQ_MASK))
> +                rctx++;
> +        if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
> +                rctx++;
>
>          if (recursion[rctx])
>                  return -1;
>
>



--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2020-10-30 20:25:31

by Steven Rostedt

Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()

On Fri, 30 Oct 2020 18:11:38 +0100
Jesper Dangaard Brouer <[email protected]> wrote:

> On Fri, 30 Oct 2020 16:13:49 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > "Look ma, no branches!"
> >
> > Cc: Jesper Dangaard Brouer <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
>
> Cool trick! :-)
>
> Acked-by: Jesper Dangaard Brouer <[email protected]>
>
> > kernel/events/internal.h | 17 ++++++++---------
> > 1 file changed, 8 insertions(+), 9 deletions(-)
> >
> > --- a/kernel/events/internal.h
> > +++ b/kernel/events/internal.h
> > @@ -205,16 +205,15 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
> >
> >  static inline int get_recursion_context(int *recursion)
> >  {
> > -        int rctx;
> > +        unsigned int pc = preempt_count();
> > +        unsigned int rctx = 0;
> >
> > -        if (unlikely(in_nmi()))
> > -                rctx = 3;
> > -        else if (in_irq())
> > -                rctx = 2;
> > -        else if (in_serving_softirq())
> > -                rctx = 1;
> > -        else
> > -                rctx = 0;
> > +        if (pc & (NMI_MASK))
> > +                rctx++;
> > +        if (pc & (NMI_MASK | HARDIRQ_MASK))
> > +                rctx++;
> > +        if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
> > +                rctx++;
> >
> >          if (recursion[rctx])
> >                  return -1;
> >
> >

As this is something that ftrace recursion also does, perhaps we should
move this into interrupt.h so that anyone that needs a counter can get
it quickly, and not keep re-implementing it.

/*
 * Quickly find what context you are in.
 * 0 - normal
 * 1 - softirq
 * 2 - hard interrupt
 * 3 - NMI
 */
static inline int irq_context(void)
{
        unsigned int pc = preempt_count();
        int rctx = 0;

        if (pc & (NMI_MASK))
                rctx++;
        if (pc & (NMI_MASK | HARDIRQ_MASK))
                rctx++;
        if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
                rctx++;

        return rctx;
}

-- Steve

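As a sketch of how a shared helper like this gets consumed, a recursion
guard keeps one flag per context level, so an NMI landing inside a
hardirq callback trips a different slot than the callback it
interrupted. The names below are hypothetical, for illustration only;
real kernel code keeps the flags in per-CPU storage with compiler
barriers around the critical section, as perf's
get/put_recursion_context() pair does.

static int my_recursion[4];     /* task / softirq / hardirq / NMI */

static int my_callback_enter(void)
{
        int rctx = irq_context();

        if (my_recursion[rctx])
                return -1;      /* recursing within this context: bail */
        my_recursion[rctx] = 1;
        return rctx;
}

static void my_callback_exit(int rctx)
{
        my_recursion[rctx] = 0;
}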
2020-10-30 22:18:58

by Thomas Gleixner

Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()

On Fri, Oct 30 2020 at 16:22, Steven Rostedt wrote:
> As this is something that ftrace recursion also does, perhaps we should
> move this into interrupt.h so that anyone that needs a counter can get

Not in interrupt.h please. We should create kernel/include/ for stuff
which really should only be available in the core kernel code.

Thanks,

tglx


2020-10-30 23:06:24

by Peter Zijlstra

Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()

On Fri, Oct 30, 2020 at 04:22:48PM -0400, Steven Rostedt wrote:
> As this is something that ftrace recursion also does, perhaps we should
> move this into interrupt.h so that anyone that needs a counter can get
> it quickly, and not keep re-implementing it.

Works for me, however:

> /*
>  * Quickly find what context you are in.
>  * 0 - normal
>  * 1 - softirq
>  * 2 - hard interrupt
>  * 3 - NMI
>  */
> static inline int irq_context(void)
> {
>         unsigned int pc = preempt_count();
>         int rctx = 0;

unsigned

>
>         if (pc & (NMI_MASK))
>                 rctx++;
>         if (pc & (NMI_MASK | HARDIRQ_MASK))
>                 rctx++;
>         if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
>                 rctx++;
>
>         return rctx;
> }

otherwise you'll get an extra instruction to sign-extend it, which is
daft (yes, I've been staring at GCC output far too much).

Also, gcc-9 does worse (like 1 byte iirc) with:

        rctx += !!(pc & (NMI_MASK));
        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

but gcc-10 doesn't seem to care.

2020-10-30 23:33:28

by Steven Rostedt

Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()

On Fri, 30 Oct 2020 23:14:18 +0100
Thomas Gleixner <[email protected]> wrote:

> On Fri, Oct 30 2020 at 16:22, Steven Rostedt wrote:
> > As this is something that ftrace recursion also does, perhaps we should
> > move this into interrupt.h so that anyone that needs a counter can get
>
> Not in interrupt.h please. We should create kernel/include/ for stuff
> which really should only be available in the core kernel code.
>

The recursion protection is needed for anything that registers a
callback with ftrace. I have patches that already do basically the same
thing (although with branches), which I'm going to place in
include/linux/trace_recursion.h, so that there are helper functions
that ftrace callbacks can use instead of having to implement their own
recursion protection. After fixing two bugs in the recursion protection
code, it's something that should be reused instead of everyone making
similar mistakes.

Note, I thought that in_nmi() and friends were in interrupt.h, but they
are really in preempt.h. All the values used in Peter's code are also
defined in preempt.h, so why not have something like that there?

I take back adding it to interrupt.h; have it in preempt.h instead, as
it doesn't define anything new there.

-- Steve

2020-10-31 11:28:53

by Peter Zijlstra

Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()

On Fri, Oct 30, 2020 at 07:31:24PM -0400, Steven Rostedt wrote:

> Note, I thought that in_nmi() and friends were in interrupt.h, but they
> are really in preempt.h. All the values used in Peter's code are also
> defined in preempt.h, so why not have something like that there?
>
> I take back adding it to interrupt.h; have it in preempt.h instead, as
> it doesn't define anything new there.

Yeah, preempt.h is the right place for it.

2020-10-31 12:14:48

by David Laight

Subject: RE: [PATCH 4/6] perf: Optimize get_recursion_context()

From: Peter Zijlstra
> Sent: 30 October 2020 23:02
>
> On Fri, Oct 30, 2020 at 04:22:48PM -0400, Steven Rostedt wrote:
> > As this is something that ftrace recursion also does, perhaps we should
> > move this into interrupt.h so that anyone that needs a counter can get
> > it quickly, and not keep re-implementing it.
>
> Works for me, however:
>
> > /*
> >  * Quickly find what context you are in.
> >  * 0 - normal
> >  * 1 - softirq
> >  * 2 - hard interrupt
> >  * 3 - NMI
> >  */
> > static inline int irq_context(void)
> > {
> >         unsigned int pc = preempt_count();
> >         int rctx = 0;
>
> unsigned
>
> >
> >         if (pc & (NMI_MASK))
> >                 rctx++;
> >         if (pc & (NMI_MASK | HARDIRQ_MASK))
> >                 rctx++;
> >         if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
> >                 rctx++;
> >
> >         return rctx;
> > }
>
> otherwise you'll get an extra instruction to sign-extend it, which is
> daft (yes, I've been staring at GCC output far too much).
>
> Also, gcc-9 does worse (like 1 byte iirc) with:
>
>         rctx += !!(pc & (NMI_MASK));
>         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
>         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
>
> but gcc-10 doesn't seem to care.

You've made me look at some gcc output (it's raining).

The gcc 7.5.0 I have handy probably generates the best code for:

unsigned char q_2(unsigned int pc)
{
        unsigned char rctx = 0;

        rctx += !!(pc & (NMI_MASK));
        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

        return rctx;
}

0000000000000000 <q_2>:
   0:   f7 c7 00 00 f0 00       test   $0xf00000,%edi   # clock 0
   6:   0f 95 c0                setne  %al              # clock 1
   9:   f7 c7 00 00 ff 00       test   $0xff0000,%edi   # clock 0
   f:   0f 95 c2                setne  %dl              # clock 1
  12:   01 c2                   add    %eax,%edx        # clock 2
  14:   81 e7 00 01 ff 00       and    $0xff0100,%edi
  1a:   0f 95 c0                setne  %al
  1d:   01 d0                   add    %edx,%eax        # clock 3
  1f:   c3                      retq

I doubt that is beatable.

I've annotated the register dependency chain.
Likely to be 3 (or maybe 4) clocks.
The other versions are a lot worse (7 or 8) without allowing
for 'sbb' taking 2 clocks on a lot of Intel cpus.

David


2020-10-31 13:21:40

by David Laight

Subject: RE: [PATCH 4/6] perf: Optimize get_recursion_context()

From: David Laight
> Sent: 31 October 2020 12:12
>
...
> The gcc 7.5.0 I have handy probably generates the best code for:
>
> unsigned char q_2(unsigned int pc)
> {
>         unsigned char rctx = 0;
>
>         rctx += !!(pc & (NMI_MASK));
>         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
>         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
>
>         return rctx;
> }
>
> 0000000000000000 <q_2>:
>    0:   f7 c7 00 00 f0 00       test   $0xf00000,%edi   # clock 0
>    6:   0f 95 c0                setne  %al              # clock 1
>    9:   f7 c7 00 00 ff 00       test   $0xff0000,%edi   # clock 0
>    f:   0f 95 c2                setne  %dl              # clock 1
>   12:   01 c2                   add    %eax,%edx        # clock 2
>   14:   81 e7 00 01 ff 00       and    $0xff0100,%edi
>   1a:   0f 95 c0                setne  %al
>   1d:   01 d0                   add    %edx,%eax        # clock 3
>   1f:   c3                      retq
>
> I doubt that is beatable.

I lied, you should be able to get:
        test   $0xff0000,%edi   # clock 0
        setne  %al              # clock 1
        test   $0xff0100,%edi   # clock 0
        setne  %dl              # clock 1
        add    $0xfffff000,%edi
        adc    %dl,%al          # clock 2

But I suspect getting it from the compiler might be hard!

David


2020-11-09 12:15:35

by Peter Zijlstra

Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()

On Sat, Oct 31, 2020 at 12:11:42PM +0000, David Laight wrote:
> The gcc 7.5.0 I have handy probably generates the best code for:
>
> unsigned char q_2(unsigned int pc)
> {
>         unsigned char rctx = 0;
>
>         rctx += !!(pc & (NMI_MASK));
>         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
>         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
>
>         return rctx;
> }
>
> 0000000000000000 <q_2>:
>    0:   f7 c7 00 00 f0 00       test   $0xf00000,%edi   # clock 0
>    6:   0f 95 c0                setne  %al              # clock 1
>    9:   f7 c7 00 00 ff 00       test   $0xff0000,%edi   # clock 0
>    f:   0f 95 c2                setne  %dl              # clock 1
>   12:   01 c2                   add    %eax,%edx        # clock 2
>   14:   81 e7 00 01 ff 00       and    $0xff0100,%edi
>   1a:   0f 95 c0                setne  %al
>   1d:   01 d0                   add    %edx,%eax        # clock 3
>   1f:   c3                      retq
>
> I doubt that is beatable.
>
> I've annotated the register dependency chain.
> Likely to be 3 (or maybe 4) clocks.
> The other versions are a lot worse (7 or 8) without allowing
> for 'sbb' taking 2 clocks on a lot of Intel cpus.

https://godbolt.org/z/EfnG8E

Recent GCC just doesn't want to do that. Still, using u8 makes sense, so
I've kept that.

2020-11-09 14:17:09

by David Laight

Subject: RE: [PATCH 4/6] perf: Optimize get_recursion_context()



> -----Original Message-----
> From: Peter Zijlstra <[email protected]>
> Sent: 09 November 2020 12:13
> To: David Laight <[email protected]>
> Cc: Steven Rostedt <[email protected]>; Jesper Dangaard Brouer <[email protected]>;
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 4/6] perf: Optimize get_recursion_context()
>
> On Sat, Oct 31, 2020 at 12:11:42PM +0000, David Laight wrote:
> > The gcc 7.5.0 I have handy probably generates the best code for:
> >
> > unsigned char q_2(unsigned int pc)
> > {
> >         unsigned char rctx = 0;
> >
> >         rctx += !!(pc & (NMI_MASK));
> >         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
> >         rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
> >
> >         return rctx;
> > }
> >
> > 0000000000000000 <q_2>:
> >    0:   f7 c7 00 00 f0 00       test   $0xf00000,%edi   # clock 0
> >    6:   0f 95 c0                setne  %al              # clock 1
> >    9:   f7 c7 00 00 ff 00       test   $0xff0000,%edi   # clock 0
> >    f:   0f 95 c2                setne  %dl              # clock 1
> >   12:   01 c2                   add    %eax,%edx        # clock 2
> >   14:   81 e7 00 01 ff 00       and    $0xff0100,%edi
> >   1a:   0f 95 c0                setne  %al
> >   1d:   01 d0                   add    %edx,%eax        # clock 3
> >   1f:   c3                      retq
> >
> > I doubt that is beatable.
> >
> > I've annotated the register dependency chain.
> > Likely to be 3 (or maybe 4) clocks.
> > The other versions are a lot worse (7 or 8) without allowing
> > for 'sbb' taking 2 clocks on a lot of Intel cpus.
>
> https://godbolt.org/z/EfnG8E
>
> Recent GCC just doesn't want to do that. Still, using u8 makes sense, so
> I've kept that.

u8 helps x86 because its 'setne' only affects the low 8 bits.
I guess that seemed a good idea when it was added (386).
It doesn't seem to make the other architectures much worse.

gcc 10.x can be persuaded to generate the above code.

https://godbolt.org/z/6GoT94

It sometimes seems to me that every new version of gcc is
larger, slower and generates worse code than the previous one.

David


Subject: [tip: perf/urgent] perf: Optimize get_recursion_context()

The following commit has been merged into the perf/urgent branch of tip:

Commit-ID: 09da9c81253dd8e43e0d2d7cea02de6f9f19499d
Gitweb: https://git.kernel.org/tip/09da9c81253dd8e43e0d2d7cea02de6f9f19499d
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 30 Oct 2020 13:43:16 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 09 Nov 2020 18:12:34 +01:00

perf: Optimize get_recursion_context()

"Look ma, no branches!"

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/events/internal.h | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 402054e..228801e 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -205,16 +205,12 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

 static inline int get_recursion_context(int *recursion)
 {
-        int rctx;
-
-        if (unlikely(in_nmi()))
-                rctx = 3;
-        else if (in_irq())
-                rctx = 2;
-        else if (in_serving_softirq())
-                rctx = 1;
-        else
-                rctx = 0;
+        unsigned int pc = preempt_count();
+        unsigned char rctx = 0;
+
+        rctx += !!(pc & (NMI_MASK));
+        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
+        rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

         if (recursion[rctx])
                 return -1;
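
The hunk ends at the early return; for context, the rest of the
function in kernel/events/internal.h of this era marks the slot busy
and is paired with put_recursion_context(), roughly as follows (quoted
from memory, so treat it as a sketch rather than the verbatim file):

        recursion[rctx]++;
        barrier();

        return rctx;
}

static inline void put_recursion_context(int *recursion, int rctx)
{
        barrier();
        recursion[rctx]--;
}

perf_swevent_get_recursion_context() hands this function its per-CPU
recursion array, and the caller undoes the marking via the put side
once the event has been handled.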