2021-03-05 03:46:40

by Linus Torvalds

[permalink] [raw]
Subject: "struct perf_sample_data" alignment

Sp there's a note about new warnings in 5.12-rc1 that I looked at, and
many of the warnings made me go "Whaaa?". They were all of the type

warning: 'perf_event_aux_event' uses dynamic stack allocation

and then when I go look, I see nothing that looks like a dynamic stack
allocation at all.

But you know what? The warning is kind of misleading, but at the same
time it's true in a sense. The problem is that the function (and a lot
of other functions) has a local variable like this:

struct perf_sample_data sample;

and the definition of that "struct perf_sample_data" has
____cacheline_aligned at the end.

And guess what? That means that now the compiler has actually to play
games with manually aligning the frame of that function, since the
natural stack alignment is *not* a full cacheline aligned thing. So
it's kind of true: the frame allocation is mnot a simple static thing,
it's a nasty complex thing that wastes memory and time.

That ____cacheline_aligned goes back many years, this is not new, it
seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
the perf_sample_data struct layout").

But it really seems entirely and utterly bogus. That cacheline
alignment makes things *worse*, when the variables are on the local
stack. The local stack is already going to be dirty and in the cache,
and aligning those things isn't going to - and I quote from the code
in that commend in that commit - "minimize the cachelines touched".

Quite the reverse. It's just going to make the stack frame use *more*
memory, and make any cacheline usage _worse_.

Maybe there are static (non-automatic) cases of that "struct
perf_sample_data" where the alignment makes sense, but it does not
seem to make sense on the _type_. It should be on those individual
variables, not on every perf_sample_data.

I didn't make any real effort to analyze this, but from a few quick
greps, it looks like every single case is an automatic variable on the
stack, and that the forced alignment is actually a bad thing every
single time. It never ever generates better cache use, but it _does_
generate worse code, and bigger stack frames.

Hmm? What am I missing?

Linus


2021-03-05 08:38:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: "struct perf_sample_data" alignment

On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote:
> That ____cacheline_aligned goes back many years, this is not new, it
> seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
> the perf_sample_data struct layout").

long time ago...

> But it really seems entirely and utterly bogus. That cacheline
> alignment makes things *worse*, when the variables are on the local
> stack. The local stack is already going to be dirty and in the cache,
> and aligning those things isn't going to - and I quote from the code
> in that commend in that commit - "minimize the cachelines touched".
>
> Quite the reverse. It's just going to make the stack frame use *more*
> memory, and make any cacheline usage _worse_.

IIRC there is more history here, but I can't seem to find references
just now.

What I remember is that since perf_sample_data is fairly large,
unconditionally initializing the whole thing is *slow* (and
-fauto-var-init=zero will hurt here).

So at some point I removed that full initialization and made sure we
only unconditionally touched the first few variables, which gave a
measurable speedup.

Then things got messy again and the commit 2565711fb7d7 referenced above
was cleanup, to get back to that initial state.

Now, you're right that __cacheline_aligned on on-stack (and this is
indeed mostly on-stack) is fairly tedious (there were a few patches
recently to reduce the amount of on-stack instances).

I'll put it on the todo list, along with that hotplug stuff (which I
tried to fix but ended up with an even bigger mess). I suppose we can
try and not have the alignment for the on-stack instances while
preserving it for the few off-stack ones.

Also; we're running on the NMI stack, and that's not typically hot.

2021-03-05 15:01:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: "struct perf_sample_data" alignment

On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote:
> > That ____cacheline_aligned goes back many years, this is not new, it
> > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
> > the perf_sample_data struct layout").
>
> long time ago...
>
> > But it really seems entirely and utterly bogus. That cacheline
> > alignment makes things *worse*, when the variables are on the local
> > stack. The local stack is already going to be dirty and in the cache,
> > and aligning those things isn't going to - and I quote from the code
> > in that commend in that commit - "minimize the cachelines touched".
> >
> > Quite the reverse. It's just going to make the stack frame use *more*
> > memory, and make any cacheline usage _worse_.
>
> IIRC there is more history here, but I can't seem to find references
> just now.
>
> What I remember is that since perf_sample_data is fairly large,
> unconditionally initializing the whole thing is *slow* (and
> -fauto-var-init=zero will hurt here).
>
> So at some point I removed that full initialization and made sure we
> only unconditionally touched the first few variables, which gave a
> measurable speedup.
>
> Then things got messy again and the commit 2565711fb7d7 referenced above
> was cleanup, to get back to that initial state.
>
> Now, you're right that __cacheline_aligned on on-stack (and this is
> indeed mostly on-stack) is fairly tedious (there were a few patches
> recently to reduce the amount of on-stack instances).
>
> I'll put it on the todo list, along with that hotplug stuff (which I
> tried to fix but ended up with an even bigger mess). I suppose we can
> try and not have the alignment for the on-stack instances while
> preserving it for the few off-stack ones.
>
> Also; we're running on the NMI stack, and that's not typically hot.

This seems to be it... (completely untested)

---
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..918a296d2ca2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1032,7 +1032,9 @@ struct perf_sample_data {
u64 cgroup;
u64 data_page_size;
u64 code_page_size;
-} ____cacheline_aligned;
+};
+
+typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned;

/* default value for data source */
#define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index b0c45d923f0f..f32c623abef6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
* bpf_perf_event_output
*/
struct bpf_trace_sample_data {
- struct perf_sample_data sds[3];
+ perf_sample_data_t sds[3];
};

static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);

2021-03-05 15:59:23

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: "struct perf_sample_data" alignment

On Fri, Mar 5, 2021 at 7:01 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote:
> > > That ____cacheline_aligned goes back many years, this is not new, it
> > > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
> > > the perf_sample_data struct layout").
> >
> > long time ago...
> >
> > > But it really seems entirely and utterly bogus. That cacheline
> > > alignment makes things *worse*, when the variables are on the local
> > > stack. The local stack is already going to be dirty and in the cache,
> > > and aligning those things isn't going to - and I quote from the code
> > > in that commend in that commit - "minimize the cachelines touched".
> > >
> > > Quite the reverse. It's just going to make the stack frame use *more*
> > > memory, and make any cacheline usage _worse_.
> >
> > IIRC there is more history here, but I can't seem to find references
> > just now.
> >
> > What I remember is that since perf_sample_data is fairly large,
> > unconditionally initializing the whole thing is *slow* (and
> > -fauto-var-init=zero will hurt here).
> >
> > So at some point I removed that full initialization and made sure we
> > only unconditionally touched the first few variables, which gave a
> > measurable speedup.
> >
> > Then things got messy again and the commit 2565711fb7d7 referenced above
> > was cleanup, to get back to that initial state.
> >
> > Now, you're right that __cacheline_aligned on on-stack (and this is
> > indeed mostly on-stack) is fairly tedious (there were a few patches
> > recently to reduce the amount of on-stack instances).
> >
> > I'll put it on the todo list, along with that hotplug stuff (which I
> > tried to fix but ended up with an even bigger mess). I suppose we can
> > try and not have the alignment for the on-stack instances while
> > preserving it for the few off-stack ones.
> >
> > Also; we're running on the NMI stack, and that's not typically hot.
>
> This seems to be it... (completely untested)
>
> ---
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3f7f89ea5e51..918a296d2ca2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1032,7 +1032,9 @@ struct perf_sample_data {
> u64 cgroup;
> u64 data_page_size;
> u64 code_page_size;
> -} ____cacheline_aligned;
> +};
> +
> +typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned;
>
> /* default value for data source */
> #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index b0c45d923f0f..f32c623abef6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
> * bpf_perf_event_output
> */
> struct bpf_trace_sample_data {
> - struct perf_sample_data sds[3];
> + perf_sample_data_t sds[3];

bpf side doesn't care about about cacheline aligned.
No need to add new typedef just for that.

2021-03-06 13:18:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: "struct perf_sample_data" alignment


* Alexei Starovoitov <[email protected]> wrote:

> > This seems to be it... (completely untested)
> >
> > ---
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..918a296d2ca2 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -1032,7 +1032,9 @@ struct perf_sample_data {
> > u64 cgroup;
> > u64 data_page_size;
> > u64 code_page_size;
> > -} ____cacheline_aligned;
> > +};
> > +
> > +typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned;
> >
> > /* default value for data source */
> > #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index b0c45d923f0f..f32c623abef6 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
> > * bpf_perf_event_output
> > */
> > struct bpf_trace_sample_data {
> > - struct perf_sample_data sds[3];
> > + perf_sample_data_t sds[3];
>
> bpf side doesn't care about about cacheline aligned.
> No need to add new typedef just for that.

So this structure is not supposed to be exposed to any ABI anywhere.

I did a (non-exhaustive) search of tooling, and there doesn't appear
to be any accidental exposure.

The in-kernel ABI interaction appears to be the following:

- In __perf_event_header_size() we only use fields within
perf_sample_data to size the header. Alignment won't change any of
the output.

- Ditto in perf_event__id_header_size().

I.e. I think we should just zap it per the patch below (untested).

Thanks,

Ingo

============>

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..d75e03ff31ea 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1032,7 +1032,7 @@ struct perf_sample_data {
u64 cgroup;
u64 data_page_size;
u64 code_page_size;
-} ____cacheline_aligned;
+};

/* default value for data source */
#define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\

2021-03-08 10:23:19

by David Laight

[permalink] [raw]
Subject: RE: "struct perf_sample_data" alignment

...
> What I remember is that since perf_sample_data is fairly large,
> unconditionally initializing the whole thing is *slow* (and
> -fauto-var-init=zero will hurt here).

That will hurt everywhere.
I can also imagine it hiding bugs and making people
shrink on-stack arrays to the point where they either overrun
or cause an unexpected error or string truncation.

Initialising to zero is a bad choice if the aim is to
avoid leaking stack.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)