2022-02-16 08:33:31

by Feng Tang

Subject: [PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

0day has reported many strange performance changes (regressions or
improvements) in which there was no obvious relation between the culprit
commit and the benchmark at first glance, which leads people to suspect
that the test itself is wrong.

Upon further investigation, many of these cases turn out to be caused by
changes to the alignment of kernel text or data: since the kernel's whole
text/data is linked together, a change in one domain can affect the
alignment of other domains.

To help quickly identify whether a strange performance change is caused
by _data_ alignment, add a debug option to force the data sections from
all .o files to be aligned on THREAD_SIZE, so that a change in one
domain won't affect other modules' data alignment.

We have used this option to check some strange kernel changes [1][2][3],
and those performance changes disappeared after enabling it, which proved
they were data alignment related. Besides these publicly reported cases,
other similar cases have recently been found by 0day, and this option
has been actively used by 0day for analyzing strange performance changes.

Similarly, there is another kernel debug option for checking text alignment
related performance changes: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B,
which forces every function's start address to be 64-byte aligned.

This option depends on CONFIG_DYNAMIC_DEBUG==n, as the '__dyndbg' subsection
of .data has a hard requirement of ALIGN(8), as shown in 'vmlinux.lds':

"
. = ALIGN(8); __start___dyndbg = .; KEEP(*(__dyndbg)) __stop___dyndbg = .;
"

It contains all the pointers to 'struct _ddebug', and dynamic_debug_init()
loops over them with "pointer++"; that walk would be broken with this
option enabled.

[1]. https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/
[2]. https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
[3]. https://lore.kernel.org/lkml/20201112140625.GA21612@xsang-OptiPlex-9020/

Signed-off-by: Feng Tang <[email protected]>
---
Changelog:

since RFC (https://lore.kernel.org/lkml/[email protected]/)
* rebase against 5.17-rc1
* modify the changelog adding more recent info

arch/x86/Kconfig.debug | 13 +++++++++++++
arch/x86/kernel/vmlinux.lds.S | 7 ++++++-
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index d3a6f74a94bd..7b8cfe9717f2 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -225,6 +225,19 @@ config PUNIT_ATOM_DEBUG
The current power state can be read from
/sys/kernel/debug/punit_atom/dev_power_state

+config DEBUG_FORCE_DATA_SECTION_ALIGNED
+ bool "Force all data sections to be THREAD_SIZE aligned"
+ depends on EXPERT && !DYNAMIC_DEBUG
+ help
+ A commit in one kernel domain can change the alignment of
+ data sections in other domains, as they are all linked
+ together compactly, causing unexplained performance changes
+ (regressions or improvements) that are hard to debug.
+ Enabling this option helps verify whether such a change
+ is caused by data alignment.
+
+ This is mainly for debugging and performance tuning.
+
choice
prompt "Choose kernel unwinder"
default UNWINDER_ORC if X86_64
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 27f830345b6f..79a631476513 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -155,7 +155,12 @@ SECTIONS
X86_ALIGN_RODATA_END

/* Data */
- .data : AT(ADDR(.data) - LOAD_OFFSET) {
+ .data : AT(ADDR(.data) - LOAD_OFFSET)
+#ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
+ /* Use the biggest alignment of below sections */
+ SUBALIGN(THREAD_SIZE)
+#endif
+ {
/* Start of data section */
_sdata = .;

--
2.27.0


2022-02-16 14:30:09

by Denys Vlasenko

Subject: Re: [PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

On 2/16/22 9:28 AM, Feng Tang wrote:
> 0day has reported many strange performance changes (regressions or
> improvements) in which there was no obvious relation between the culprit
> commit and the benchmark at first glance, which leads people to suspect
> that the test itself is wrong.
>
> Upon further investigation, many of these cases turn out to be caused by
> changes to the alignment of kernel text or data: since the kernel's whole
> text/data is linked together, a change in one domain can affect the
> alignment of other domains.
>
> To help quickly identify whether a strange performance change is caused
> by _data_ alignment, add a debug option to force the data sections from
> all .o files to be aligned on THREAD_SIZE, so that a change in one
> domain won't affect other modules' data alignment.
>
> We have used this option to check some strange kernel changes [1][2][3],
> and those performance changes disappeared after enabling it, which proved
> they were data alignment related. Besides these publicly reported cases,
> other similar cases have recently been found by 0day, and this option
> has been actively used by 0day for analyzing strange performance changes.
...
> + .data : AT(ADDR(.data) - LOAD_OFFSET)
> +#ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
> + /* Use the biggest alignment of below sections */
> + SUBALIGN(THREAD_SIZE)
> +#endif

"Align every input section to 4096 bytes" ?

This is way, way, WAY too much. The added padding will be very wasteful.

Performance differences are likely to be caused by cacheline alignment.
Factoring in an odd hardware prefetcher grabbing an additional
cacheline after every accessed one, I'd say alignment to 128 bytes
(on x86) should suffice for almost any scenario. Even 64 bytes
would almost always work fine.

The hardware prefetcher grabbing an additional cacheline was seen
adversely affecting locking performance in a structure: developers
thought the two locks were not in the same cacheline, but because of
this "optimization" they effectively are, and thus they bounce
between CPUs. (1) A linker script can't help with this, since it was a
struct layout issue, not a section alignment issue.
(2) This "optimization" (unconditional fetch of the next cacheline)
might be bad enough to warrant detecting and disabling it at boot.

2022-02-17 11:37:43

by Feng Tang

Subject: Re: [PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

On Thu, Feb 17, 2022 at 02:22:32PM +0800, Feng Tang wrote:
> Hi Denys,
>
> Thanks for the review!
>
> On Wed, Feb 16, 2022 at 02:35:10PM +0100, Denys Vlasenko wrote:
> > On 2/16/22 9:28 AM, Feng Tang wrote:
> > > 0day has reported many strange performance changes (regressions or
> > > improvements) in which there was no obvious relation between the culprit
> > > commit and the benchmark at first glance, which leads people to suspect
> > > that the test itself is wrong.
> > >
> > > Upon further investigation, many of these cases turn out to be caused by
> > > changes to the alignment of kernel text or data: since the kernel's whole
> > > text/data is linked together, a change in one domain can affect the
> > > alignment of other domains.
> > >
> > > To help quickly identify whether a strange performance change is caused
> > > by _data_ alignment, add a debug option to force the data sections from
> > > all .o files to be aligned on THREAD_SIZE, so that a change in one
> > > domain won't affect other modules' data alignment.
> > >
> > > We have used this option to check some strange kernel changes [1][2][3],
> > > and those performance changes disappeared after enabling it, which proved
> > > they were data alignment related. Besides these publicly reported cases,
> > > other similar cases have recently been found by 0day, and this option
> > > has been actively used by 0day for analyzing strange performance changes.
> > ...
> > > + .data : AT(ADDR(.data) - LOAD_OFFSET)
> > > +#ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
> > > + /* Use the biggest alignment of below sections */
> > > + SUBALIGN(THREAD_SIZE)
> > > +#endif
> >
> > "Align every input section to 4096 bytes" ?
>
> The THREAD_SIZE depends on ARCH, and could be 16KB or more (KASAN related).
>
> > This is way, way, WAY too much. The added padding will be very wasteful.
>
> Yes, it's too much. A typical kernel's data section could be bumped
> from 18MB to 50+ MB.
>
> And I should make it clearer that this is for debugging only.
>
> > Performance differences are likely to be caused by cacheline alignment.
> > Factoring in an odd hardware prefetcher grabbing an additional
> > cacheline after every accessed one, I'd say alignment to 128 bytes
> > (on x86) should suffice for almost any scenario. Even 64 bytes
> > would almost always work fine.
>
> Yep, when I started this, I tried 128 bytes to account for the HW
> adjacent-cacheline prefetch. But the built kernel wouldn't boot; then I
> tried 512/1024/2048/4096, but there were still boot issues.

I just ran these tests again, and confirmed that a kernel built with
alignment <= 2048 will not boot on x86 HW, while 4096 and bigger works.

The reason is that SUBALIGN() takes precedence over the alignment
requests inside the data sections, as seen in the expanded vmlinux.lds:

.data : AT(ADDR(.data) - 0xffffffff80000000)
SUBALIGN(2048)
{
_sdata = .;
. = ALIGN(((1 << 12) << (2 + 0))); __start_init_task = .; init_thread_union = .; init_stack = .; KEEP(*(.data..init_task)) KEEP(*(.data..init_thread_info)) . = __start_init_task + ((1 << 12) << (2 + 0)); __end_init_task = .;
. = ALIGN((1 << 12)); *(.data..page_aligned) . = ALIGN((1 << 12));
...

And the *(.data..page_aligned) section asks for 4KB alignment, but our
2048-byte setting breaks it, as confirmed by System.map:

ffffffff82800000 D _sdata
ffffffff82800000 D init_stack
ffffffff82800000 D init_thread_union
ffffffff82804000 D __end_init_task
ffffffff82804000 d bringup_idt_table
ffffffff82804800 D __vsyscall_page <--------- broken
ffffffff82806000 d hpet

But this also means that, if necessary, we can change the alignment
from THREAD_SIZE to PAGE_SIZE, which will make the data sections
much smaller.

Thanks,
Feng

> So I
> chose the biggest alignment used within the data sections, considering
> this is purely for analysis/debugging.

2022-02-17 14:00:21

by Feng Tang

Subject: Re: [PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

Hi Denys,

Thanks for the review!

On Wed, Feb 16, 2022 at 02:35:10PM +0100, Denys Vlasenko wrote:
> On 2/16/22 9:28 AM, Feng Tang wrote:
> > 0day has reported many strange performance changes (regressions or
> > improvements) in which there was no obvious relation between the culprit
> > commit and the benchmark at first glance, which leads people to suspect
> > that the test itself is wrong.
> >
> > Upon further investigation, many of these cases turn out to be caused by
> > changes to the alignment of kernel text or data: since the kernel's whole
> > text/data is linked together, a change in one domain can affect the
> > alignment of other domains.
> >
> > To help quickly identify whether a strange performance change is caused
> > by _data_ alignment, add a debug option to force the data sections from
> > all .o files to be aligned on THREAD_SIZE, so that a change in one
> > domain won't affect other modules' data alignment.
> >
> > We have used this option to check some strange kernel changes [1][2][3],
> > and those performance changes disappeared after enabling it, which proved
> > they were data alignment related. Besides these publicly reported cases,
> > other similar cases have recently been found by 0day, and this option
> > has been actively used by 0day for analyzing strange performance changes.
> ...
> > + .data : AT(ADDR(.data) - LOAD_OFFSET)
> > +#ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
> > + /* Use the biggest alignment of below sections */
> > + SUBALIGN(THREAD_SIZE)
> > +#endif
>
> "Align every input section to 4096 bytes" ?

The THREAD_SIZE depends on ARCH, and could be 16KB or more (KASAN related).

> This is way, way, WAY too much. The added padding will be very wasteful.

Yes, it's too much. A typical kernel's data section could be bumped
from 18MB to 50+ MB.

And I should make it clearer that this is for debugging only.

> Performance differences are likely to be caused by cacheline alignment.
> Factoring in an odd hardware prefetcher grabbing an additional
> cacheline after every accessed one, I'd say alignment to 128 bytes
> (on x86) should suffice for almost any scenario. Even 64 bytes
> would almost always work fine.

Yep, when I started this, I tried 128 bytes to account for the HW
adjacent-cacheline prefetch. But the built kernel wouldn't boot; then I
tried 512/1024/2048/4096, but there were still boot issues. So I
chose the biggest alignment used within the data sections, considering
this is purely for analysis/debugging.

Thanks,
Feng


> The hardware prefetcher grabbing an additional cacheline was seen
> adversely affecting locking performance in a structure: developers
> thought the two locks were not in the same cacheline, but because of
> this "optimization" they effectively are, and thus they bounce
> between CPUs. (1) A linker script can't help with this, since it was a
> struct layout issue, not a section alignment issue.
> (2) This "optimization" (unconditional fetch of the next cacheline)
> might be bad enough to warrant detecting and disabling it at boot.