2021-06-29 00:02:41

by Kees Cook

[permalink] [raw]
Subject: [GIT PULL] Clang feature updates for v5.14-rc1

Hi Linus,

Please pull these Clang feature updates for v5.14-rc1.

Thanks!

-Kees

The following changes since commit d07f6ca923ea0927a1024dfccafc5b53b61cfecc:

Linux 5.13-rc2 (2021-05-16 15:27:44 -0700)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/clang-features-v5.14-rc1

for you to fetch changes up to 6a0544606ec7f03e4a2534c87ea989de4bac41ae:

pgo: rectify comment to proper kernel-doc syntax (2021-06-28 12:10:31 -0700)

----------------------------------------------------------------
Clang feature updates for v5.14-rc1

The big addition for this merge window is the core support for Clang's
Profile Guided Optimization, which lets Clang build the kernel for
improved performance when running specific kernel workloads. This
currently covers only vmlinux, but module support is under active
development. (Sami Tolvanen, Bill Wendling, Kees Cook, Jarmo Tiitto,
Lukas Bulwahn)

Added CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in
the face of the noinstr attribute, paving the way for PGO and fixing
GCOV. (Nick Desaulniers)

x86_64 LTO coverage is expaned to 32-bit x86. (Nathan Chancellor)

Small fixes to CFI. (Mark Rutland, Nathan Chancellor)

----------------------------------------------------------------
Bill Wendling (1):
pgo: rename the raw profile file to vmlinux.profraw

Jarmo Tiitto (2):
pgo: Limit allocate_node() to vmlinux sections
pgo: Fix sleep in atomic section in prf_open()

Kees Cook (2):
MAINTAINERS: Expand and relocate PGO entry
pgo: Clean up prf_open() error paths

Lukas Bulwahn (1):
pgo: rectify comment to proper kernel-doc syntax

Mark Rutland (1):
CFI: Move function_nocfi() into compiler.h

Nathan Chancellor (2):
MAINTAINERS: Add Clang CFI section
x86, lto: Enable Clang LTO for 32-bit as well

Nick Desaulniers (3):
compiler_attributes.h: define __no_profile, add to noinstr
compiler_attributes.h: cleanups for GCC 4.9+
Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR

Sami Tolvanen (1):
pgo: Add Clang's Profile Guided Optimization infrastructure

Documentation/dev-tools/index.rst | 1 +
Documentation/dev-tools/pgo.rst | 127 +++++++++++
MAINTAINERS | 25 ++
Makefile | 3 +
arch/Kconfig | 8 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/compiler.h | 16 ++
arch/arm64/include/asm/memory.h | 16 --
arch/s390/Kconfig | 1 +
arch/x86/Kconfig | 6 +-
arch/x86/boot/Makefile | 1 +
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/crypto/Makefile | 3 +
arch/x86/entry/vdso/Makefile | 1 +
arch/x86/kernel/Makefile | 3 +
arch/x86/kernel/vmlinux.lds.S | 2 +
arch/x86/platform/efi/Makefile | 1 +
arch/x86/purgatory/Makefile | 1 +
arch/x86/realmode/rm/Makefile | 1 +
arch/x86/um/vdso/Makefile | 1 +
drivers/firmware/efi/libstub/Makefile | 1 +
include/asm-generic/vmlinux.lds.h | 32 +++
include/linux/compiler.h | 10 +
include/linux/compiler_attributes.h | 19 +-
include/linux/compiler_types.h | 2 +-
include/linux/mm.h | 10 -
init/Kconfig | 3 +
kernel/Makefile | 1 +
kernel/gcov/Kconfig | 1 +
kernel/pgo/Kconfig | 37 +++
kernel/pgo/Makefile | 5 +
kernel/pgo/fs.c | 413 ++++++++++++++++++++++++++++++++++
kernel/pgo/instrument.c | 188 ++++++++++++++++
kernel/pgo/pgo.h | 211 +++++++++++++++++
scripts/Makefile.lib | 10 +
35 files changed, 1130 insertions(+), 32 deletions(-)
create mode 100644 Documentation/dev-tools/pgo.rst
create mode 100644 kernel/pgo/Kconfig
create mode 100644 kernel/pgo/Makefile
create mode 100644 kernel/pgo/fs.c
create mode 100644 kernel/pgo/instrument.c
create mode 100644 kernel/pgo/pgo.h

--
Kees Cook


2021-06-29 02:50:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Mon, Jun 28, 2021 at 12:32 PM Kees Cook <[email protected]> wrote:
>
> The big addition for this merge window is the core support for Clang's
> Profile Guided Optimization, which lets Clang build the kernel for
> improved performance when running specific kernel workloads. This
> currently covers only vmlinux, but module support is under active
> development. (Sami Tolvanen, Bill Wendling, Kees Cook, Jarmo Tiitto,
> Lukas Bulwahn)

Am I misreading this?

The PGO data seems to be done by using clang instrumentation, instead
of done sanely using sample data from a regular "perf" run?

That odd decision seems to not be documented anywhere, and it seems
odd and counter-productive, and causes all that odd special buffer
handling and that vmlinux.profraw file etc.

And it causes the kernel to be bigger and run slower.

The actual link to the clang pgo documentation even says that there is
already support for converting regular "perf" profile output to pgo
data, yet that model isn't actually used even though it appears vastly
superior.

Why use an inferior compiler instructmentation profile when we have
the much better tools?

Linus

2021-06-29 13:16:07

by Mark Rutland

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

Hi Kees,

On Mon, Jun 28, 2021 at 12:32:24PM -0700, Kees Cook wrote:
> Hi Linus,
>
> Please pull these Clang feature updates for v5.14-rc1.
>
> Thanks!
>
> -Kees
>
> The following changes since commit d07f6ca923ea0927a1024dfccafc5b53b61cfecc:
>
> Linux 5.13-rc2 (2021-05-16 15:27:44 -0700)
>
> are available in the Git repository at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/clang-features-v5.14-rc1
>
> for you to fetch changes up to 6a0544606ec7f03e4a2534c87ea989de4bac41ae:
>
> pgo: rectify comment to proper kernel-doc syntax (2021-06-28 12:10:31 -0700)
>
> ----------------------------------------------------------------
> Clang feature updates for v5.14-rc1
>
> The big addition for this merge window is the core support for Clang's
> Profile Guided Optimization, which lets Clang build the kernel for
> improved performance when running specific kernel workloads. This
> currently covers only vmlinux, but module support is under active
> development. (Sami Tolvanen, Bill Wendling, Kees Cook, Jarmo Tiitto,
> Lukas Bulwahn)

I thought the PGO stuff was on hold given Peter had open concerns, e.g.

https://lore.kernel.org/r/[email protected]

... and there didn't seem to be a strong conclusion to the contrary.

> Added CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in
> the face of the noinstr attribute, paving the way for PGO and fixing
> GCOV. (Nick Desaulniers)
>
> x86_64 LTO coverage is expaned to 32-bit x86. (Nathan Chancellor)
>
> Small fixes to CFI. (Mark Rutland, Nathan Chancellor)

FWIW, all the rest of this looks good to me.

Thanks,
Mark.

>
> ----------------------------------------------------------------
> Bill Wendling (1):
> pgo: rename the raw profile file to vmlinux.profraw
>
> Jarmo Tiitto (2):
> pgo: Limit allocate_node() to vmlinux sections
> pgo: Fix sleep in atomic section in prf_open()
>
> Kees Cook (2):
> MAINTAINERS: Expand and relocate PGO entry
> pgo: Clean up prf_open() error paths
>
> Lukas Bulwahn (1):
> pgo: rectify comment to proper kernel-doc syntax
>
> Mark Rutland (1):
> CFI: Move function_nocfi() into compiler.h
>
> Nathan Chancellor (2):
> MAINTAINERS: Add Clang CFI section
> x86, lto: Enable Clang LTO for 32-bit as well
>
> Nick Desaulniers (3):
> compiler_attributes.h: define __no_profile, add to noinstr
> compiler_attributes.h: cleanups for GCC 4.9+
> Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
>
> Sami Tolvanen (1):
> pgo: Add Clang's Profile Guided Optimization infrastructure
>
> Documentation/dev-tools/index.rst | 1 +
> Documentation/dev-tools/pgo.rst | 127 +++++++++++
> MAINTAINERS | 25 ++
> Makefile | 3 +
> arch/Kconfig | 8 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/compiler.h | 16 ++
> arch/arm64/include/asm/memory.h | 16 --
> arch/s390/Kconfig | 1 +
> arch/x86/Kconfig | 6 +-
> arch/x86/boot/Makefile | 1 +
> arch/x86/boot/compressed/Makefile | 1 +
> arch/x86/crypto/Makefile | 3 +
> arch/x86/entry/vdso/Makefile | 1 +
> arch/x86/kernel/Makefile | 3 +
> arch/x86/kernel/vmlinux.lds.S | 2 +
> arch/x86/platform/efi/Makefile | 1 +
> arch/x86/purgatory/Makefile | 1 +
> arch/x86/realmode/rm/Makefile | 1 +
> arch/x86/um/vdso/Makefile | 1 +
> drivers/firmware/efi/libstub/Makefile | 1 +
> include/asm-generic/vmlinux.lds.h | 32 +++
> include/linux/compiler.h | 10 +
> include/linux/compiler_attributes.h | 19 +-
> include/linux/compiler_types.h | 2 +-
> include/linux/mm.h | 10 -
> init/Kconfig | 3 +
> kernel/Makefile | 1 +
> kernel/gcov/Kconfig | 1 +
> kernel/pgo/Kconfig | 37 +++
> kernel/pgo/Makefile | 5 +
> kernel/pgo/fs.c | 413 ++++++++++++++++++++++++++++++++++
> kernel/pgo/instrument.c | 188 ++++++++++++++++
> kernel/pgo/pgo.h | 211 +++++++++++++++++
> scripts/Makefile.lib | 10 +
> 35 files changed, 1130 insertions(+), 32 deletions(-)
> create mode 100644 Documentation/dev-tools/pgo.rst
> create mode 100644 kernel/pgo/Kconfig
> create mode 100644 kernel/pgo/Makefile
> create mode 100644 kernel/pgo/fs.c
> create mode 100644 kernel/pgo/instrument.c
> create mode 100644 kernel/pgo/pgo.h
>
> --
> Kees Cook

2021-06-29 20:45:37

by Kees Cook

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Mon, Jun 28, 2021 at 07:49:04PM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2021 at 12:32 PM Kees Cook <[email protected]> wrote:
> >
> > The big addition for this merge window is the core support for Clang's
> > Profile Guided Optimization, which lets Clang build the kernel for
> > improved performance when running specific kernel workloads. This
> > currently covers only vmlinux, but module support is under active
> > development. (Sami Tolvanen, Bill Wendling, Kees Cook, Jarmo Tiitto,
> > Lukas Bulwahn)
>
> Am I misreading this?
>
> The PGO data seems to be done by using clang instrumentation, instead
> of done sanely using sample data from a regular "perf" run?

Right, yes. My understanding is that PGO is measurably better than
sample-based profiling. Additionally, it's arch-agnostic (not that that's
meaningful here with only x86 finished), and can gain other analysis
features that aren't possible with perf. I'll let Nick, Fangrui, Bill,
or Sami answer this more directly.

In the meantime I will split the pull request into "PGO" and "everything
else".

> That odd decision seems to not be documented anywhere, and it seems
> odd and counter-productive, and causes all that odd special buffer
> handling and that vmlinux.profraw file etc.
>
> And it causes the kernel to be bigger and run slower.

Right -- that's expected. It's not designed to be the final kernel
someone uses. :)

-Kees

--
Kees Cook

2021-06-29 21:23:32

by Kees Cook

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Tue, Jun 29, 2021 at 02:14:00PM +0100, Mark Rutland wrote:
> Hi Kees,
>
> On Mon, Jun 28, 2021 at 12:32:24PM -0700, Kees Cook wrote:
> > Hi Linus,
> >
> > Please pull these Clang feature updates for v5.14-rc1.
> >
> > Thanks!
> >
> > -Kees
> >
> > The following changes since commit d07f6ca923ea0927a1024dfccafc5b53b61cfecc:
> >
> > Linux 5.13-rc2 (2021-05-16 15:27:44 -0700)
> >
> > are available in the Git repository at:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/clang-features-v5.14-rc1
> >
> > for you to fetch changes up to 6a0544606ec7f03e4a2534c87ea989de4bac41ae:
> >
> > pgo: rectify comment to proper kernel-doc syntax (2021-06-28 12:10:31 -0700)
> >
> > ----------------------------------------------------------------
> > Clang feature updates for v5.14-rc1
> >
> > The big addition for this merge window is the core support for Clang's
> > Profile Guided Optimization, which lets Clang build the kernel for
> > improved performance when running specific kernel workloads. This
> > currently covers only vmlinux, but module support is under active
> > development. (Sami Tolvanen, Bill Wendling, Kees Cook, Jarmo Tiitto,
> > Lukas Bulwahn)
>
> I thought the PGO stuff was on hold given Peter had open concerns, e.g.
>
> https://lore.kernel.org/r/[email protected]
>
> ... and there didn't seem to be a strong conclusion to the contrary.

Hi! Whoops, I think you weren't CCed on the later threads over noinstr:
https://lore.kernel.org/lkml/[email protected]/

I understood that as the blocker for Peter from the earlier thread.

>
> > Added CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in
> > the face of the noinstr attribute, paving the way for PGO and fixing
> > GCOV. (Nick Desaulniers)
> >
> > x86_64 LTO coverage is expaned to 32-bit x86. (Nathan Chancellor)
> >
> > Small fixes to CFI. (Mark Rutland, Nathan Chancellor)
>
> FWIW, all the rest of this looks good to me.

Thanks!

-Kees

>
> Thanks,
> Mark.
>
> >
> > ----------------------------------------------------------------
> > Bill Wendling (1):
> > pgo: rename the raw profile file to vmlinux.profraw
> >
> > Jarmo Tiitto (2):
> > pgo: Limit allocate_node() to vmlinux sections
> > pgo: Fix sleep in atomic section in prf_open()
> >
> > Kees Cook (2):
> > MAINTAINERS: Expand and relocate PGO entry
> > pgo: Clean up prf_open() error paths
> >
> > Lukas Bulwahn (1):
> > pgo: rectify comment to proper kernel-doc syntax
> >
> > Mark Rutland (1):
> > CFI: Move function_nocfi() into compiler.h
> >
> > Nathan Chancellor (2):
> > MAINTAINERS: Add Clang CFI section
> > x86, lto: Enable Clang LTO for 32-bit as well
> >
> > Nick Desaulniers (3):
> > compiler_attributes.h: define __no_profile, add to noinstr
> > compiler_attributes.h: cleanups for GCC 4.9+
> > Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
> >
> > Sami Tolvanen (1):
> > pgo: Add Clang's Profile Guided Optimization infrastructure
> >
> > Documentation/dev-tools/index.rst | 1 +
> > Documentation/dev-tools/pgo.rst | 127 +++++++++++
> > MAINTAINERS | 25 ++
> > Makefile | 3 +
> > arch/Kconfig | 8 +
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/include/asm/compiler.h | 16 ++
> > arch/arm64/include/asm/memory.h | 16 --
> > arch/s390/Kconfig | 1 +
> > arch/x86/Kconfig | 6 +-
> > arch/x86/boot/Makefile | 1 +
> > arch/x86/boot/compressed/Makefile | 1 +
> > arch/x86/crypto/Makefile | 3 +
> > arch/x86/entry/vdso/Makefile | 1 +
> > arch/x86/kernel/Makefile | 3 +
> > arch/x86/kernel/vmlinux.lds.S | 2 +
> > arch/x86/platform/efi/Makefile | 1 +
> > arch/x86/purgatory/Makefile | 1 +
> > arch/x86/realmode/rm/Makefile | 1 +
> > arch/x86/um/vdso/Makefile | 1 +
> > drivers/firmware/efi/libstub/Makefile | 1 +
> > include/asm-generic/vmlinux.lds.h | 32 +++
> > include/linux/compiler.h | 10 +
> > include/linux/compiler_attributes.h | 19 +-
> > include/linux/compiler_types.h | 2 +-
> > include/linux/mm.h | 10 -
> > init/Kconfig | 3 +
> > kernel/Makefile | 1 +
> > kernel/gcov/Kconfig | 1 +
> > kernel/pgo/Kconfig | 37 +++
> > kernel/pgo/Makefile | 5 +
> > kernel/pgo/fs.c | 413 ++++++++++++++++++++++++++++++++++
> > kernel/pgo/instrument.c | 188 ++++++++++++++++
> > kernel/pgo/pgo.h | 211 +++++++++++++++++
> > scripts/Makefile.lib | 10 +
> > 35 files changed, 1130 insertions(+), 32 deletions(-)
> > create mode 100644 Documentation/dev-tools/pgo.rst
> > create mode 100644 kernel/pgo/Kconfig
> > create mode 100644 kernel/pgo/Makefile
> > create mode 100644 kernel/pgo/fs.c
> > create mode 100644 kernel/pgo/instrument.c
> > create mode 100644 kernel/pgo/pgo.h
> >
> > --
> > Kees Cook
>
> --
> You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/20210629131400.GA24514%40C02TD0UTHF1T.local.

--
Kees Cook

2021-06-29 21:25:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <[email protected]> wrote:
> >
> > And it causes the kernel to be bigger and run slower.
>
> Right -- that's expected. It's not designed to be the final kernel
> someone uses. :)

Well, from what I've seen, you actually want to run real loads in
production environments for PGO to actually be anything but a bogus
"performance benchmarks only" kind of thing.

Of course, "performance benchmarks only" is very traditional, and
we've seen that used over and over in the past in this industry. That
doesn't make it _right_, though.

And if you actually want to have it usable in production environments,
you really should strive to run code as closely as possible to a
production kernel too.

You'd want to run something that you can sample over time, and in
production, not something that you have to build a special kernels for
that then gets used for a benchmark run, but can't be kept in
production because it performs so much worse.

Real proper profiles will tell you what *really* matters - and if you
don't have enough samples to give you good information, then that
particular code clearly is not important enough to waste PGO on.

This is not all that dissimilar to using gprof information for
traditional - manual - optimizations.

Sure, instrumented gprof output is better than nothing, but it is
*hugely* worse than actual proper sampled profiles that actually show
what matters for performance (as opposed to what runs a lot - the two
are not necessarily all that closely correlated, with cache misses
being a thing).

And I really hate how pretty much all of the PGO support seems to be
just about this inferior method of getting the data.

Linus

2021-06-29 21:27:46

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Tue, Jun 29, 2021 at 6:14 AM Mark Rutland <[email protected]> wrote:
>
> Hi Kees,
>
> I thought the PGO stuff was on hold given Peter had open concerns, e.g.
>
> https://lore.kernel.org/r/[email protected]
>
> ... and there didn't seem to be a strong conclusion to the contrary.

Hi Mark,
If I could rephrase Peter's concerns in my own words to see if I
understood the intent correctly, I'd summarize the concerns as:
1. How does instrumentation act in regards to noinstr?

https://lore.kernel.org/linux-doc/[email protected]/
https://lore.kernel.org/lkml/YMcssV%[email protected]/

2. How much of this code can be reused with GCC?

https://lore.kernel.org/linux-doc/[email protected]/

3. Can we avoid proliferation of compiler specific code in the kernel?

https://lore.kernel.org/linux-doc/[email protected]/

---

Regarding point 1, I believe that was addressed by this series, which
Peter Ack'ed, and is based on work I did in LLVM based on Peter's
feedback, while collaborating with GCC developers on the semantics in
regards to inlining. I notice you weren't explicitly cc'ed on that
thread, that's my fault and I apologize. It wasn't intentional; once
a cc list as recommended by get_maintainer.pl gets too long, I start
to forget who was on previous threads and might be interested in
updates.

https://lore.kernel.org/lkml/[email protected]/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80223

---

Regarding point 2, I believe I addressed that in my response. Similar
to GCOV, we need the runtime hooks which are compiler specific in
order to capture the profiling data. Exporting such data to userspace
via sysfs can be easily shared though, as is done currently for GCOV.

https://lore.kernel.org/linux-doc/CAKwvOd=aAo72j-iE2PNE5Os8BPc0y-Zs7ZoMzd21ck+QNeboBA@mail.gmail.com/

---

Regarding point 3, I agree. There's currently 2 big places in the
kernel where we have very compiler specific code, IMO:
1. frame pointer based unwinding on 32b ARM (especially but not
limited to THUMB).
2. GCOV
This series does ask to add a third.

At the same time, there are differences between compilers that are
unlikely to converge without great need. Compiler IR is generally not
interchangeable between compilers; the compiler runtimes (ie. symbols
typically provided by libgcc_s or compiler-rt) are (generally) tightly
coupled to their respective compilers. Since PGO relies on the
respective compiler runtimes, we wind up with compiler specific
runtime support for this feature. For a semi-freestanding environment
like the Linux kernel, that means duplicating the ABI for these
compiler runtime libraries, with additional code for kernel specific
synchronization, memory management, and data retrieval (sysfs).

Further, asking compiler vendors to break their existing ABIs with
their compiler runtimes to support a shared interface for profiling
data is also a hard sell. That's a major issue regarding frame pointer
based unwinding on 32b ARM as well; existing unwinders must change to
support the latest spec, yet not all code will be recompiled to match
it as the same time the unwinder support is added or updated. Unless
the compiler runtime was statically linked, then upgrading that shared
object might break binaries when they are run next. I'm not saying
it's impossible, but is it worth it? Do the compiler vendors agree?
--
Thanks,
~Nick Desaulniers

2021-06-29 22:29:49

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Tue, Jun 29, 2021 at 2:04 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <[email protected]> wrote:
> > >
> > > And it causes the kernel to be bigger and run slower.
> >
> > Right -- that's expected. It's not designed to be the final kernel
> > someone uses. :)
>
> Well, from what I've seen, you actually want to run real loads in
> production environments for PGO to actually be anything but a bogus
> "performance benchmarks only" kind of thing.
>
> Of course, "performance benchmarks only" is very traditional, and
> we've seen that used over and over in the past in this industry. That
> doesn't make it _right_, though.

The current major use case is ensuring that production kernels have
been "trained" with specific workloads in mind.

> And if you actually want to have it usable in production environments,
> you really should strive to run code as closely as possible to a
> production kernel too.

You could do both. There is a line of research internally using
multiple training rounds ("CSPGO").

> You'd want to run something that you can sample over time, and in
> production, not something that you have to build a special kernels for
> that then gets used for a benchmark run, but can't be kept in
> production because it performs so much worse.
>
> Real proper profiles will tell you what *really* matters - and if you
> don't have enough samples to give you good information, then that
> particular code clearly is not important enough to waste PGO on.
>
> This is not all that dissimilar to using gprof information for
> traditional - manual - optimizations.
>
> Sure, instrumented gprof output is better than nothing, but it is
> *hugely* worse than actual proper sampled profiles that actually show
> what matters for performance (as opposed to what runs a lot - the two
> are not necessarily all that closely correlated, with cache misses
> being a thing).
>
> And I really hate how pretty much all of the PGO support seems to be
> just about this inferior method of getting the data.

Right now we're having trouble with hardware performance counters on
non-intel chips; I don't think we have working LBR equivalents on AMD
until zen3, and our ETM based samples on ARM are hung up on a few last
minute issues requiring new hardware (from multiple different chipset
vendors).

It would be good to have some form profile based optimizations that
aren't architecture or microarchitecture dependent.
--
Thanks,
~Nick Desaulniers

2021-06-29 22:33:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Tue, Jun 29, 2021 at 2:27 PM Nick Desaulniers
<[email protected]> wrote:
>
> Right now we're having trouble with hardware performance counters on
> non-intel chips; I don't think we have working LBR equivalents on AMD
> until zen3, and our ETM based samples on ARM are hung up on a few last
> minute issues requiring new hardware (from multiple different chipset
> vendors).

I agree that perf profiling works best on Intel. The AMD perf side
works ok in Zen 2 from what I've seen, but needs to be a full-system
profile ("perf record -a") to use the better options, and ARM is..

But with x86 ranging from "excellent" to "usable", and ARM hopefully
being at least close to getting better proper profile data, I really
think it's the way forward, with instrumentation being a band-aid at
best.

Linus

2021-07-02 12:49:01

by Bill Wendling

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Tue, Jun 29, 2021 at 2:04 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <[email protected]> wrote:
> > >
> > > And it causes the kernel to be bigger and run slower.
> >
> > Right -- that's expected. It's not designed to be the final kernel
> > someone uses. :)
>
> Well, from what I've seen, you actually want to run real loads in
> production environments for PGO to actually be anything but a bogus
> "performance benchmarks only" kind of thing.
>
The reason we use PGO in this way is because we _cannot_ release a
kernel into production that hasn't had PGO applied to it. The
performance of a non-PGO'ed kernel is a non-starter for rollout. We
try our best to replicate this environment for the benchmarks, which
is the only sane way to do this. I can't imagine that we're the only
ones who run up against this chicken-and-egg problem.

For why we don't use sampling, PGO gives a better performance boost
from an instrumented kernel rather than a sampled profile. I'll work
on getting statistics to show this.

-bw

> Of course, "performance benchmarks only" is very traditional, and
> we've seen that used over and over in the past in this industry. That
> doesn't make it _right_, though.
>
> And if you actually want to have it usable in production environments,
> you really should strive to run code as closely as possible to a
> production kernel too.
>
> You'd want to run something that you can sample over time, and in
> production, not something that you have to build a special kernels for
> that then gets used for a benchmark run, but can't be kept in
> production because it performs so much worse.
>
> Real proper profiles will tell you what *really* matters - and if you
> don't have enough samples to give you good information, then that
> particular code clearly is not important enough to waste PGO on.
>
> This is not all that dissimilar to using gprof information for
> traditional - manual - optimizations.
>
> Sure, instrumented gprof output is better than nothing, but it is
> *hugely* worse than actual proper sampled profiles that actually show
> what matters for performance (as opposed to what runs a lot - the two
> are not necessarily all that closely correlated, with cache misses
> being a thing).
>
> And I really hate how pretty much all of the PGO support seems to be
> just about this inferior method of getting the data.
>
> Linus

2021-07-02 13:18:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Fri, Jul 02, 2021 at 05:46:46AM -0700, Bill Wendling wrote:
> On Tue, Jun 29, 2021 at 2:04 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <[email protected]> wrote:
> > > >
> > > > And it causes the kernel to be bigger and run slower.
> > >
> > > Right -- that's expected. It's not designed to be the final kernel
> > > someone uses. :)
> >
> > Well, from what I've seen, you actually want to run real loads in
> > production environments for PGO to actually be anything but a bogus
> > "performance benchmarks only" kind of thing.
> >
> The reason we use PGO in this way is because we _cannot_ release a
> kernel into production that hasn't had PGO applied to it. The
> performance of a non-PGO'ed kernel is a non-starter for rollout. We
> try our best to replicate this environment for the benchmarks, which
> is the only sane way to do this. I can't imagine that we're the only
> ones who run up against this chicken-and-egg problem.
>
> For why we don't use sampling, PGO gives a better performance boost
> from an instrumented kernel rather than a sampled profile. I'll work
> on getting statistics to show this.

I've asked this before; *what* is missing from LBR samples that's
reponsible for the performance gap?

2021-07-02 18:26:35

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Fri, Jul 2, 2021 at 5:57 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jul 02, 2021 at 05:46:46AM -0700, Bill Wendling wrote:
> > On Tue, Jun 29, 2021 at 2:04 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <[email protected]> wrote:
> > > > >
> > > > > And it causes the kernel to be bigger and run slower.
> > > >
> > > > Right -- that's expected. It's not designed to be the final kernel
> > > > someone uses. :)
> > >
> > > Well, from what I've seen, you actually want to run real loads in
> > > production environments for PGO to actually be anything but a bogus
> > > "performance benchmarks only" kind of thing.
> > >
> > The reason we use PGO in this way is because we _cannot_ release a
> > kernel into production that hasn't had PGO applied to it. The
> > performance of a non-PGO'ed kernel is a non-starter for rollout. We
> > try our best to replicate this environment for the benchmarks, which
> > is the only sane way to do this. I can't imagine that we're the only
> > ones who run up against this chicken-and-egg problem.
> >
> > For why we don't use sampling, PGO gives a better performance boost
> > from an instrumented kernel rather than a sampled profile. I'll work
> > on getting statistics to show this.
>
> I've asked this before; *what* is missing from LBR samples that's
> reponsible for the performance gap?

Are we able to collect LBR samples from __init code? I can imagine
trying to launch perf from init/pid 1, but I suspect at that point
it's way too late.

Increasingly, boot times of hosts (and virtualized guests) are
becoming important to us, both in the datacenters and on mobile.
--
Thanks,
~Nick Desaulniers

2021-07-02 19:34:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Fri, Jul 02, 2021 at 10:26:40AM -0700, Nick Desaulniers wrote:
> On Fri, Jul 2, 2021 at 5:57 AM Peter Zijlstra <[email protected]> wrote:

> > I've asked this before; *what* is missing from LBR samples that's
> > reponsible for the performance gap?
>
> Are we able to collect LBR samples from __init code? I can imagine
> trying to launch perf from init/pid 1, but I suspect at that point
> it's way too late.
>
> Increasingly, boot times of hosts (and virtualized guests) are
> becoming important to us, both in the datacenters and on mobile.

For a guest, possibly, I've no idea how any of that virt crud works.

2021-07-02 20:14:28

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Fri, Jul 2, 2021 at 11:58 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jul 02, 2021 at 10:26:40AM -0700, Nick Desaulniers wrote:
> > On Fri, Jul 2, 2021 at 5:57 AM Peter Zijlstra <[email protected]> wrote:
>
> > > I've asked this before; *what* is missing from LBR samples that's
> > > reponsible for the performance gap?
> >
> > Are we able to collect LBR samples from __init code? I can imagine
> > trying to launch perf from init/pid 1, but I suspect at that point
> > it's way too late.
> >
> > Increasingly, boot times of hosts (and virtualized guests) are
> > becoming important to us, both in the datacenters and on mobile.
>
> For a guest, possibly, I've no idea how any of that virt crud works.

Oh, another issue we've hit with LBR internally as well is the ability
to access them in virtualized guests.

It looks like support for those did land upstream a few years ago
though, so I'm not sure whether it's a question of us not having
backported/cherry-picked those internally (I'm not on the team
responsible for our datacenter kernels) or if there's some sysadmin
related restrictions where we block access to these MSRs for guests.

Either way, I had an intern last year (2 years ago, perhaps? how long
has this pandemic been...) looking into profiling LLVM, and we hit the
restriction where interns were only given access to VMs in the cloud,
and yet in these VMs could not access LBR. "Guess we'll find
something else for you to work on then."

More of our developers are moving to cloud based VMs, where they have
access to larger build resources. Perhaps we just need to work with
our sysadmins to get whatever capability is necessary, at least in
some limited capacity.

Opening a perf report is painful. For a profile of a kernel build, we
get ~5GB of profile data with LBR, which takes a while to open (is
perf report single threaded?). This is much better than via dwarf
based call graphs, at which point the data becomes exceedingly painful
to work with. So LBR is a welcome improvement, when we have access to
it.
--
Thanks,
~Nick Desaulniers

2021-07-07 08:12:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1


* Nick Desaulniers <[email protected]> wrote:

> > And I really hate how pretty much all of the PGO support seems to be
> > just about this inferior method of getting the data.
>
> Right now we're having trouble with hardware performance counters on
> non-intel chips; I don't think we have working LBR equivalents on AMD
> until zen3, and our ETM based samples on ARM are hung up on a few last
> minute issues requiring new hardware (from multiple different chipset
> vendors).
>
> It would be good to have some form profile based optimizations that
> aren't architecture or microarchitecture dependent.

That doesn't excuse using an inferior tooling ABI design though. By your
own description proper hardware LBR support on the platforms you care most
about is either there or close - yet the whole Clang PGO feature is
designed around software based compiler instrumentation? That's backwards.

The right technical solution to integrate the clang-pgo software
instrumetnation would be to implement a minimal software-LBR PMU
functionality on top of the clang-pgo engine, and use unified perf tooling
to process the branch tracing/profiling information.

In the main PGO thread PeterZ made a couple of technical suggestions about
how this can be done using the existing hardware LBR interfaces of perf,
but we are flexible if the design is sane and are open to improvements.

I.e. try to commonalize the tooling data as soon as possible - not very
late as in the current proposal, exposing a whole stack of APIs and ABIs to
clang-pgo specific interfaces.

The "LBR data unification" approach has numerous short term and long term
advantages:

- Hardware assisted LBR tracing support out of the box on two major
hardware platforms (Power and x86), and on some ARM platforms "soon",
maybe sooner than this feature trickles down to distributions to begin
with.

- GCC won't have to reinvent the wheel - they only need to make sure they
can generate the minimal LBR data. In that sense perf is an
'independent' tooling facility they might be more comfortable working
with as well, than a 'competing' compiler project.

- There's even a chance that existing instrumentation can be reused - or a
relatively self-contained compiler plugin can generate it.

- Lower maintenance overhead, lower risk of subsystem obsolescence.

Binding this feature to clang-pgo on the ABI level is not a good move for
the Linux kernel IMO.

So until this is implemented properly, or adequate explanation is given why
I'm wrong:

NACKED-by: Ingo Molnar <[email protected]>

Both for the core kernel and x86 bits.

Please preserve this NAK and mention it prominently in future iterations of
this feature. Please Cc: me on future postings.

Thanks,

Ingo

2021-10-05 13:13:21

by Daniel Axtens

[permalink] [raw]
Subject: ARCH_WANTS_NO_INSTR (Re: [GIT PULL] Clang feature updates for v5.14-rc1)

Hi,

Apologies, I can't find the original email for this:

> Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR

which is now commit 51c2ee6d121c ("Kconfig: Introduce ARCH_WANTS_NO_INSTR and
CC_HAS_NO_PROFILE_FN_ATTR"). It doesn't seem to show up on Google, this was the
best I could find.

Anyway, the commit message reads:

Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR

We don't want compiler instrumentation to touch noinstr functions,
which are annotated with the no_profile_instrument_function function
attribute. Add a Kconfig test for this and make GCOV depend on it, and
in the future, PGO.

If an architecture is using noinstr, it should denote that via this
Kconfig value. That makes Kconfigs that depend on noinstr able to express
dependencies in an architecturally agnostic way.

However, things in generic code (such as rcu_nmi_enter) are tagged with
`noinstr`, so I'm worried that this commit subtly breaks things like KASAN on
platforms that haven't opted in yet. (I stumbled across this while developing
KASAN on ppc64, but at least riscv and ppc32 have KASAN but not
ARCH_WANTS_NO_INSTR already.)

As I said, I haven't been able to find the original thread - is there any reason
this shouldn't be always on? Why would an arch not opt in? What's the motivation
for ignoring the noinstr markings?

Should generic KASAN/KCSAN/anything else marked in noinstr also have markings
requring ARCH_WANTS_NO_INSTR? AFAICT they should, right?

Kind regards,
Daniel

2021-10-05 13:48:37

by Daniel Axtens

[permalink] [raw]
Subject: Re: ARCH_WANTS_NO_INSTR (Re: [GIT PULL] Clang feature updates for v5.14-rc1)

> Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
>
> We don't want compiler instrumentation to touch noinstr functions,
> which are annotated with the no_profile_instrument_function function
> attribute. Add a Kconfig test for this and make GCOV depend on it, and
> in the future, PGO.
>
> If an architecture is using noinstr, it should denote that via this
> Kconfig value. That makes Kconfigs that depend on noinstr able to express
> dependencies in an architecturally agnostic way.
>
> However, things in generic code (such as rcu_nmi_enter) are tagged with
> `noinstr`, so I'm worried that this commit subtly breaks things like KASAN on
> platforms that haven't opted in yet. (I stumbled across this while developing
> KASAN on ppc64, but at least riscv and ppc32 have KASAN but not
> ARCH_WANTS_NO_INSTR already.)

Hmm, so it looks like the commit doesn't affect how noinstr is compiled
(which means I have another different issue to contend with!), but...

> As I said, I haven't been able to find the original thread - is there any reason
> this shouldn't be always on? Why would an arch not opt in? What's the motivation
> for ignoring the noinstr markings?
>
> Should generic KASAN/KCSAN/anything else marked in noinstr also have markings
> requring ARCH_WANTS_NO_INSTR? AFAICT they should, right?

I'm still curious about all of these questions. I get
CC_HAS_NO_PROFILE_FN_ATTR, but I don't get ARCH_WANTS_NO_INSTR.

Kind regards,
Daniel

>
> Kind regards,
> Daniel

2021-10-05 14:35:40

by Mark Rutland

[permalink] [raw]
Subject: Re: ARCH_WANTS_NO_INSTR (Re: [GIT PULL] Clang feature updates for v5.14-rc1)

On Wed, Oct 06, 2021 at 12:10:15AM +1100, Daniel Axtens wrote:
> Hi,

Hi Daniel,

> Apologies, I can't find the original email for this:
>
> > Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
>
> which is now commit 51c2ee6d121c ("Kconfig: Introduce ARCH_WANTS_NO_INSTR and
> CC_HAS_NO_PROFILE_FN_ATTR"). It doesn't seem to show up on Google, this was the
> best I could find.

Unless I've misunderstood, the commit title was rewritten when the patch
was applied, from the third link in commit 51c2ee6d121c. For reference,
those three links are:

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/YMcssV%[email protected]/
Link: https://lore.kernel.org/r/[email protected]

> Anyway, the commit message reads:
>
> Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
>
> We don't want compiler instrumentation to touch noinstr functions,
> which are annotated with the no_profile_instrument_function function
> attribute. Add a Kconfig test for this and make GCOV depend on it, and
> in the future, PGO.
>
> If an architecture is using noinstr, it should denote that via this
> Kconfig value. That makes Kconfigs that depend on noinstr able to express
> dependencies in an architecturally agnostic way.
>
> However, things in generic code (such as rcu_nmi_enter) are tagged with
> `noinstr`, so I'm worried that this commit subtly breaks things like KASAN on
> platforms that haven't opted in yet. (I stumbled across this while developing
> KASAN on ppc64, but at least riscv and ppc32 have KASAN but not
> ARCH_WANTS_NO_INSTR already.)
>
> As I said, I haven't been able to find the original thread - is there any reason
> this shouldn't be always on? Why would an arch not opt in? What's the motivation
> for ignoring the noinstr markings?

IIRC the thinking was that architectures which have their entry logic in
asm could/might avoid the problematic instrumentation by construction,
and we didn't want to break functionality for those.

As you say, if that asm has to call code which can't safely be
instrumented, that's equally broken, and that might have been
wrong-headed.

> Should generic KASAN/KCSAN/anything else marked in noinstr also have markings
> requring ARCH_WANTS_NO_INSTR? AFAICT they should, right?

I suspect so, if we could otherwise get unexpected or unsafe recursion
between instrumentation.

Thanks,
Mark.

2021-10-07 06:24:12

by Jarmo Tiitto

[permalink] [raw]
Subject: Re: ARCH_WANTS_NO_INSTR (Re: [GIT PULL] Clang feature updates for v5.14-rc1)

Mark Rutland wrote tiistaina 5. lokakuuta 2021 17.30.03 EEST:
> On Wed, Oct 06, 2021 at 12:10:15AM +1100, Daniel Axtens wrote:
> > Hi,
>
> Hi Daniel,
>
> > Apologies, I can't find the original email for this:
> > > Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
> >
> > which is now commit 51c2ee6d121c ("Kconfig: Introduce ARCH_WANTS_NO_INSTR
> > and
> > CC_HAS_NO_PROFILE_FN_ATTR"). It doesn't seem to show up on Google, this
was
> > the best I could find.
>
> Unless I've misunderstood, the commit title was rewritten when the patch
> was applied, from the third link in commit 51c2ee6d121c. For reference,
> those three links are:
>
> Link:
> https://lore.kernel.org/lkml/[email protected]
> / Link:
> https://lore.kernel.org/lkml/YMcssV%[email protected]
> et/ Link:
> https://lore.kernel.org/r/[email protected]

Hello, Kees and others cc'd !
I got above mail, and went through an rabbit hole of lkml messages since I was
involved with the clang-pgo feature.

I'll like to know what is the current situation about GCOV and PGO?

I saw that for-next/clang/pgo had some new interesting patches applied.
Would it be good time now to continue make instrumented kernel?

Background:
I essentially stopped my work at the point where Peter Z noted -fprofile-
generate breaks the kernel+gcov and noinstr needs to be fixed.

My situation here is that I have very old non-public PGO hacks that date back
to v4.11 - v4.19 era using GCOV subsystem and now with the newer clang-pgo
patches that are in usable state.

These previous attempts all broke apart because of the noinstr not doing it's
job with -fprofile-generate: the compiler could generate a call to gcov/pgo
profiler hook in wrong place (in interrupt context, If I remember) and the
kernel was doomed.

One thing has not changed over the years: I still don't have a single CPU that
has hardware PMU capable of LBR and generating AutoFDO profiles. :(

So I have written code/hacks now for two subsystems to gain profile data for
PGO. In the end, I don't care from what instrumented kernel pipes I have to
pull the data out, and what format it is in, as long as the compiler accepts
it. :-P
PS: gcov-pgo had waayy too many pipes for doing just pgo. /s

Well, that was my past on this PGO topic.

Thanks all,
-Jarmo Tiitto


2022-11-01 00:20:19

by Bill Wendling

[permalink] [raw]
Subject: Re: [GIT PULL] Clang feature updates for v5.14-rc1

On Fri, Jul 2, 2021 at 5:57 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jul 02, 2021 at 05:46:46AM -0700, Bill Wendling wrote:
> > On Tue, Jun 29, 2021 at 2:04 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <[email protected]> wrote:
> > > > >
> > > > > And it causes the kernel to be bigger and run slower.
> > > >
> > > > Right -- that's expected. It's not designed to be the final kernel
> > > > someone uses. :)
> > >
> > > Well, from what I've seen, you actually want to run real loads in
> > > production environments for PGO to actually be anything but a bogus
> > > "performance benchmarks only" kind of thing.
> > >
> > The reason we use PGO in this way is because we _cannot_ release a
> > kernel into production that hasn't had PGO applied to it. The
> > performance of a non-PGO'ed kernel is a non-starter for rollout. We
> > try our best to replicate this environment for the benchmarks, which
> > is the only sane way to do this. I can't imagine that we're the only
> > ones who run up against this chicken-and-egg problem.
> >
> > For why we don't use sampling, PGO gives a better performance boost
> > from an instrumented kernel rather than a sampled profile. I'll work
> > on getting statistics to show this.
>
> I've asked this before; *what* is missing from LBR samples that's
> reponsible for the performance gap?

For one, it lacks information on function call frequencies, which help
inlining. It's also much more coarse-grained than a perf trace. And
while a section of code that doesn't show up in a trace sample may not
be executed much, changes to it may have cascading effects.

Ingo mentioned had some ideas on minimal software LBR PMU
functionality. Do you have a link to this discussion?

"The right technical solution to integrate the clang-pgo software
instrumetnation would be to implement a minimal software-LBR PMU
functionality on top of the clang-pgo engine, and use unified perf tooling
to process the branch tracing/profiling information.

"In the main PGO thread PeterZ made a couple of technical suggestions about
how this can be done using the existing hardware LBR interfaces of perf,
but we are flexible if the design is sane and are open to improvements."

-bw