2023-11-15 15:09:12

by Mathieu Desnoyers

Subject: Summary of discussion following LPC2023 sframe talk

Hi,

[ With lkml and diamon-discuss in CC ]

I'm adding the following notes of the hallway track discussion we had
immediately after the sframe slot within the tracing MC [1]. I suspect it
is relevant (please correct me if I'm wrong or if there are conclusions
that are too early to tell):

- Handling of shared libraries:
- the libc dynamic loader should register/unregister sframe sections
explicitly with new prctl(2) options,
- The prctl() for registration of the sframe sections can take the
section address and size as arguments,
- The prctl for unregistration could take the section address as argument,
but this would require additional data in the linker map (within libc),
which is unwanted.
- One alternative would be to provide an additional piece of information to
sframe registration/unregistration: a key which is decided by the libc
to match registration/unregistration. That key could be either the
address of the text section associated with the sframe section, or it
could be the address of the linker map entry (at the choice of userspace).
- Overall, the prctl(2) sframe register could have the following parameters:
{ key, sframe address, sframe section length }
- The prctl(2) sframe unregister would then take a { key } as parameter.
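
To make the keyed register/unregister pairing above concrete, here is a
minimal userspace sketch. It only models the bookkeeping: no such prctl(2)
option exists today, and the names sframe_register(), sframe_unregister(),
struct sframe_reg and MAX_SFRAME_REGS are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record mirroring { key, sframe address, sframe section length }. */
struct sframe_reg {
	uintptr_t key;         /* chosen by libc: text address or link map entry */
	uintptr_t sframe_addr; /* start of the sframe section */
	size_t    sframe_len;  /* section length in bytes */
	int       in_use;
};

#define MAX_SFRAME_REGS 64 /* arbitrary table size for this sketch */
static struct sframe_reg regs[MAX_SFRAME_REGS];

/* Stand-in for the register operation. Returns 0 on success, -1 on a
 * duplicate key, an empty or wrapping range, or a full table. */
int sframe_register(uintptr_t key, uintptr_t addr, size_t len)
{
	struct sframe_reg *slot = NULL;

	if (len == 0 || addr + len < addr)
		return -1;
	for (size_t i = 0; i < MAX_SFRAME_REGS; i++) {
		if (regs[i].in_use && regs[i].key == key)
			return -1;
		if (!regs[i].in_use && !slot)
			slot = &regs[i];
	}
	if (!slot)
		return -1;
	*slot = (struct sframe_reg){ key, addr, len, 1 };
	return 0;
}

/* Stand-in for the unregister operation: the key alone identifies the entry. */
int sframe_unregister(uintptr_t key)
{
	for (size_t i = 0; i < MAX_SFRAME_REGS; i++) {
		if (regs[i].in_use && regs[i].key == key) {
			regs[i].in_use = 0;
			return 0;
		}
	}
	return -1;
}
```

The point of the key is visible in sframe_unregister(): the caller does not
need to remember the sframe section address, only whatever value it picked
at registration time.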

- The kernel backtrace code using the sframe information should consider
it hostile:
- can be corrupted by the application (by accident or maliciously),
- can be corrupted on disk by modification of the ELF binary, either
before registration or after (either by accident or maliciously),
- can be malformed to contain loops (need to find a way to have upper
bounds, sanity checks about the direction of the stack traversal),

- It was discussed that the kernel could possibly validate checksums on
registration and write-protect the sframe pages. Considering that the
kernel still needs to consider the content hostile even with those
mechanisms in place, it is unclear whether they are relevant.
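
As a rough illustration of the "upper bounds and direction" sanity checks
mentioned above, the following sketch walks a simulated chain of frame
records. It is not sframe decoding and not kernel code: the stack is a plain
array where each slot holds the index of the caller's record, and
walk_stack() and MAX_FRAMES are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 128 /* hard cap on the walk; the value is arbitrary */

/* stack[i] holds the index of the caller's frame record. Real code would
 * read user memory with fault handling instead of indexing an array.
 * Returns the number of frames visited. */
size_t walk_stack(const size_t *stack, size_t nr_words, size_t start)
{
	size_t cur = start, depth = 0;

	while (depth < MAX_FRAMES && cur < nr_words) {
		size_t next = stack[cur];

		depth++;
		/* With a downward-growing stack, callers live at strictly
		 * higher addresses. Anything else is a loop or corruption,
		 * so stop rather than trust it. */
		if (next <= cur)
			break;
		cur = next;
	}
	return depth;
}
```

The two checks are independent: the monotonic test catches loops, and the
frame cap bounds the work even when the data is consistent but absurdly long.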

- Mark Rutland told me that for aarch64 the current sframe content is
not sufficient to express how to walk the stack over code area at
the beginning of functions before the stack pointer is updated.
He plans to discuss this with Indu as a follow up.

- Interpreters:

- Walking over an interpreter's own stack can be as simple as skipping
over the interpreter's runtime functions. This is a first step to
allow skipping over interpreters without detailed information about
their own stack layout.

- JITs:

- There are two approaches to skip over JITted code stacks:

- If the jitted code has frame pointers, then use this.

- If we figure out that some JITs do not have frame pointers, then
we would need to design a new kernel ABI that would allow JITs
to express sframe-alike information. This will need to be designed
with the input of JIT communities because some of them are likely
not psABI compliant (e.g. lua has a separate stack).

- When we have a good understanding of the JIT requirements in terms
of frame description content, the other element that would need to
be solved is how to allow JITs to emit frame data in a data structure
that can expand. We may need something like a reserved memory area, with
a counter of the number of elements which is used to synchronize communication
between the JITs (producer) and kernel (consumer).

- We would need to figure out if JITs expect to have a single producer per
frame description area, or multiple producers.

- We would need to figure out if JITs expect to append frame descriptions in
sorted function address order (append only for frame description, append only
for functions text section as well), or if there needs to be support for unsorted
function entries.

- We would need information about how JITs reclaim functions, and how it impacts
the frame description ABI. For instance, we may want to have a tombstone bit to
state that a frame was deleted.

- We may have to create frame description areas whose content is specific to given
JITs. For instance, the frame descriptions for a lua JIT on x86-64 may not follow
the x86-64 regular psABI.

- As an initial stage, we can focus on handling the sframe section for executable
and shared objects, and use frame pointers to skip over JITted code if available.
The goal here is to show the usefulness of this kind of information so we get
the interest/collaboration needed to get the relevant input from JIT communities
as we design the proper ABI for handling JIT frames.
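
To give the sorted-order and tombstone ideas above something to point at,
here is one possible shape for a JIT frame description table. Everything in
it is an assumption made for illustration (struct jit_frame_desc,
JIT_FDE_TOMBSTONE, jit_lookup(), the field layout); none of it is an
existing or agreed ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JIT_FDE_TOMBSTONE 0x1u /* hypothetical "function was reclaimed" bit */

/* Hypothetical per-function entry; a real one would carry unwind rules. */
struct jit_frame_desc {
	uint64_t start; /* function start address */
	uint32_t len;   /* function length in bytes */
	uint32_t flags;
};

/* Binary search over entries sorted by start address, assuming they do
 * not overlap. Returns NULL when ip is not covered or the covering
 * entry has been tombstoned. */
const struct jit_frame_desc *
jit_lookup(const struct jit_frame_desc *tab, size_t n, uint64_t ip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ip < tab[mid].start)
			hi = mid;
		else if (ip - tab[mid].start >= tab[mid].len)
			lo = mid + 1;
		else
			return (tab[mid].flags & JIT_FDE_TOMBSTONE) ? NULL : &tab[mid];
	}
	return NULL;
}
```

If JITs cannot guarantee sorted appends, the lookup would have to become a
linear scan or the table would need an index, which is part of why the
ordering question matters.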

Thanks,

Mathieu

[1] https://lpc.events/event/17/contributions/1467/ (for abstract/slides)


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2023-11-15 15:50:01

by Peter Zijlstra

Subject: Re: Summary of discussion following LPC2023 sframe talk

On Wed, Nov 15, 2023 at 10:09:16AM -0500, Mathieu Desnoyers wrote:
> Hi,
>
> [ With lkml and diamon-discuss in CC ]
>
> I'm adding the following notes of the hallway track discussion we had
> immediately after the sframe slot within the tracing MC [1]. I suspect it
> is relevant (please correct me if I'm wrong or if there are conclusions
> that are too early to tell):
>
> - Handling of shared libraries:
> - the libc dynamic loader should register/unregister sframe sections
> explicitly with new prctl(2) options,
> - The prctl() for registration of the sframe sections can take the
> section address and size as arguments,
> - The prctl for unregistration could take the section address as argument,
> but this would require additional data in the linker map (within libc),
> which is unwanted.
> - One alternative would be to provide an additional piece of information to
> sframe registration/unregistration: a key which is decided by the libc
> to match registration/unregistration. That key could be either the
> address of the text section associated with the sframe section, or it
> could be the address of the linker map entry (at the choice of userspace).
> - Overall, the prctl(2) sframe register could have the following parameters:
> { key, sframe address, sframe section length }
> - The prctl(2) sframe unregister would then take a { key } as parameter.
>
> - The kernel backtrace code using the sframe information should consider
> it hostile:
> - can be corrupted by the application (by accident or maliciously),
> - can be corrupted on disk by modification of the ELF binary, either
> before registration or after (either by accident or maliciously),
> - can be malformed to contain loops (need to find a way to have upper
> bounds, sanity checks about the direction of the stack traversal),
>
> - It was discussed that the kernel could possibly validate checksums on
> registration and write-protect the sframe pages. Considering that the
> kernel still needs to consider the content hostile even with those
> mechanisms in place, it is unclear whether they are relevant.
>
> - Mark Rutland told me that for aarch64 the current sframe content is
> not sufficient to express how to walk the stack over code area at
> the beginning of functions before the stack pointer is updated.
> He plans to discuss this with Indu as a follow up.
>
> - Interpreters:
>
> - Walking over an interpreter's own stack can be as simple as skipping
> over the interpreter's runtime functions. This is a first step to
> allow skipping over interpreters without detailed information about
> their own stack layout.

Profiling interpreters is typically done using SIGnals. Perf is capable
of generating signals on overflow. This is slow, but so is an
interpreter. SIGhandler is part of the interpreter and can interpret
the interpreter state and do whatever it damn well pleases.

A stack-machine based interpreter will not have anything but the main
loop on the actual function call stack. Unwinding it using the 'C'
unwinder will yield nothing useful.

> - JITs:
>
> - There are two approaches to skip over JITted code stacks:
>
> - If the jitted code has frame pointers, then use this.
>
> - If we figure out that some JITs do not have frame pointers, then
> we would need to design a new kernel ABI that would allow JITs
> to express sframe-alike information. This will need to be designed
> with the input of JIT communities because some of them are likely
> not psABI compliant (e.g. lua has a separate stack).

Why a new interface? They can use the same prctl() as above. Here text,
there sframe.

> - When we have a good understanding of the JIT requirements in terms
> of frame description content, the other element that would need to
> be solved is how to allow JITs to emit frame data in a data structure
> that can expand. We may need something like a reserved memory area, with
> a counter of the number of elements which is used to synchronize communication
> between the JITs (producer) and kernel (consumer).

Again, huh?! Expand? Typical JIT has the normal epoch like approach to
text generation, have N>1 text windows, JIT into one until full, once
full, copy all still active crap into second window, induce grace period
and wipe first window, rinse-repeat.

Just have a sframe thing per window and expand the definition of 'full'
to be either text or sframe window is full and everything should just
work, no?
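
A small sketch of that widened definition of 'full', with one sframe area
paired with each text window. The struct and function names (jit_window,
window_full(), window_emit()) are invented here and the sizes are plain
byte counts; a real JIT would track this however it already tracks its
text windows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pairing of a JIT text window with its sframe area. */
struct jit_window {
	size_t text_cap, text_used;
	size_t sframe_cap, sframe_used;
};

/* 'Full' means either the text or the sframe space cannot take the
 * next function. */
int window_full(const struct jit_window *w, size_t text_need, size_t sframe_need)
{
	return text_need > w->text_cap - w->text_used ||
	       sframe_need > w->sframe_cap - w->sframe_used;
}

/* Account for one emitted function. Returns -1 when the caller should
 * switch to the next window (and run its usual epoch handling). */
int window_emit(struct jit_window *w, size_t text_need, size_t sframe_need)
{
	if (window_full(w, text_need, sframe_need))
		return -1;
	w->text_used += text_need;
	w->sframe_used += sframe_need;
	return 0;
}
```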

>
> - We would need to figure out if JITs expect to have a single producer per
> frame description area, or multiple producers.

I've not really kept up and only ever seen single threaded JITs, but I
would imagine they each get their own window.

> - We would need to figure out if JITs expect to append frame descriptions in
> sorted function address order (append only for frame description, append only
> for functions text section as well), or if there needs to be support for unsorted
> function entries.

So from what I know they typically do the sorted thing, easier for their
own accounting too.

Note that there is this JAVA JIT text symbol userspace API thing that
tracks symbols. Perf-tool implements that IIRC. Writes it out to a file
which is then read and munged back into the report or so. IIRC this also
includes information on reclaim.

> - We would need information about how JITs reclaim functions, and how it impacts
> the frame description ABI. For instance, we may want to have a tombstone bit to
> state that a frame was deleted.

prctl() unregister + register ? I mean, JIT would need to be fully
co-operative anyway.

> - We may have to create frame description areas whose content is specific to given
> JITs. For instance, the frame descriptions for a lua JIT on x86-64 may not follow
> the x86-64 regular psABI.
>
> - As an initial stage, we can focus on handling the sframe section for executable
> and shared objects, and use frame pointers to skip over JITted code if available.
> The goal here is to show the usefulness of this kind of information so we get
> the interest/collaboration needed to get the relevant input from JIT communities
> as we design the proper ABI for handling JIT frames.

As per: https://realpython.com/python312-perf-profiler/

There is some 'demand' for all this, might be useful to contact some JIT
authors and have them detail their needs or something.

2023-12-01 19:09:16

by Josh Poimboeuf

Subject: Re: Summary of discussion following LPC2023 sframe talk

On Wed, Nov 15, 2023 at 04:49:12PM +0100, Peter Zijlstra wrote:
> > - JITs:
> >
> > - There are two approaches to skip over JITted code stacks:
> >
> > - If the jitted code has frame pointers, then use this.
> >
> > - If we figure out that some JITs do not have frame pointers, then
> > we would need to design a new kernel ABI that would allow JITs
> > to express sframe-alike information. This will need to be designed
> > with the input of JIT communities because some of them are likely
> > not psABI compliant (e.g. lua has a separate stack).
>
> Why a new interface? They can use the same prctl() as above. Here text,
> there sframe.

Our thinking was that a syscall defeats the point of "just in time".

> > - When we have a good understanding of the JIT requirements in terms
> > of frame description content, the other element that would need to
> > be solved is how to allow JITs to emit frame data in a data structure
> > that can expand. We may need something like a reserved memory area, with
> > a counter of the number of elements which is used to synchronize communication
> > between the JITs (producer) and kernel (consumer).
>
> Again, huh?! Expand? Typical JIT has the normal epoch like approach to
> text generation, have N>1 text windows, JIT into one until full, once
> full, copy all still active crap into second window, induce grace period
> and wipe first window, rinse-repeat.
>
> Just have a sframe thing per window and expand the definition of 'full'
> to be either text or sframe window is full and everything should just
> work, no?

Yes, assuming the JIT doesn't mind doing a syscall.

> > - We may have to create frame description areas whose content is specific to given
> > JITs. For instance, the frame descriptions for a lua JIT on x86-64 may not follow
> > the x86-64 regular psABI.
> >
> > - As an initial stage, we can focus on handling the sframe section for executable
> > and shared objects, and use frame pointers to skip over JITted code if available.
> > The goal here is to show the usefulness of this kind of information so we get
> > the interest/collaboration needed to get the relevant input from JIT communities
> > as we design the proper ABI for handling JIT frames.
>
> As per: https://realpython.com/python312-perf-profiler/
>
> There is some 'demand' for all this, might be useful to contact some JIT
> authors and have them detail their needs or something.

If there's no sframe then we can just fall back to frame pointers like
ORC does. And that article recommends enabling frame pointers for
python.

Between that and prctl(), it may be enough.

--
Josh

2023-12-01 19:47:42

by Mathieu Desnoyers

Subject: Re: Summary of discussion following LPC2023 sframe talk

On 2023-11-15 10:49, Peter Zijlstra wrote:
> On Wed, Nov 15, 2023 at 10:09:16AM -0500, Mathieu Desnoyers wrote:
[...]
>
>> - When we have a good understanding of the JIT requirements in terms
>> of frame description content, the other element that would need to
>> be solved is how to allow JITs to emit frame data in a data structure
>> that can expand. We may need something like a reserved memory area, with
>> a counter of the number of elements which is used to synchronize communication
>> between the JITs (producer) and kernel (consumer).
>
> Again, huh?! Expand? Typical JIT has the normal epoch like approach to
> text generation, have N>1 text windows, JIT into one until full, once
> full, copy all still active crap into second window, induce grace period
> and wipe first window, rinse-repeat.
>
> Just have a sframe thing per window and expand the definition of 'full'
> to be either text or sframe window is full and everything should just
> work, no?

Is the generated text reachable (for execution) before the end of the
window during which it was created, or is there some kind of epoch delay
between text generation and the moment where it becomes reachable ?

If there is a delay between code generation and the moment where it
becomes reachable (e.g. a whole epoch), then I understand your point
that we could consider the whole jitted text window as belonging to a
single sframe section and register it in one go to the kernel. The
overhead of the system call would be amortized over the epoch duration.

However, if JITs are allowed to incrementally add text to the current
window and make it immediately reachable, then we need to have some
way to synchronize appending "sframe functions" into a memory mapping
that does not require issuing a system call every time a new function
is jitted.
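
For the incremental case, one possible shape is an append-only area with a
published-entry counter, sketched below with C11 atomics. This assumes a
single producer, and it models the consumer as ordinary code in the same
address space; the kernel would instead read the counter and entries from
user memory and still treat them as untrusted. All names (frame_area,
frame_entry, area_append(), area_snapshot(), AREA_CAP) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define AREA_CAP 1024 /* arbitrary capacity for this sketch */

/* Hypothetical per-function entry. */
struct frame_entry {
	uint64_t start;
	uint32_t len;
	uint32_t cfa_off;
};

struct frame_area {
	_Atomic uint32_t count; /* number of fully written entries */
	struct frame_entry entries[AREA_CAP];
};

/* Producer (JIT), single writer: fill the slot first, then publish it
 * by bumping the counter with release ordering. No system call. */
int area_append(struct frame_area *a, struct frame_entry e)
{
	uint32_t n = atomic_load_explicit(&a->count, memory_order_relaxed);

	if (n >= AREA_CAP)
		return -1;
	a->entries[n] = e;
	atomic_store_explicit(&a->count, n + 1, memory_order_release);
	return 0;
}

/* Consumer: entries [0, returned count) are safe to read after this
 * acquire load. */
uint32_t area_snapshot(struct frame_area *a)
{
	return atomic_load_explicit(&a->count, memory_order_acquire);
}
```

With multiple producers the counter update would need a reservation step
(e.g. a compare-and-swap plus a separate "committed" marker), which is one
reason the single-vs-multiple producer question matters.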

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com