2024-03-15 16:27:28

by Steven Rostedt

[permalink] [raw]
Subject: [GIT PULL] tracing: Updates for v6.9


Linus,

[
Note, the last fixes you pulled from me needed to change the
ring_buffer_wait() function that added two new parameters to it.
This branch added a new instance of ring_buffer_wait() but does not have
the two new parameters. I merged my urgent branch to this one and added
the two parameters (both being NULL) in the merge commit (with the
proper change log) so that the merge commit will still compile.

I also found a better way to handle the new ring_buffer_wait() default
code (with the two NULL parameters) and added that change on top.

This is the first time I'm requesting a pull request that contains a merge.
I'm hoping that my scripts created a proper diffstat for you.

I also have two commits that I'm not including in this pull request
because they add checks to the usage of __string() and __assign_str()
and will fail the build if they are not done properly. This found three
existing bugs. One fix is included in this pull request, but there's two
more. One fix has already been pulled by you. The other I'm waiting to
see if I can pull it because my checks will not compile without it.

Those two other fixes are:
https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/
]

Tracing updates for 6.9:

Main user visible change:

- Add ring_buffer memory mappings

The tracing ring buffer was created based on being mostly used with the
splice system call. It is broken up into page ordered sub-buffers and the
reader swaps a new sub-buffer with an existing sub-buffer that's part
of the write buffer. It then has total access to the swapped out
sub-buffer and can do copyless movements of the memory into other mediums
(file system, network, etc).

The buffer is great for passing around the ring buffer contents in the
kernel, but is not so good for when the consumer is the user space task
itself.

A new interface is added that allows user space to memory map the ring
buffer. It will get all the write sub-buffers as well as reader sub-buffer
(that is not written to). It can send an ioctl to change which sub-buffer
is the new reader sub-buffer.

The ring buffer is read only to user space. It only needs to call the
ioctl when it is finished with a sub-buffer and needs a new sub-buffer
that the writer will not write over.

A self test program was also created for testing and can be used as
an example for the interface to user space. The libtracefs (external
to the kernel) also has code that interacts with this, although it is
disabled until the interface is in a official release. It can be enabled
by compiling the library with a special flag. This was used for testing
applications that perform better with the buffer being mapped.

Memory mapped buffers have limitations. The main one is that it can not be
used with the snapshot logic. If the buffer is mapped, snapshots will be
disabled. If any logic is set to trigger snapshots on a buffer, that
buffer will not be allowed to be mapped.

- User events can now have "multi formats"

The current user events have a single format. If another event is created
with a different format, it will fail to be created. That is, once an
event name is used, it cannot be used again with a different format. This
can cause issues if a library is using an event and updates its format.
An application using the older format will prevent an application using
the new library from registering its event.

A task could also DOS another application if it knows the event names, and
it creates events with different formats.

The multi-format event is in a different name space from the single
format. Both the event name and its format are the unique identifier.
This will allow two different applications to use the same user event name
but with different payloads.

- Added support to have ftrace_dump_on_oops dump out instances and
not just the main top level tracing buffer.

Other changes:

- Add eventfs_root_inode

Only the root inode has a dentry that is static (never goes away) and
stores it upon creation. There's no reason that the thousands of other
eventfs inodes should have a pointer that never gets set in its
descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode
descriptor and a dentry pointer, and only the root inode will use this.

- Added WARN_ON()s in eventfs

There's some conditionals remaining in eventfs that should never be hit,
but instead of removing them, add WARN_ON() around them to make sure that
they are never hit.

- Have saved_cmdlines allocation also include the map_cmdline_to_pid array

The saved_cmdlines structure allocates a large amount of data to hold its
mappings. Within it, it has three arrays. Two are already apart of it:
map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by
also including the map_cmdline_to_pid[] array as well.

- Restructure __string() and __assign_str() macros used in TRACE_EVENT().

Dynamic strings in TRACE_EVENT() are declared with:

__string(name, source)

And assigned with:

__assign_str(name, source)

In the tracepoint callback of the event, the __string() is used to get the
size needed to allocate on the ring buffer and __assign_str() is used to
copy the string into the ring buffer. There's a helper structure that is
created in the TRACE_EVENT() macro logic that will hold the string length
and its position in the ring buffer which is created by __string().

There are several trace events that have a function to create the string
to save. This function is executed twice. Once for __string() and again
for __assign_str(). There's no reason for this. The helper structure could
also save the string it used in __string() and simply copy that into
__assign_str() (it also already has its length).

By using the structure to store the source string for the assignment, it
means that the second argument to __assign_str() is no longer needed.

It will be removed in the next merge window, but for now add a warning if
the source string given to __string() is different than the source string
given to __assign_str(), as the source to __assign_str() isn't even used
and will be going away.

- Other minor clean ups and fixes


Please pull the latest trace-v6.9 tree, which can be found at:


git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace-v6.9

Tag SHA1: 37235a011c5cae307bc758909eeb9deb09523b16
Head SHA1: 2fd814ad5713b6069912c4f1662fbc74de2c4741


Beau Belgrave (4):
tracing/user_events: Prepare find/delete for same name events
tracing/user_events: Introduce multi-format events
selftests/user_events: Test multi-format events
tracing/user_events: Document multi-format flag

Chengming Zhou (1):
tracefs: Remove SLAB_MEM_SPREAD flag usage

Huang Yiwei (1):
tracing: Support to dump instance traces by ftrace_dump_on_oops

John Garry (1):
tracing: Use init_utsname()->release

Randy Dunlap (1):
ftrace: Fix most kernel-doc warnings

Steven Rostedt (Google) (22):
eventfs: Add WARN_ON_ONCE() to checks in eventfs_root_lookup()
eventfs: Create eventfs_root_inode to store dentry
tracing: Have saved_cmdlines arrays all in one allocation
tracing: Move open coded processing of tgid_map into helper function
tracing: Move saved_cmdline code into trace_sched_switch.c
NFSD: Fix nfsd_clid_class use of __string_len() macro
drm/i915: Add missing ; to __assign_str() macros in tracepoint code
tracing: Rework __assign_str() and __string() to not duplicate getting the string
tracing: Do not calculate strlen() twice for __string() fields
tracing: Use ? : shortcut in trace macros
tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"
tracing: Fix snapshot counter going between two tracers that use it
tracing: Decrement the snapshot if the snapshot trigger fails to register
tracing: Remove __assign_str_len()
tracing: Add __string_len() example
tracing: Add warning if string in __assign_str() does not match __string()
tracing: Remove second parameter to __assign_rel_str()
tracepoints: Use WARN() and not WARN_ON() for warnings
ring-buffer: Have mmapped ring buffer keep track of missed events
net: hns3: tracing: fix hclgevf trace event strings
tracing: merge trace/urgent into trace/core
ring-buffer: Make wake once of ring_buffer_wait() more robust

Thorsten Blum (1):
tracing: Use div64_u64() instead of do_div()

Vincent Donnefort (6):
ring-buffer: Zero ring-buffer sub-buffers
ring-buffer: Introducing ring-buffer mapping functions
tracing: Add snapshot refcount
tracing: Allow user-space mapping of the ring-buffer
Documentation: tracing: Add ring-buffer mapping
ring-buffer/selftest: Add ring-buffer mapping test

linke li (1):
ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment

----
Documentation/admin-guide/kernel-parameters.txt | 26 +-
Documentation/admin-guide/sysctl/kernel.rst | 30 +-
Documentation/trace/index.rst | 1 +
Documentation/trace/ring-buffer-map.rst | 106 +++
Documentation/trace/user_events.rst | 27 +-
drivers/gpu/drm/i915/display/intel_display_trace.h | 6 +-
.../ethernet/hisilicon/hns3/hns3pf/hclge_trace.h | 8 +-
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h | 8 +-
fs/nfsd/trace.h | 10 +-
fs/tracefs/event_inode.c | 70 +-
fs/tracefs/inode.c | 1 -
fs/tracefs/internal.h | 2 -
include/linux/ftrace.h | 4 +-
include/linux/kernel.h | 1 +
include/linux/ring_buffer.h | 7 +
include/linux/trace_events.h | 3 +
include/linux/tracepoint.h | 6 +-
include/trace/events/sunrpc.h | 12 +-
include/trace/stages/stage2_data_offsets.h | 4 +-
include/trace/stages/stage5_get_offsets.h | 15 +-
include/trace/stages/stage6_event_callback.h | 27 +-
include/uapi/linux/trace_mmap.h | 48 ++
include/uapi/linux/user_events.h | 6 +-
kernel/sysctl.c | 4 +-
kernel/trace/ftrace.c | 90 +-
kernel/trace/ring_buffer.c | 472 ++++++++++-
kernel/trace/trace.c | 904 ++++++++-------------
kernel/trace/trace.h | 19 +-
kernel/trace/trace_benchmark.c | 5 +-
kernel/trace/trace_events_trigger.c | 63 +-
kernel/trace/trace_events_user.c | 209 +++--
kernel/trace/trace_sched_switch.c | 515 ++++++++++++
kernel/trace/trace_selftest.c | 2 +-
samples/trace_events/trace-events-sample.h | 18 +-
tools/testing/selftests/ring-buffer/Makefile | 8 +
tools/testing/selftests/ring-buffer/config | 2 +
tools/testing/selftests/ring-buffer/map_test.c | 273 +++++++
tools/testing/selftests/user_events/abi_test.c | 134 +++
38 files changed, 2341 insertions(+), 805 deletions(-)
create mode 100644 Documentation/trace/ring-buffer-map.rst
create mode 100644 include/uapi/linux/trace_mmap.h
create mode 100644 tools/testing/selftests/ring-buffer/Makefile
create mode 100644 tools/testing/selftests/ring-buffer/config
create mode 100644 tools/testing/selftests/ring-buffer/map_test.c
---------------------------


2024-03-16 16:31:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Fri, 15 Mar 2024 at 09:27, Steven Rostedt <[email protected]> wrote:
>
> - Add ring_buffer memory mappings

I pulled this, looked at it, and unpulled it again.

I don't want to have years of "fix up the mistakes after the fact".

This is all done entirely incorrectly, and just as an example of that,
subbuf_map_prepare() is another case of "tracing code works around the
fact that it did things wrong in the first place".

So instead of merging a new feature that was mis-designed and is
already having code working around its mis-design, I'm not merging it
at all.

Linus

2024-03-16 17:00:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, 16 Mar 2024 at 09:31, Linus Torvalds
<[email protected]> wrote:
>
> So instead of merging a new feature that was mis-designed and is
> already having code working around its mis-design, I'm not merging it
> at all.

Here's a clue: when hacking up VFS code, ask for ACK's from the VFS people.

And when hacking up MM code, make damn sure that you have VM people involved.

No more of this "random code that happens to work in my tests"
garbage. Yes, I'm sure that others have done this same disgusting page
counting hack and this was copied-and-pasted from some other
disgusting source, but because of all the history, I'm now looking at
tracing pulls arefully, and I'm simply not allowing any broken hacks.

So in addition to getting actual VM people to help you with mapping
stuff (hard requirement), I would also suggest:

- your allocation has to be live over the whole mmap (and that's due
to other fundamental issues - you're not even trying to deal with
actual dynamic allocations and thank Cthulhu for that), and the code
is literally designed that way, so then faulting pages in one at a
time and refcounting them one at a time is just pointless and wrong.
Just do it all at mmap time.

- I'd suggest marking it all VM_DONTCOPY | VM_IO | VM_DONTEXPAND to
not let people play games with the mapping.

- avoid all the sub-page ref-counts entirely by using VM_PFNMAP, and
use vm_insert_pages()

and a random note:

- from a TLB pressure standpoint, it might be a good idea to try to
keep the page table entries naturally aligned, so putting that one
status page at the beginning is likely a bad idea. It will typically
mean that hardware that can silently use larger TLB entries for
aligned pages won't be able to do so.

but the effect of that is likely fairly small.

Linus

2024-03-16 18:18:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, 16 Mar 2024 at 09:59, Linus Torvalds
<[email protected]> wrote:
>
> - I'd suggest marking it all VM_DONTCOPY | VM_IO | VM_DONTEXPAND to
> not let people play games with the mapping.

You already did set VM_DONTCOPY (and VM_DONTDUMP is a good idea too).
And you cleared VM_MAYWRITE. Those are all good.

I'd also suggest requiring the mma[ to be MAP_SHARED.

With a read-only mapping, that doesn't really do all that much, but I
don't think you actually need the vm_ops at all once you do everything
at mmap() time, and then it causes a SIGBUS instead of a "insert zero
page".

And _technically_ it could tell the architecture code to try to align
the mapping to the cache aliasing boundaries.

Of course, because of how you insert the meta-page at the beginning of
the mapping, you end up with the actual page table entries not aligned
anyway, so it doesn't actually help the cache coloring, but it's still
conceptually the right thing to do. So even if it ends up mostly just
a "document the fact that these are shared with the kernel" flag, I
think it's a good idea.

Linus

2024-03-16 18:20:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, 16 Mar 2024 09:59:57 -0700
Linus Torvalds <[email protected]> wrote:

> On Sat, 16 Mar 2024 at 09:31, Linus Torvalds
> <[email protected]> wrote:
> >
> > So instead of merging a new feature that was mis-designed and is
> > already having code working around its mis-design, I'm not merging it
> > at all.
>
> Here's a clue: when hacking up VFS code, ask for ACK's from the VFS people.
>
> And when hacking up MM code, make damn sure that you have VM people involved.
>
> No more of this "random code that happens to work in my tests"
> garbage. Yes, I'm sure that others have done this same disgusting page
> counting hack and this was copied-and-pasted from some other

I admit that this was taken from looking at how other code did it.

> disgusting source, but because of all the history, I'm now looking at
> tracing pulls arefully, and I'm simply not allowing any broken hacks.

That's pretty obvious.

>
> So in addition to getting actual VM people to help you with mapping
> stuff (hard requirement), I would also suggest:
>
> - your allocation has to be live over the whole mmap (and that's due
> to other fundamental issues - you're not even trying to deal with
> actual dynamic allocations and thank Cthulhu for that), and the code
> is literally designed that way, so then faulting pages in one at a
> time and refcounting them one at a time is just pointless and wrong.
> Just do it all at mmap time.
>
> - I'd suggest marking it all VM_DONTCOPY | VM_IO | VM_DONTEXPAND to
> not let people play games with the mapping.
>
> - avoid all the sub-page ref-counts entirely by using VM_PFNMAP, and
> use vm_insert_pages()
>
> and a random note:
>
> - from a TLB pressure standpoint, it might be a good idea to try to
> keep the page table entries naturally aligned, so putting that one
> status page at the beginning is likely a bad idea. It will typically
> mean that hardware that can silently use larger TLB entries for
> aligned pages won't be able to do so.

This is actually something that I wanted to do, but how much to
allocate was stored in that status page. I could easily make the status
page as big as the sub-buffers themselves so that everything is
naturally aligned. The status pages states how big it is.

>
> but the effect of that is likely fairly small.

I'm not going to be able to get this done this week, but I don't want
the other changes not to be pulled. As the ring buffer mappings near
the beginning, what do you want me to do in the mean time?

1) Rebase without them (I know how much you love rebasing)

2) revert the entire series.

3) Just make it disabled.

-- Steve


2024-03-16 18:43:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, 16 Mar 2024 at 11:20, Steven Rostedt <[email protected]> wrote:
>
> 1) Rebase without them (I know how much you love rebasing)

This.

Except honestly, the pulls are getting to be so complicated for me
because I have to check them, that I'd really like you to start doing
topic branches for individual things.

That's what we ended up doing with the security layers too, because
there were too many cases of "that is broken, I can't pull it", and
then having one single branch for everything meant that it was always
a "all or nothing" thing.

The security layer issues have largely gone away, but I still pull
things individually, and I think it actually ended up working out
well. Yes, I see more pulls, but not only are they clearer for me, the
code history ends up being much clearer too.

So topic branches tend to make for more actual pull requests, but when
the individual pulls are smaller and have clear "this branch does XYZ
and nothing more", it turns out that the actual effort per pull ends
up being less, and it actually clarifies things a lot too.

In fact, the x86 -tip people ended up doing topic branches just to
make things easier to review, rather than any "I can't pull that"
issues, and I think it actually ended up being something that they
preferred to do anyway.

Now, I'm not suggesting anything like the multiple topic branches from
-tip (from a quick check, there's been a total of 25 tip/tip topic
branches merged just this merge window), but for clear new features
definitely.

And no cross-merges between those topic branches, because that defeats
the whole purpose.

Do you have to do it for the current situation where I just can't take
the mmap stuff? No. But please look at it going forward.

Linus

2024-03-16 20:00:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, Mar 16, 2024 at 11:42:42AM -0700, Linus Torvalds wrote:
> Now, I'm not suggesting anything like the multiple topic branches from
> -tip (from a quick check, there's been a total of 25 tip/tip topic
> branches merged just this merge window), but for clear new features
> definitely.

So some of those branches are really tiny (1-2 patches) during some
cycles so I have often wondered whether I should merge those small
branches into a single pull...

So as not to have too many tiny pull requests.

Any preference?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-16 20:42:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, 16 Mar 2024 at 13:00, Borislav Petkov <[email protected]> wrote:
>
> On Sat, Mar 16, 2024 at 11:42:42AM -0700, Linus Torvalds wrote:
> > Now, I'm not suggesting anything like the multiple topic branches from
> > -tip (from a quick check, there's been a total of 25 tip/tip topic
> > branches merged just this merge window), but for clear new features
> > definitely.
>
> So some of those branches are really tiny (1-2 patches) during some
> cycles so I have often wondered whether I should merge those small
> branches into a single pull...
>
> So as not to have too many tiny pull requests.
>
> Any preference?

Not really any strong preferences.

The really tiny ones are so easy to pull that pulling a few random
ones just isn't an issue.

I've been known to occasionally end up doing an octopus merge if I
decide that I might as well just merge multiple small branches in one
go, but honestly, I stopped doing that because it's just simpler to do
two really trivial merges than to even bother thinking about "should I
just merge these all together".

So I don't mind getting three or more random small pulls if they all
still make sense (ie they are clearly separate things).

Now, if you send me three separate pulls for basically the same
conceptual thing, that might annoy me just because it would be so
pointless.

But if it's a "one pull to fix a single-line issue in resource
control, and another pull to fix a single-line issue in objtool", then
those make perfect sense to keep separate, even if they are both
trivial and small.

And on the other hand, if you have a couple of trivial branches with
no real pattern, and decide to just merge them into one that fixes
"misc x86 problems", and the end result is still completely trivial
and there are no surprises or gotchas, that's not wrong either.

And sometimes, merging and sending me just one pull request is
absolutely the right thing.

For example, the ARM SoC trees tend to just merge "umbrella" updates
into one single pull request, and I prefer that - because I see no
point in getting ten different "this is the drivers for SoC xyz"
thing.

So then it's still a clear topic branch ("ARM SoC drivers"), but they
kept multiple branches for different SoC's and sent me just one pull
request.

End result: there's no one right thing. Make it make sense. Probably
the only real rule is

- try to keep conceptually different things separate just for cleanliness

- definitely keep fundamental new features or anything that _might_
be questionable in a branch of its own

but there aren't some kind of black-and-white rules for "this is so
small that it's not worth sending on its own".

This merge window, I think I currently have something like ~15 merges
that ended up being literally just a couple of lines (maybe spread
over two or three files). I don't mind at all. If that's all that
happened, that's fine.

Linus

2024-03-16 21:10:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, Mar 16, 2024 at 01:42:06PM -0700, Linus Torvalds wrote:
> But if it's a "one pull to fix a single-line issue in resource
> control, and another pull to fix a single-line issue in objtool", then
> those make perfect sense to keep separate, even if they are both
> trivial and small.

Yeah, that's what they are - we do topic branches so it is always
separate topics - just sometimes they don't have too much going on
during a particular cycle.

Ok, understood - separate pulls, even if small ones, are fine.

> And on the other hand, if you have a couple of trivial branches with
> no real pattern, and decide to just merge them into one that fixes
> "misc x86 problems", and the end result is still completely trivial
> and there are no surprises or gotchas, that's not wrong either.

Ack.

> And sometimes, merging and sending me just one pull request is
> absolutely the right thing.
>
> For example, the ARM SoC trees tend to just merge "umbrella" updates
> into one single pull request, and I prefer that - because I see no
> point in getting ten different "this is the drivers for SoC xyz"
> thing.

Right.

I do a similar thing with the EDAC tree because it is all EDAC - no need
for separate pulls. We just keep them separate in case we have to rebase
early in the game to keep history clean. Yeah, well, rebase only if it
would get relatively ugly if not.

And then Tony or I merge them all into a single pull.

> So then it's still a clear topic branch ("ARM SoC drivers"), but they
> kept multiple branches for different SoC's and sent me just one pull
> request.
>
> End result: there's no one right thing. Make it make sense. Probably
> the only real rule is
>
> - try to keep conceptually different things separate just for cleanliness
>
> - definitely keep fundamental new features or anything that _might_
> be questionable in a branch of its own

Ok, understood. Makes sense.

> but there aren't some kind of black-and-white rules for "this is so
> small that it's not worth sending on its own".
>
> This merge window, I think I currently have something like ~15 merges
> that ended up being literally just a couple of lines (maybe spread
> over two or three files). I don't mind at all. If that's all that
> happened, that's fine.

.. and that is also some sort of documenting it: this area didn't see
that much development this cycle.

Thanks for explaining.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-17 12:14:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [GIT PULL] tracing: Updates for v6.9

On Sat, 16 Mar 2024 11:42:42 -0700
Linus Torvalds <[email protected]> wrote:

> On Sat, 16 Mar 2024 at 11:20, Steven Rostedt <[email protected]> wrote:
> >
> > 1) Rebase without them (I know how much you love rebasing)
>
> This.
>
> Except honestly, the pulls are getting to be so complicated for me
> because I have to check them, that I'd really like you to start doing
> topic branches for individual things.
>

We started doing that (with the trace probes and tools). But sure, I
can make a few more topic branches.

>
> Do you have to do it for the current situation where I just can't take
> the mmap stuff? No. But please look at it going forward.

Since I had to do a rebase, I'm rebasing it off of the last pull you
did from me for some fixes that didn't make it into 6.8. That way it
will include the fix in your tree that I need to add the string checks
for the TRACE_EVENT() __assign_str() and __string() checks.

I'm testing everything now.

-- Steve