LinuxLists.cc - [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

2024-03-02 16:10:39

by Steven Rostedt

[permalink] [raw]

Subject: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

Linus,

Tracing fix for 6.8-rc6:

- The change to allow trace_marker writes to be as big as the trace_seq can
hold, and also the change that increases the size of the trace_seq to two
pages, caused PowerPC kselftest trace_marker test to fail. The trace_marker
kselftest writes up to subbuffer size which is determined by PAGE_SIZE.
On PowerPC, the PAGE_SIZE can be 64K, which means the selftest will write
a string that is around 64K in size. The output of the trace_marker is
performed with a vsnprintf("%.*s", size, string), but this write would make
the size greater than 32K, which is the max precision of "%.*s", and that
causes a kernel warning. The fix is simply to keep the write of trace_marker
less than or equal to max signed short.

Please pull the latest trace-v6.8-rc6 tree, which can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace-v6.8-rc6

Tag SHA1: 81bbd14e994e785c073c62764e9479f26ef734a7
Head SHA1: 6e77fd570deda96cf3696a10c5bd7cc26cf0f687

Steven Rostedt (Google) (1):
tracing: Prevent trace_marker being bigger than unsigned short

----
kernel/trace/trace.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
---------------------------
commit 6e77fd570deda96cf3696a10c5bd7cc26cf0f687
Author: Steven Rostedt (Google) <[email protected]>
Date: Tue Feb 27 12:57:06 2024 -0500

tracing: Prevent trace_marker being bigger than unsigned short

The trace_marker write goes into the ring buffer. A test was added to
write a string as big as the sub-buffer of the ring buffer to see if it
would work. A sub-buffer is typically PAGE_SIZE in length.

On PowerPC architecture, the ftrace selftest for trace_marker started to
fail. This was due to PowerPC having a PAGE_SIZE of 65536 and not 4096. It
would try to write a string that was around 63000 bytes in size. This gave
the following warning:

------------[ cut here ]------------
precision 63492 too large
WARNING: CPU: 15 PID: 2538829 at lib/vsprintf.c:2721 set_precision+0x68/0xa4
Modules linked in:
CPU: 15 PID: 2538829 Comm: awk Tainted: G M O K 6.8.0-rc5-gfca7526b7d89 #1
Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_018) hv:phyp pSeries
NIP: c000000000f57c34 LR: c000000000f57c30 CTR: c000000000f5cdf0
REGS: c000000a58e4f5f0 TRAP: 0700 Tainted: G M O K (6.8.0-rc5-gfca7526b7d89)
MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 48000824 XER: 00000005
CFAR: c00000000016154c IRQMASK: 0
GPR00: c000000000f57c30 c000000a58e4f890 c000000001482800 0000000000000019
GPR04: 0000000100011559 c000000a58e4f660 c000000a58e4f658 0000000000000027
GPR08: c000000e84e37c10 0000000000000001 0000000000000027 c000000002a47e50
GPR12: 0000000000000000 c000000e87bf7300 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: c0000004a43ec590 0000000000400cc0 0000000000000003 c0000000012c3e65
GPR24: c000000a58e4fa18 0000000000000025 0000000000000020 000000000001ff97
GPR28: c0000001168a00dd c0000001168c0074 c000000a58e4f920 000000000000f804
NIP [c000000000f57c34] set_precision+0x68/0xa4
LR [c000000000f57c30] set_precision+0x64/0xa4
Call Trace:
[c000000a58e4f890] [c000000000f57c30] set_precision+0x64/0xa4 (unreliable)
[c000000a58e4f900] [c000000000f5ccc4] vsnprintf+0x198/0x4c8
[c000000a58e4f980] [c000000000f53228] seq_buf_vprintf+0x50/0xa0
[c000000a58e4f9b0] [c00000000031cec0] trace_seq_printf+0x60/0xe0
[c000000a58e4f9e0] [c00000000031b5f0] trace_print_print+0x78/0xa4
[c000000a58e4fa60] [c0000000003133a4] print_trace_line+0x2ac/0x6d8
[c000000a58e4fb20] [c0000000003145c0] s_show+0x58/0x2c0
[c000000a58e4fba0] [c0000000005dfb2c] seq_read_iter+0x448/0x618
[c000000a58e4fc70] [c0000000005dfe08] seq_read+0x10c/0x174
[c000000a58e4fd10] [c00000000059a7e0] vfs_read+0xe0/0x39c
[c000000a58e4fdc0] [c00000000059b59c] ksys_read+0x7c/0x140
[c000000a58e4fe10] [c000000000035d74] system_call_exception+0x134/0x330
[c000000a58e4fe50] [c00000000000d6a0] system_call_common+0x160/0x2e4

The problem was that in trace_print_print() that reads the trace_marker
write data had the following code:

int max = iter->ent_size - offsetof(struct print_entry, buf);

[..]
trace_seq_printf(s, ": %.*s", max, field->buf);

Where "max" was the size of the entry. Now that the write to trace_marker
can be as big as what the sub-buffer can hold, and the sub-buffer for
powerpc is 64K in size, the "max" value was: 63492, and that was passed to
trace_seq_printf() which eventually calls vsnprintf() with the same format
and parameters.

The max "precision" that "%.*s" can be is max signed short (32767) where
63492 happens to be greater than.

Prevent the max size written by trace_marker to be greater than what a
signed short can hold.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]

Cc: Mathieu Desnoyers <[email protected]>
Reported-by: Sachin Sant <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Fixes: 8ec90be7f15f ("tracing: Allow for max buffer data size trace_marker writes")
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..1606fa99367b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7310,7 +7310,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
/* Used in tracing_mark_raw_write() as well */
#define FAULTED_STR "<faulted>"
#define FAULTED_SIZE (sizeof(FAULTED_STR) - 1) /* '\0' is already accounted for */
-
+#ifndef SHORT_MAX
+#define SHORT_MAX ((1<<15) - 1)
+#endif
if (tracing_disabled)
return -EINVAL;

@@ -7328,6 +7330,16 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
if (cnt < FAULTED_SIZE)
size += FAULTED_SIZE - cnt;

+ /*
+ * trace_print_print() uses vsprintf() to determine the size via
+ * the precision format "%.*s" which can not be greater than
+ * a signed short.
+ */
+ if (size > SHORT_MAX) {
+ cnt -= size - SHORT_MAX;
+ goto again;
+ }
+
if (size > TRACE_SEQ_BUFFER_SIZE) {
cnt -= size - TRACE_SEQ_BUFFER_SIZE;
goto again;

2024-03-02 17:25:06

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 at 08:10, Steven Rostedt <[email protected]> wrote:
>
> - The change to allow trace_marker writes to be as big as the trace_seq can
> hold, and also the change that increases the size of the trace_seq to two
> pages, caused PowerPC kselftest trace_marker test to fail. The trace_marker
> kselftest writes up to subbuffer size which is determined by PAGE_SIZE.
> On PowerPC, the PAGE_SIZE can be 64K, which means the selftest will write
> a string that is around 64K in size. The output of the trace_marker is
> performed with a vsnprintf("%.*s", size, string), but this write would make
> the size greater than 32K, which is the max precision of "%.*s", and that
> causes a kernel warning. The fix is simply to keep the write of trace_marker
> less than or equal to max signed short.

Please don't just add random limits that are based on other random limits.

That printk warning is for "you did something obviously crazy".

That does NOT MEAN that you now should limit your strings to something
JUST BORDERLINE CRAZY.

See?

There is not a way in hell that printing a 32kB string in tracing is
valid. EVER.

So stop it. Stop making limits be some random implementation detail.
Make limits *sane*.

Make a *sane* limit for tracing. Not a "avoid being called crazy" limit.

Honestly, I suspect that a sane limit for tracing strings is likely on
the order of tens or maybe hundreds of bytes. Not some kind of "fits
in a short" that is just printk saying "I refuse to waste memory on
the stack".

Side note: for similar reasons the field-width is a 24-bit integer.
And no, if you think that passing a 16MB field width is sane, you need
to rethink your life. Again, that's a small implementation detail, not
a "let's explore how stupid we can be".

Linus

2024-03-02 20:00:08

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 09:24:37 -0800
Linus Torvalds <[email protected]> wrote:

> On Sat, 2 Mar 2024 at 08:10, Steven Rostedt <[email protected]> wrote:
> >
> > - The change to allow trace_marker writes to be as big as the trace_seq can
> > hold, and also the change that increases the size of the trace_seq to two
> > pages, caused PowerPC kselftest trace_marker test to fail. The trace_marker
> > kselftest writes up to subbuffer size which is determined by PAGE_SIZE.
> > On PowerPC, the PAGE_SIZE can be 64K, which means the selftest will write
> > a string that is around 64K in size. The output of the trace_marker is
> > performed with a vsnprintf("%.*s", size, string), but this write would make
> > the size greater than 32K, which is the max precision of "%.*s", and that
> > causes a kernel warning. The fix is simply to keep the write of trace_marker
> > less than or equal to max signed short.
>
> Please don't just add random limits that are based on other random limits.

It's not random limits, it's resource limits.

>
> That printk warning is for "you did something obviously crazy".
>
> That does NOT MEAN that you now should limit your strings to something
> JUST BORDERLINE CRAZY.

I don't have control over the strings. Anyone can do in user space:

fd = open("/sys/kernel/tracing/trace_marker", O_WRONLY);
r = write(fd, huge_string, 10000000);

And this code only gives you what is returned in 'r'. It doesn't error
out. It just limits what the max write size is. I just default it to
what the resources available are.

>
> See?
>
> There is not a way in hell that printing a 32kB string in tracing is
> valid. EVER.

Well, the limit is really PAGE_SIZE, which on most architectures is
4096, but on PowerPC, PAGE_SIZE is 64K. And the test in
tools/testing/selftests/ftrace/test.d/00basic/trace_marker.tc finds the
PAGE_SIZE and writes a string as long as it to see if it doesn't crash
the kernel. And all the resources can hold a 60K write. The problem
that this patch addresses is that the vsnprintf() used to move the data
into seq_file has a precision variable that checks for overflow, and it
has a max of 32K.

Yes, in most cases, 4K is the limit, which is why this doesn't trigger
on any architecture that has 4K page sizes.

>
> So stop it. Stop making limits be some random implementation detail.
> Make limits *sane*.

The "implementation detail" is PAGE_SIZE. Similar to writing large
amounts of data to pipes and sockets. It may not write all data, but
just a smaller amount. The write doesn't error, it just says "this is
all I can write that you passed to me".

>
> Make a *sane* limit for tracing. Not a "avoid being called crazy" limit.

What arbitrary limit do I do? It's just changes how the string will be
broken up, as "echo" or "cat" into trace_marker will continue writing the rest
of the string. It doesn't cause errors in the write. It simply breaks
the string up into smaller blocks.

>
> Honestly, I suspect that a sane limit for tracing strings is likely on
> the order of tens or maybe hundreds of bytes. Not some kind of "fits
> in a short" that is just printk saying "I refuse to waste memory on
> the stack".

The error isn't printk, it's vsnprintf() that is writing to a seq_file
to user space. There's no stack or printk involved here.

trace_seq_printf(s, ": %.*s", max, field->buf);

Where 's' is a trace_seq with a PAGE_SIZE buffer, that later gets passed
to seq_file.

>
> Side note: for similar reasons the field-width is a 24-bit integer.
> And no, if you think that passing a 16MB field width is sane, you need
> to rethink your life. Again, that's a small implementation detail, not
> a "let's explore how stupid we can be".

I really don't understand what you mean by the above. This code is what
user space can write into the tracing ring buffer. If the ring buffer
sub-buffer is 64K, and user space does a 32K write into it, why prevent
that? The only limit here, is that the vsnprintf() has a max of signed
short for the precision it uses, which I used to prevent overflow.
That vsnprintf which writes into a memory buffer that will be sent out
via seq_file doesn't like huge strings greater than 32K, even though
what it is writing into is big enough to hold it.

-- Steve

2024-03-02 20:25:34

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 at 12:00, Steven Rostedt <[email protected]> wrote:
>
> I don't have control over the strings. Anyone can do in user space:
>
> fd = open("/sys/kernel/tracing/trace_marker", O_WRONLY);
> r = write(fd, huge_string, 10000000);

So?

Stop the stupidity.

You already limit the string.

Just limit it to a sane value. if somebody uses a 10kB trace marker,
return an error, or just truncate it to 100 bytes.

You already were willing to truncate it to 32kB. Use your brain, and
realize that 32kB is a ridiculous limit.

Why do I even need to tell you this? I'm getting really tired of
having these idiotic arguments with you.

Linus

2024-03-02 20:26:57

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 09:24:37 -0800
Linus Torvalds <[email protected]> wrote:

> Honestly, I suspect that a sane limit for tracing strings is likely on
> the order of tens or maybe hundreds of bytes.

BTW, the max used to be 1K, and I've had people ask me to increase it,
as they build up a log in their own buffer, and then do a "dump" into
the trace_marker when an anomaly happens in their application, when
reading it out, they did not like how it was broke up.

-- Steve

2024-03-02 20:33:28

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 at 12:00, Steven Rostedt <[email protected]> wrote:
>
> The error isn't printk, it's vsnprintf() that is writing to a seq_file
> to user space. There's no stack or printk involved here.

Look again. The code uses 'struct printf_spec' and we literally have a

static_assert(sizeof(struct printf_spec) == 8);

because we want the compiler to generate sane calling conventions and
not waste space and code with arguments on the stack. That's literally
why we do all those limits in a bitfield - because the code in
question is written to say "unreasonable people can go screw
themselves".

I'm not interested in arguing this. We're not doing some completely
idiotic "let's edge up to the physical limit of what our printk code
is willing to do".

I'm perfectly happy having that WARN_ON() to continue to tell people
they are doing stupid things that won't work.

And if you ever decide that a sane limit is ok, you can send that in.

Linus

2024-03-02 20:47:24

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 12:33:01 -0800
Linus Torvalds <[email protected]> wrote:

> And if you ever decide that a sane limit is ok, you can send that in.

I'm fine with just making it 4K with a comment saying that "4K is the
minimum page size on most archs, and to keep this consistent for crazy
architectures like PowerPC and it's 64K pages, we hard code 4K to keep
all architectures acting the same".

-- Steve

2024-03-02 20:55:39

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 at 12:47, Steven Rostedt <[email protected]> wrote:
>
> I'm fine with just making it 4K with a comment saying that "4K is the
> minimum page size on most archs, and to keep this consistent for crazy
> architectures like PowerPC and it's 64K pages, we hard code 4K to keep
> all architectures acting the same".

4k is at least a somewhat sane limit, and yes, being hw-independent is
a good idea.

We have other strings that have that limit for similar reasons (ie PATH_MAX).

Linus

2024-03-03 12:59:48

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sat, 2 Mar 2024 12:55:10 -0800
Linus Torvalds <[email protected]> wrote:

> On Sat, 2 Mar 2024 at 12:47, Steven Rostedt <[email protected]> wrote:
> >
> > I'm fine with just making it 4K with a comment saying that "4K is the
> > minimum page size on most archs, and to keep this consistent for crazy
> > architectures like PowerPC and it's 64K pages, we hard code 4K to keep
> > all architectures acting the same".
>
> 4k is at least a somewhat sane limit, and yes, being hw-independent is
> a good idea.
>
> We have other strings that have that limit for similar reasons (ie PATH_MAX).
>

I believe I was trying to solve this wrong. The vsprintf() is called on
reading of the ring buffer, and I was trying to fix it on the write
side. I believe that the fix should be on the read side where the
vsprintf() is called.

Sachin, can you try this patch.

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3e7fa44dc2b2..5c7962d612de 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1588,11 +1588,20 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
struct print_entry *field;
struct trace_seq *s = &iter->seq;
int max = iter->ent_size - offsetof(struct print_entry, buf);
+ const char *p;

trace_assign_type(field, iter->ent);

seq_print_ip_sym(s, field->ip, flags);
- trace_seq_printf(s, ": %.*s", max, field->buf);
+ trace_seq_puts(s, ": ");
+ /* Write 1K chunks at a time */
+ p = field->buf;
+ do {
+ int pre = max > 1024 ? 1024 : max;
+ trace_seq_printf(s, "%.*s", pre, p);
+ max -= pre;
+ p += pre;
+ } while (max > 0);

return trace_handle_return(s);
}
@@ -1602,10 +1611,19 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags,
{
struct print_entry *field;
int max = iter->ent_size - offsetof(struct print_entry, buf);
+ const char *p;

trace_assign_type(field, iter->ent);

- trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
+ trace_seq_printf(&iter->seq, "# %lx", field->ip);
+ /* Write 1K chunks at a time */
+ p = field->buf;
+ do {
+ int pre = max > 1024 ? 1024 : max;
+ trace_seq_printf(&iter->seq, "%.*s", pre, p);
+ max -= pre;
+ p += pre;
+ } while (max > 0);

return trace_handle_return(&iter->seq);
}

2024-03-03 13:02:25

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sun, 3 Mar 2024 07:59:37 -0500
Steven Rostedt <[email protected]> wrote:
> >
>
> I believe I was trying to solve this wrong. The vsprintf() is called on
> reading of the ring buffer, and I was trying to fix it on the write
> side. I believe that the fix should be on the read side where the
> vsprintf() is called.

I prefer this patch, as I found other ways to add large strings into the
ring buffer that can trigger this issue again. This fixes it at the
location of the bug.

-- Steve

>
> Sachin, can you try this patch.
>
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 3e7fa44dc2b2..5c7962d612de 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -1588,11 +1588,20 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
> struct print_entry *field;
> struct trace_seq *s = &iter->seq;
> int max = iter->ent_size - offsetof(struct print_entry, buf);
> + const char *p;
>
> trace_assign_type(field, iter->ent);
>
> seq_print_ip_sym(s, field->ip, flags);
> - trace_seq_printf(s, ": %.*s", max, field->buf);
> + trace_seq_puts(s, ": ");
> + /* Write 1K chunks at a time */
> + p = field->buf;
> + do {
> + int pre = max > 1024 ? 1024 : max;
> + trace_seq_printf(s, "%.*s", pre, p);
> + max -= pre;
> + p += pre;
> + } while (max > 0);
>
> return trace_handle_return(s);
> }
> @@ -1602,10 +1611,19 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags,
> {
> struct print_entry *field;
> int max = iter->ent_size - offsetof(struct print_entry, buf);
> + const char *p;
>
> trace_assign_type(field, iter->ent);
>
> - trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
> + trace_seq_printf(&iter->seq, "# %lx", field->ip);
> + /* Write 1K chunks at a time */
> + p = field->buf;
> + do {
> + int pre = max > 1024 ? 1024 : max;
> + trace_seq_printf(&iter->seq, "%.*s", pre, p);
> + max -= pre;
> + p += pre;
> + } while (max > 0);
>
> return trace_handle_return(&iter->seq);
> }

2024-03-03 17:38:33

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sun, 3 Mar 2024 at 04:59, Steven Rostedt <[email protected]> wrote:
>
> - trace_seq_printf(s, ": %.*s", max, field->buf);
> + trace_seq_puts(s, ": ");
> + /* Write 1K chunks at a time */
> + p = field->buf;
> + do {
> + int pre = max > 1024 ? 1024 : max;
> + trace_seq_printf(s, "%.*s", pre, p);
> + max -= pre;
> + p += pre;
> + } while (max > 0);

The above loop is complete garbage.

If 'p' is a string, you're randomly just walking past the end of the
string with 'p += pre'

And if 'o' isn't a string but has a fixed size, you shouldn't use '%s'
in the first place, you should just use seq_write().

Just stop. You are doing things entirely wrong, and you're just adding
random code.

I'm not taking *any* fixes from you as things are now, you're once
again only making things worse.

What was wrong with saying "don't do that"? You seem to be bending
over backwards to doing stupid things, and then insisting on doing
them entirely wrong.

Linus

2024-03-03 19:07:20

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sun, 3 Mar 2024 09:38:04 -0800
Linus Torvalds <[email protected]> wrote:

> On Sun, 3 Mar 2024 at 04:59, Steven Rostedt <[email protected]> wrote:
> >
> > - trace_seq_printf(s, ": %.*s", max, field->buf);
> > + trace_seq_puts(s, ": ");
> > + /* Write 1K chunks at a time */
> > + p = field->buf;
> > + do {
> > + int pre = max > 1024 ? 1024 : max;
> > + trace_seq_printf(s, "%.*s", pre, p);
> > + max -= pre;
> > + p += pre;
> > + } while (max > 0);
>
> The above loop is complete garbage.
>
> If 'p' is a string, you're randomly just walking past the end of the
> string with 'p += pre'

The string in question isn't some random string. It's a print event on
the ring buffer where the size is strlen(p) rounded up to word size.
That means, max will be no bigger than word-size - 1 greater than
strlen(p). That means the chunks of 1024 will never land in the middle
of garbage.

And even if for some reason it did, it would simply print garbage in
the output that is already available to user space via reading the raw
ring buffer.

>
> And if 'o' isn't a string but has a fixed size, you shouldn't use '%s'
> in the first place, you should just use seq_write().

The precision actually isn't needed. I added it just in case for some
reason a bug happened and the \0 was truncated.

>
> Just stop. You are doing things entirely wrong, and you're just adding
> random code.
>
> I'm not taking *any* fixes from you as things are now, you're once
> again only making things worse.
>
> What was wrong with saying "don't do that"? You seem to be bending
> over backwards to doing stupid things, and then insisting on doing
> them entirely wrong.

Don't do what?

The previous change was just adding some random limit to a write into
the ring buffer, to prevent one of the readers of the ring buffer from
overflowing the precision check.

Hell, I'm sorry I added that precision check. I guess this could all be
solved with:

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3e7fa44dc2b2..d8b302d01083 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1587,12 +1587,11 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
{
struct print_entry *field;
struct trace_seq *s = &iter->seq;
- int max = iter->ent_size - offsetof(struct print_entry, buf);

trace_assign_type(field, iter->ent);

seq_print_ip_sym(s, field->ip, flags);
- trace_seq_printf(s, ": %.*s", max, field->buf);
+ trace_seq_printf(s, ": %s", field->buf);

return trace_handle_return(s);
}
@@ -1601,11 +1600,10 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags,
struct trace_event *event)
{
struct print_entry *field;
- int max = iter->ent_size - offsetof(struct print_entry, buf);

trace_assign_type(field, iter->ent);

- trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
+ trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);

return trace_handle_return(&iter->seq);
}

Because the string should always end with a '\0' in the first
place.

-- Steve

2024-03-03 20:10:08

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sun, 3 Mar 2024 at 11:07, Steven Rostedt <[email protected]> wrote:
>
> The string in question isn't some random string. It's a print event on
> the ring buffer where the size is strlen(p) rounded up to word size.
> That means, max will be no bigger than word-size - 1 greater than
> strlen(p). That means the chunks of 1024 will never land in the middle
> of garbage.

What a piece of unbelievable crap.

So you didn't actually want the precision in the first place, then you
started limiting it to an insane value because the printk code
complains about insane precision, and then you want to "fix" that by
printing it out in chunks where you know the chunk size won't hit
garbage, but that ends up being a random implementation detail that
you don't even document in that chunking code.

> > What was wrong with saying "don't do that"? You seem to be bending
> > over backwards to doing stupid things, and then insisting on doing
> > them entirely wrong.
>
> Don't do what?

You have this pattern of not actually thinking through the code AT
ALL, andc just fixing symptoms, and then making the code worse.

The whole "let's avoid the symptom of the kernel printk code telling
us that 32kB string precision is crazy by putting a 32kB-1 limit on
it" was clearly just papering over a symptom, not fixing the problem.

Doing insane chunking in 1kB pieces was another "let's paper over the
symptom, not fix the problem".

And now you finally admit that the actual problem was that YOUR
PRECISION WAS STUPID TO BEGIN WITH.

Do you really not see what the truly _fundamental_ problem here is?

Kernel code doesn't "paper over" stuff. We do things *right*. No more
of this crap.

You really need to take a deep look at what you are doing. I spend
more time on your pull requests than I want to, exactly because you
have had this pattern of doing something wrong in the first place, and
then adding MORE CODE to paper over all the problems that initial
wrong decision causes.

This was *exactly* the same same thing that happened on the tracefs
side. You did things wrong, and then you spent a lot of effort in
trying to patch up the resulting problems, instead of going back and
doing it *right*.

And honestly, I still think that the fundamental mistake you have done
is to let people say "I want to have these big strings" and you just
roll over and say "sure, we'll create shit kernel code for you".

WTF do you think it's fine to say "let people do insane things"
instead of just telling people that no, we have sane and small limits.

As a maintainer, one of your jobs is to say "No, we're not doing crazy
stuff". I still think that having so big strings that this came up in
the first place is a sign of the deeper problem, and then the fact
that you had an insane and pointless precision field was just a small
implementation issue.

Doing tracing in the kernel is not some kind of general-purpose thing.
It's ok - in fact, it's a really damn good idea - to just tell people
"yes, you can add strings, but dammit, there needs to be sanity to
it".

So I now tell you that you should

(a) get rid of the stupid and nonsensical precision

(b) tell people that their string are limited (and that 4kB is an
_upper_ value to sane string lengths in the kernel)

(c) really fundamentally stop with the "paper over" things approach
to kernel programming

Large strings are not a "feature". They are a bug.

It's also sad that apparently your strings are counted, but you don't
count them very well, so instead of just using the count (which is
*much* cheaper) you end up using '%s' and do things until you hit a
NUL byte. Guess what? All our printk infrastructure is designed for
small strings, so '%s' isn't exactly optimized, because we expected
sanity. It ends up in a loop that literally does things one byte at a
time.

And no. The solution is *not* to paper that over by making '%s'
printing more efficient. It's not supposed to need that kind of
efficiency.

Christ. *IF* large strings were a good idea, and you actually almost
have the length encoded, this whole tracing code could have used the
fact that you have that approximate count to do something like

len = fieldsize-1;
len &= ~(sizeof(unsigned long)-1);
len += strlen(fieldbase + len);
seq_write(buf, fieldbase, len);

and at least now you'd have something *efficient*. Which is at least a
source of pride in itself.

But since I really think the core of the problem is "we shouldn't have
allowed this kind of crap", I think efficiency - while a source of
pride - is polishing a turd and more of the "paper over the
fundamental issue".

And no, the above code sequence isn't wonderfully pretty either. Using
'strlen()' to find the last NUL character in a word is disgusting, and
our kernel 'strlen()' isn't some optimized thing either.

We do have fancier cases for fancier code (the word-at-a-time stuff),
but at least the above only walks things one byte at a time for a tiny
sequence.

Linus

2024-03-03 21:00:37

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sun, 3 Mar 2024 12:09:37 -0800
Linus Torvalds <[email protected]> wrote:

>
> Doing insane chunking in 1kB pieces was another "let's paper over the
> symptom, not fix the problem".
>
> And now you finally admit that the actual problem was that YOUR
> PRECISION WAS STUPID TO BEGIN WITH.

A little more background. I wanted my tracing buffer to take bigger
events like perf does. Specifically to get stack traces from user
space. perf allows for greater than PAGE_SIZE events, so I had to do
the same to get this to work.

The code was changed a bit which also allowed trace_marker to use the
full buffer as well. I even used trace_marker to test the "large
events" code.

The problem that happened here was PowerPC's 64k PAGE_SIZE which also
allowed the trace_marker to take a string that's 64K. I never did test
that as I don't have any machines with 64k PAGE_SIZE. If I didn't add a
trace_marker test to test a large string as well, I would never have
seen this.

>
> Do you really not see what the truly _fundamental_ problem here is?
>
> Kernel code doesn't "paper over" stuff. We do things *right*. No more
> of this crap.
>
> You really need to take a deep look at what you are doing. I spend
> more time on your pull requests than I want to, exactly because you
> have had this pattern of doing something wrong in the first place, and
> then adding MORE CODE to paper over all the problems that initial
> wrong decision causes.
>
> This was *exactly* the same same thing that happened on the tracefs
> side. You did things wrong, and then you spent a lot of effort in
> trying to patch up the resulting problems, instead of going back and
> doing it *right*.

This is different than tracefs. I actually do understand the code here,
where as in tracefs there was a lot I didn't understand.

The only reason I added a precision is that during the code change I
truncated the string on the write due to a bug that didn't write the
full event, and the print caused a crash due to the missing '\0'. I
figured by adding a precision it would "harden" the code as the missing
'\0' was a bug and should never happen by design.

>
> And honestly, I still think that the fundamental mistake you have done
> is to let people say "I want to have these big strings" and you just
> roll over and say "sure, we'll create shit kernel code for you".

Writing a bit over 4K isn't that crazy. Writing over 32K is, but that
is only possible on PowerPC due to allocating PAGE_SIZE buffers that
are 64K in size.

>
> WTF do you think it's fine to say "let people do insane things"
> instead of just telling people that no, we have sane and small limits.

System calls are expensive, writing more in one call is better than
writing lots of little calls. Again, 32K is stupid, but it was based on
the kernel allocated buffer size.

>
> As a maintainer, one of your jobs is to say "No, we're not doing crazy
> stuff". I still think that having so big strings that this came up in
> the first place is a sign of the deeper problem, and then the fact
> that you had an insane and pointless precision field was just a small
> implementation issue.

BTW, I never did add any code change to allow for the large strings
because someone asked me to. That was a side effect. Yes, people asked
for it because it caused "funny output", and I told them sorry. And
that's it.

But to add code for the stack traces I changed the buffers to allow
larger than 4K. The increase string size write was a side effect of that
work. And it even helped to make an easy way to test large events as I
used the trace_marker in my tests. I added a test to write into the
trace_marker to test these large events as well.

It wasn't until Sachin reported that the test caused the precision
overflow on PowerPC. I didn't even know there was a way to write more
than 32K.

>
> Doing tracing in the kernel is not some kind of general-purpose thing.
> It's ok - in fact, it's a really damn good idea - to just tell people
> "yes, you can add strings, but dammit, there needs to be sanity to
> it".
>
> So I now tell you that you should
>
> (a) get rid of the stupid and nonsensical precision

I can do that.

>
> (b) tell people that their string are limited (and that 4kB is an
> _upper_ value to sane string lengths in the kernel)

Well, I actually did tell people that. I just changed the code for
other reasons that gave this "feature" as a side effect, and only on
PowerPC does it get insanely large.

>
> (c) really fundamentally stop with the "paper over" things approach
> to kernel programming
>
> Large strings are not a "feature". They are a bug.
>
> It's also sad that apparently your strings are counted, but you don't
> count them very well, so instead of just using the count (which is
> *much* cheaper) you end up using '%s' and do things until you hit a
> NUL byte. Guess what? All our printk infrastructure is designed for
> small strings, so '%s' isn't exactly optimized, because we expected
> sanity. It ends up in a loop that literally does things one byte at a
> time.

Note, this is the slow path, as it only happens on reading the trace or
trace_pipe files. And again, these strings in most use cases are small.
If someone cares about performance of the read, then they should read
trace_pipe_raw which just reads the raw data without any string
processing.

-- Steve

2024-03-04 21:40:23

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Sun, 3 Mar 2024 16:00:24 -0500
Steven Rostedt <[email protected]> wrote:

> > So I now tell you that you should
> >
> > (a) get rid of the stupid and nonsensical precision
>
> I can do that.

As I mentioned that the design is based on that the allocated buffer size is
the string length rounded up to the word size, all I need to do is to make
sure that there's a nul terminating byte within the last word of the
allocated buffer. Then "%s" is all I need.

Although, when writing this I found that it isn't rounded from the size of
the string itself, but because I allocate a bit more than what is written
to trace_marker, in case I need to append a '\n' and '\0' just word size
checking isn't enough. Doing two words is more than enough to find the
terminating nul unless there's a bug, in which case this would trigger a
warning.

Would this work for you?

I tested this on both 32 bit and 64 bit machines, with the following command:

# cd /sys/kernel/tracing
# for s in 80 480 1000 1450 2000 3000 4050 5500; do
let c=$s+64;
for i in `seq $s $c` ; do
str=`printf -- 'X%.0s' $(seq $i)`;
echo "write $i";
echo "$str" > trace_marker;
done;
done
# cat trace

-- Steve

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3e7fa44dc2b2..848a78bab20e 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1581,6 +1581,25 @@ static struct trace_event trace_bprint_event = {
.funcs = &trace_bprint_funcs,
};

+/*
+ * Strings in the print entry are stored by their length rounded
+ * up to the nearest word size. The write to the buffer also allocates
+ * a couple of extra bytes in case it needs to append a '\n' and '\0'
+ * if the passed in string doesn't contain them. Check up to two words
+ * in length back to make sure there's a terminating nul.
+ */
+static int test_trace_string(int len, const char *str)
+{
+ int test = sizeof(long) * 2;
+
+ for (int i = 0; len > i && i < test; i++) {
+ if (!str[len - (i + 1)])
+ return 0;
+ }
+ WARN_ONCE(1, "Trace print string missing nul terminator %d", len);
+ return -1;
+}
+
/* TRACE_PRINT */
static enum print_line_t trace_print_print(struct trace_iterator *iter,
int flags, struct trace_event *event)
@@ -1591,8 +1610,11 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,

trace_assign_type(field, iter->ent);

+ if (test_trace_string(max, field->buf))
+ return TRACE_TYPE_UNHANDLED;
+
seq_print_ip_sym(s, field->ip, flags);
- trace_seq_printf(s, ": %.*s", max, field->buf);
+ trace_seq_printf(s, ": %s", field->buf);

return trace_handle_return(s);
}
@@ -1605,7 +1627,10 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags,

trace_assign_type(field, iter->ent);

- trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
+ if (test_trace_string(max, field->buf))
+ return TRACE_TYPE_UNHANDLED;
+
+ trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);

return trace_handle_return(&iter->seq);
}

2024-03-04 21:50:52

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Mon, 4 Mar 2024 at 13:40, Steven Rostedt <[email protected]> wrote:
>
> As I mentioned that the design is based on that the allocated buffer size is
> the string length rounded up to the word size, all I need to do is to make
> sure that there's a nul terminating byte within the last word of the
> allocated buffer. Then "%s" is all I need.

Please don't add pointless code that helps nothing.

> Would this work for you?

No. This code only adds debug code, and doesn't actually improve anything.

We *have* debug code already. Things like KASAN already find array
overruns, and your ex-tempore debug code adds zero actual value.

That, btw, is why your old stupid precision code was not only
triggering warnings, but was ACTIVELY DETRIMENTAL.

All that precision code could ever do was to potentially hide bugs if
the string wasn't NUL-terminated.

So no. I absolutely do NOT want you to write more code to hide bugs or
do half-arsed checking.

I want you to *simplify* the code, and put proper limits in place for strings.

I want to see the code that actually notices when somebody generates a
crazy string, and stops that garbage in its tracks.

What I do *not* want to see is more ad-hoc code that tries to deal
with the symptoms of you not having done so.

Linus

2024-03-04 22:12:12

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Mon, 4 Mar 2024 13:50:13 -0800
Linus Torvalds <[email protected]> wrote:

> On Mon, 4 Mar 2024 at 13:40, Steven Rostedt <[email protected]> wrote:
> >
> > As I mentioned that the design is based on that the allocated buffer size is
> > the string length rounded up to the word size, all I need to do is to make
> > sure that there's a nul terminating byte within the last word of the
> > allocated buffer. Then "%s" is all I need.
>
> Please don't add pointless code that helps nothing.
>
> > Would this work for you?
>
> No. This code only adds debug code, and doesn't actually improve anything.
>
> We *have* debug code already. Things like KASAN already find array
> overruns, and your ex-tempore debug code adds zero actual value.

Sorry I thought debug code was OK. But I guess I was mistaken. KASAN isn't
run in the field, where this would trigger. But I get your point. If it's
passing my tests (which I do run with KASAN), I guess that's good enough
for you.

>
> That, btw, is why your old stupid precision code was not only
> triggering warnings, but was ACTIVELY DETRIMENTAL.
>
> All that precision code could ever do was to potentially hide bugs if
> the string wasn't NUL-terminated.
>
> So no. I absolutely do NOT want you to write more code to hide bugs or
> do half-arsed checking.

Well, it wouldn't hide it. It would trigger a big fat warning if it was
missing a nul terminator.

>
> I want you to *simplify* the code, and put proper limits in place for strings.
>
> I want to see the code that actually notices when somebody generates a
> crazy string, and stops that garbage in its tracks.
>
> What I do *not* want to see is more ad-hoc code that tries to deal
> with the symptoms of you not having done so.

This warning is just making sure the code is nul terminated. It has nothing
to do with size. The bug that triggered when I was working on other code
was a miscalculation of the input. I didn't write the entire string into
the ring buffer which meant that the terminating nul was also missing. On
reading the string, it crashed the kernel.

I put in the precision when debugging the code, and that's where I found the
mismatch in string size vs writing to the buffer. I then kept the precision
just in case I hit a similar bug. Which is what you have issues with.

Fine, I'll just remove the precision as that's not needed. There was no
other overflows involved here.

-- Steve

2024-03-04 23:21:21

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Mon, 4 Mar 2024 at 14:08, Steven Rostedt <[email protected]> wrote:
>
> Fine, I'll just remove the precision as that's not needed. There was no
> other overflows involved here.

I really want you to add the size check on the trace buffer *creation* side.

I don't understand why you refuse to accept the fact that the
precision warning found a PROBLEM.

And no, the fix was never to paper over the problem by limiting the
precision field. Hiding a problem isn't fixing it.

And no, the fix was also never to chop up the printing of the string
in smaller pieces to hide paper over the precision field. Again,
hiding a problem isn't fixing it.

And finally, NO, the fix was also never to add extra debug code to see
that there was a NUL character there.

The fix was *always* to simply not accept insanely long strings in the
first place, and make sure that the field was correctly *set*.

IOW, at *creation* time the code needed a proper check for length
(which obviously indirectly includes checking for the terminating NUL
character at that point).

Why do these threads with you always have to end up this long? Why do
I Nhave to explain every single step of the way that you need to *FIX*
the problem, not try to hide it with new extra code.

Linus

2024-03-04 23:46:46

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Mon, 4 Mar 2024 15:20:52 -0800
Linus Torvalds <[email protected]> wrote:

> On Mon, 4 Mar 2024 at 14:08, Steven Rostedt <[email protected]> wrote:
> >
> > Fine, I'll just remove the precision as that's not needed. There was no
> > other overflows involved here.
>
> I really want you to add the size check on the trace buffer *creation* side.
>
> I don't understand why you refuse to accept the fact that the
> precision warning found a PROBLEM.

And what exactly was that problem? That someone wrote a huge string into
the trace_marker on a machine that had huge page sizes?

What exactly broke?

There was no overflow. No bad memory. KASAN is happy.

>
> And no, the fix was never to paper over the problem by limiting the
> precision field. Hiding a problem isn't fixing it.

What was broken? The fact that we allowed the buffer to be 64K on a system
that has 64K PAGE_SIZE and we limited the strings to the size of what the
buffer can hold?

I would limit the buffers themselves but they are used in page mappings
which needs to be PAGE_SIZE. Which is most cases is only 4K.

>
> And no, the fix was also never to chop up the printing of the string
> in smaller pieces to hide paper over the precision field. Again,
> hiding a problem isn't fixing it.

Yes, I didn't like that change either.

>
> And finally, NO, the fix was also never to add extra debug code to see
> that there was a NUL character there.

That was not a fix, but just a paranoid check. The only times that my
paranoid checks usually trigger is during development which saves a few
hours of debugging as they point out exactly what broke.

>
> The fix was *always* to simply not accept insanely long strings in the
> first place, and make sure that the field was correctly *set*.

And what exactly is an "insane size"?

The buffer is allocated by page size, and when writing to it I make sure it
fits. I'm now supposed to add some code to tell the user, "No you can't
write that much into the buffer, even though we allocated enough to fit it"?

If I had never added that precision (which I just recently added), there
would have been no bug report, because there would have been no bug.

>
> IOW, at *creation* time the code needed a proper check for length
> (which obviously indirectly includes checking for the terminating NUL
> character at that point).
>
> Why do these threads with you always have to end up this long? Why do
> I Nhave to explain every single step of the way that you need to *FIX*
> the problem, not try to hide it with new extra code.

Because I still don't know what exactly the problem is! I'm supposed to add
some arbitrary limit to what people can enter just because we think it's
crazy to have big strings?

If I remove the precision, as I did in this patch. Then adding a limit is
not fixing any bug. It's just adding a limit.

-- Steve

2024-03-04 23:51:00

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Mon, 4 Mar 2024 18:47:25 -0500
Steven Rostedt <[email protected]> wrote:

> If I remove the precision, as I did in this patch. Then adding a limit is
> not fixing any bug. It's just adding a limit.

Anyway, I can change the limit to 4K and also change the trace_seq size to
just 8K to hold events, instead of using PAGE_SIZE.

But this still isn't fixing anything. It's just adding a limit.

I'll add that as a separate patch.

-- Steve

2024-03-05 00:19:28

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Mon, 4 Mar 2024 at 15:50, Steven Rostedt <[email protected]> wrote:
>
> But this still isn't fixing anything. It's just adding a limit.

Limiting things to a common maximum size is a good thing. The kernel
limits much more important things for very good reasons.

The kernel really shouldn't have big strings. EVER. And it literally
shows in our kernel infrastructure. It showed in that vsnprintf
precision thing. It shows in our implementation choices, where we tend
to have simplistic implementations because doing things a byte at a
time is simple and cheap when the strings are limited in size (and we
don't want fancy and can't use vector state anyway).

If something as core as a pathname can be limited to 4kB, then
something as unimportant as a trace string had better be limited too.
Because we simply DO NOT WANT to have to deal with longer strings in
the kernel.

Linus

2024-03-05 00:42:02

by Steven Rostedt

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On Mon, 4 Mar 2024 16:17:15 -0800
Linus Torvalds <[email protected]> wrote:

> On Mon, 4 Mar 2024 at 15:50, Steven Rostedt <[email protected]> wrote:
> >
> > But this still isn't fixing anything. It's just adding a limit.
>
> Limiting things to a common maximum size is a good thing. The kernel
> limits much more important things for very good reasons.
>
> The kernel really shouldn't have big strings. EVER. And it literally
> shows in our kernel infrastructure. It showed in that vsnprintf
> precision thing. It shows in our implementation choices, where we tend
> to have simplistic implementations because doing things a byte at a
> time is simple and cheap when the strings are limited in size (and we
> don't want fancy and can't use vector state anyway).
>
> If something as core as a pathname can be limited to 4kB, then
> something as unimportant as a trace string had better be limited too.
> Because we simply DO NOT WANT to have to deal with longer strings in
> the kernel.
>

So I made three patches that do basically what you want. And as a bonus,
it's not really an arbitrary limit but based on trace_seq size.

The first patch will be removing the precision check, as that's not needed.

The second patch is to remove the dependency between trace_seq and
PAGE_SIZE, as its size really can just be 8K for all architectures. Which
has the side effect of limiting the size of trace_marker, as its size is
limited by the trace_seq size.

Finally, because the trace_seq defines the max output that a trace_event
can write (for all its fields), the extra data of a print event could
possibly overflow that, which will cause the event not to print, and just
an "OVERFLOW" output would show in the trace buffer. So I used the
TRACE_SEQ_SIZE / 2 as the max size that trace_marker can read, which
happens to be 4K.

-- Steve

2024-03-05 01:20:14

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [GIT PULL] tracing: Prevent trace_marker being bigger than unsigned short

On 2024-03-04 19:43, Steven Rostedt wrote:
> On Mon, 4 Mar 2024 16:17:15 -0800
> Linus Torvalds <[email protected]> wrote:
>
>> On Mon, 4 Mar 2024 at 15:50, Steven Rostedt <[email protected]> wrote:
>>>
>>> But this still isn't fixing anything. It's just adding a limit.
>>
>> Limiting things to a common maximum size is a good thing. The kernel
>> limits much more important things for very good reasons.
>>
>> The kernel really shouldn't have big strings. EVER. And it literally
>> shows in our kernel infrastructure. It showed in that vsnprintf
>> precision thing. It shows in our implementation choices, where we tend
>> to have simplistic implementations because doing things a byte at a
>> time is simple and cheap when the strings are limited in size (and we
>> don't want fancy and can't use vector state anyway).
>>
>> If something as core as a pathname can be limited to 4kB, then
>> something as unimportant as a trace string had better be limited too.
>> Because we simply DO NOT WANT to have to deal with longer strings in
>> the kernel.
>>
>
> So I made three patches that do basically what you want. And as a bonus,
> it's not really an arbitrary limit but based on trace_seq size.
>
> The first patch will be removing the precision check, as that's not needed.
>
> The second patch is to remove the dependency between trace_seq and
> PAGE_SIZE, as its size really can just be 8K for all architectures. Which
> has the side effect of limiting the size of trace_marker, as its size is
> limited by the trace_seq size.
>
> Finally, because the trace_seq defines the max output that a trace_event
> can write (for all its fields), the extra data of a print event could
> possibly overflow that, which will cause the event not to print, and just
> an "OVERFLOW" output would show in the trace buffer. So I used the
> TRACE_SEQ_SIZE / 2 as the max size that trace_marker can read, which
> happens to be 4K.

Steven, see my other reply. This is backwards.

You can leave the trace_seq as is if you want. It's the trace marking
input size that should be #define to 4kB, not defined as
half-the-size-of-an-internal-buffer-that-happens-to-be-8k.

Then add a BUILD_BUG_ON() in the output code to make sure the output
buffer is always large enough.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com