2022-01-10 16:55:39

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

From: Steven Rostedt <[email protected]>

Pingfan reported that the following causes a fault:

echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
echo 1 > events/syscalls/sys_enter_at/enable

The reason is that trace event filter treats the user space pointer
defined by "filename" as a normal pointer to compare against the "cpu"
string. If the string is not loaded into memory yet, it will trigger a
fault in kernel space:

kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60
#PF: supervisor read access in kernel mode
#PF: error_code(0x0001) - permissions violation
PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867
Oops: 0001 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:strlen+0x0/0x20
Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11
48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8
48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000
RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60
RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000
R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380
R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0
FS: 00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
filter_pred_pchar+0x18/0x40
filter_match_preds+0x31/0x70
ftrace_syscall_enter+0x27a/0x2c0
syscall_trace_enter.constprop.0+0x1aa/0x1d0
do_syscall_64+0x16/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa861d88664

To be even more robust, test both kernel and user space strings. If the
string fails to read, then simply have the filter fail.

Link: https://lore.kernel.org/all/[email protected]/

Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Tom Zanussi <[email protected]>
Reported-by: Pingfan Liu <[email protected]>
Fixes: 87a342f5db69d ("tracing/filters: Support filtering for char * strings")
Signed-off-by: Steven Rostedt <[email protected]>
---
Changes since v1: https://lkml.kernel.org/r/[email protected]

- Up the size of temp buffer from 512 to 1024
Still less than PATH_MAX, but should be good enough in real systems.
If this is an issue, then we can add a way for users to updated it.

- Add documentation about the limit of comparing to string pointers.

- Add typecast from str to ustr to denote adding "__user"

- Allocate buffer only when comparing with string pointers.

- Only do the test on string pointers and not all string compares.

Documentation/trace/events.rst | 10 +++++
kernel/trace/trace_events_filter.c | 59 ++++++++++++++++++++++++++++--
2 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index 8ddb9b09451c..45e66a60a816 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -230,6 +230,16 @@ Currently the caret ('^') for an error always appears at the beginning of
the filter string; the error message should still be useful though
even without more accurate position info.

+5.2.1 Filter limitations
+------------------------
+
+If a filter is placed on a string pointer ``(char *)`` that does not point
+to a string on the ring buffer, but instead points to kernel or user space
+memory, then, for safety reasons, at most 1024 bytes of the content is
+copied onto a temporary buffer to do the compare. If the copy of the memory
+faults (the pointer points to memory that should not be accessed), then the
+string compare will be treated as not matching.
+
5.3 Clearing filters
--------------------

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 996920ed1812..91352a64be09 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -5,6 +5,7 @@
* Copyright (C) 2009 Tom Zanussi <[email protected]>
*/

+#include <linux/uaccess.h>
#include <linux/module.h>
#include <linux/ctype.h>
#include <linux/mutex.h>
@@ -654,6 +655,40 @@ DEFINE_EQUALITY_PRED(32);
DEFINE_EQUALITY_PRED(16);
DEFINE_EQUALITY_PRED(8);

+/* user space strings temp buffer */
+#define USTRING_BUF_SIZE 1024
+
+struct ustring_buffer {
+ char buffer[USTRING_BUF_SIZE];
+};
+
+static __percpu struct ustring_buffer *ustring_per_cpu;
+
+static __always_inline char *test_string(char *str)
+{
+ struct ustring_buffer *ubuf;
+ char __user *ustr;
+ char *kstr;
+
+ if (!ustring_per_cpu)
+ return NULL;
+
+ ubuf = this_cpu_ptr(ustring_per_cpu);
+ kstr = ubuf->buffer;
+
+ if (likely((unsigned long)str >= TASK_SIZE)) {
+ /* For safety, do not trust the string pointer */
+ if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
+ return NULL;
+ } else {
+ /* user space address? */
+ ustr = (char __user *)str;
+ if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
+ return NULL;
+ }
+ return kstr;
+}
+
/* Filter predicate for fixed sized arrays of characters */
static int filter_pred_string(struct filter_pred *pred, void *event)
{
@@ -671,10 +706,16 @@ static int filter_pred_string(struct filter_pred *pred, void *event)
static int filter_pred_pchar(struct filter_pred *pred, void *event)
{
char **addr = (char **)(event + pred->offset);
+ char *str;
int cmp, match;
- int len = strlen(*addr) + 1; /* including tailing '\0' */
+ int len;

- cmp = pred->regex.match(*addr, &pred->regex, len);
+ str = test_string(*addr);
+ if (!str)
+ return 0;
+
+ len = strlen(str) + 1; /* including tailing '\0' */
+ cmp = pred->regex.match(str, &pred->regex, len);

match = cmp ^ pred->not;

@@ -1348,8 +1389,17 @@ static int parse_pred(const char *str, void *data,
pred->fn = filter_pred_strloc;
} else if (field->filter_type == FILTER_RDYN_STRING)
pred->fn = filter_pred_strrelloc;
- else
+ else {
+
+ if (!ustring_per_cpu) {
+ /* Once allocated, keep it around for good */
+ ustring_per_cpu = alloc_percpu(struct ustring_buffer);
+ if (!ustring_per_cpu)
+ goto err_mem;
+ }
+
pred->fn = filter_pred_pchar;
+ }
/* go past the last quote */
i++;

@@ -1415,6 +1465,9 @@ static int parse_pred(const char *str, void *data,
err_free:
kfree(pred);
return -EINVAL;
+err_mem:
+ kfree(pred);
+ return -ENOMEM;
}

enum {
--
2.33.0



2022-01-10 17:11:57

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

From: Steven Rostedt
> Sent: 10 January 2022 16:56
>
> From: Steven Rostedt <[email protected]>
>
> Pingfan reported that the following causes a fault:
>
> echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
> echo 1 > events/syscalls/sys_enter_at/enable
>
> The reason is that trace event filter treats the user space pointer
> defined by "filename" as a normal pointer to compare against the "cpu"
> string. If the string is not loaded into memory yet, it will trigger a
> fault in kernel space:

If a userspace pointer can end up the kernel structure then presumably
a 'dodgy' user program can supply an arbitrary kernel address instead?
This may give the user the ability to read arbitrary kernel addresses
(including ones that are mapped to PCIe IO addresses).
Doesn't sound good at all.

...
> + if (likely((unsigned long)str >= TASK_SIZE)) {
> + /* For safety, do not trust the string pointer */
> + if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
> + return NULL;
> + } else {
> + /* user space address? */
> + ustr = (char __user *)str;
> + if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
> + return NULL;

Is that check against TASK_SIZE even correct for all architectures?
copy_to/from_user() uses access_ok() - which is architecture dependant.

I think you need to remember where the pointer came from.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2022-01-10 17:24:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Mon, 10 Jan 2022 17:11:52 +0000
David Laight <[email protected]> wrote:

> From: Steven Rostedt
> > Sent: 10 January 2022 16:56
> >
> > From: Steven Rostedt <[email protected]>
> >
> > Pingfan reported that the following causes a fault:
> >
> > echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
> > echo 1 > events/syscalls/sys_enter_at/enable
> >
> > The reason is that trace event filter treats the user space pointer
> > defined by "filename" as a normal pointer to compare against the "cpu"
> > string. If the string is not loaded into memory yet, it will trigger a
> > fault in kernel space:
>
> If a userspace pointer can end up the kernel structure then presumably
> a 'dodgy' user program can supply an arbitrary kernel address instead?
> This may give the user the ability to read arbitrary kernel addresses
> (including ones that are mapped to PCIe IO addresses).
> Doesn't sound good at all.

Only root has access to the information read here. All tracing requires
root or those explicitly given access to the tracing data, which pretty
much allows all access to kernel internals (including all memory). So
nothing to worry about here ;-)

>
> ...
> > + if (likely((unsigned long)str >= TASK_SIZE)) {
> > + /* For safety, do not trust the string pointer */
> > + if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
> > + return NULL;
> > + } else {
> > + /* user space address? */
> > + ustr = (char __user *)str;
> > + if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
> > + return NULL;
>
> Is that check against TASK_SIZE even correct for all architectures?
> copy_to/from_user() uses access_ok() - which is architecture dependant.

The problem with access_ok() (which I tried first) is that it can't be used
from interrupt context, and this check can happen in interrupt context.
Either way, if we pick the wrong one for the arch, the only thing bad that
can happen is that it returns "fault" and the filter fails, just like if
the pointer was to bad memory.

>
> I think you need to remember where the pointer came from.
>

Why?

-- Steve

2022-01-10 17:29:44

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Mon, 10 Jan 2022 12:24:36 -0500
Steven Rostedt <[email protected]> wrote:

> > > Pingfan reported that the following causes a fault:
> > >
> > > echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
> > > echo 1 > events/syscalls/sys_enter_at/enable
> > >
> > > The reason is that trace event filter treats the user space pointer
> > > defined by "filename" as a normal pointer to compare against the "cpu"
> > > string. If the string is not loaded into memory yet, it will trigger a
> > > fault in kernel space:
> >
> > If a userspace pointer can end up the kernel structure then presumably
> > a 'dodgy' user program can supply an arbitrary kernel address instead?
> > This may give the user the ability to read arbitrary kernel addresses
> > (including ones that are mapped to PCIe IO addresses).
> > Doesn't sound good at all.
>
> Only root has access to the information read here. All tracing requires
> root or those explicitly given access to the tracing data, which pretty
> much allows all access to kernel internals (including all memory). So
> nothing to worry about here ;-)

BTW, this patch doesn't give any new access to anything. It only protects
if the pointer is pointing someplace that will give a fault. The user could
today point that pointer to anything in kernel memory and trace it (but
only read it if you are root). And worse, if the pointer is to something
that can fault, it will trigger a BUG(). This could happen if root creates
the filter, and a non-privileged user sets that pointer to something bad.
Which means anyone can cause the machine to fault if root does this kind
of tracing. This is why I added the stable tag.

This patch is to prevent that BUG() from happening.

-- Steve

2022-01-10 21:58:38

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

From: Steven Rostedt
> Sent: 10 January 2022 17:25
...
> > ...
> > > + if (likely((unsigned long)str >= TASK_SIZE)) {
> > > + /* For safety, do not trust the string pointer */
> > > + if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
> > > + return NULL;
> > > + } else {
> > > + /* user space address? */
> > > + ustr = (char __user *)str;
> > > + if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
> > > + return NULL;
> >
> > Is that check against TASK_SIZE even correct for all architectures?
> > copy_to/from_user() uses access_ok() - which is architecture dependant.
>
> The problem with access_ok() (which I tried first) is that it can't be used
> from interrupt context, and this check can happen in interrupt context.
> Either way, if we pick the wrong one for the arch, the only thing bad that
> can happen is that it returns "fault" and the filter fails, just like if
> the pointer was to bad memory.

Isn't there also at least one architecture where you can't differentiate
between user and kernel pointers by looking at the address?
(Something like sparc ASI is used for user accesses so both user
and kernel get the full 4G address range. But it isn't sparc (or pdp/11))
ISTR it causing issues with the code for kernel_setsockopt() and
required a separate flag.

Put that together with something that needs user_access_begin()
to bracket user accesses and you probably fail big-time.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2022-01-10 22:03:25

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

From: Steven Rostedt
> Sent: 10 January 2022 17:25
>
> On Mon, 10 Jan 2022 17:11:52 +0000
> David Laight <[email protected]> wrote:
>
> > From: Steven Rostedt
> > > Sent: 10 January 2022 16:56
> > >
> > > From: Steven Rostedt <[email protected]>
> > >
> > > Pingfan reported that the following causes a fault:
> > >
> > > echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
> > > echo 1 > events/syscalls/sys_enter_at/enable
> > >
> > > The reason is that trace event filter treats the user space pointer
> > > defined by "filename" as a normal pointer to compare against the "cpu"
> > > string. If the string is not loaded into memory yet, it will trigger a
> > > fault in kernel space:
> >
> > If a userspace pointer can end up the kernel structure then presumably
> > a 'dodgy' user program can supply an arbitrary kernel address instead?
> > This may give the user the ability to read arbitrary kernel addresses
> > (including ones that are mapped to PCIe IO addresses).
> > Doesn't sound good at all.
>
> Only root has access to the information read here. All tracing requires
> root or those explicitly given access to the tracing data, which pretty
> much allows all access to kernel internals (including all memory). So
> nothing to worry about here ;-)

Is this filtering trace using a filename passed to a system call by a user program?
In which case a user program can set up a system call that normally fails
(because the copy_from_user() errors) but if root tries to run a system
call event trace on that process can read arbitrary addresses and
thus crash the system?

While unlikely root might be persuaded to try to run the trace.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2022-01-11 00:21:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Mon, 10 Jan 2022 22:03:20 +0000
David Laight <[email protected]> wrote:

> > Only root has access to the information read here. All tracing requires
> > root or those explicitly given access to the tracing data, which pretty
> > much allows all access to kernel internals (including all memory). So
> > nothing to worry about here ;-)
>
> Is this filtering trace using a filename passed to a system call by a user program?
> In which case a user program can set up a system call that normally fails
> (because the copy_from_user() errors) but if root tries to run a system
> call event trace on that process can read arbitrary addresses and
> thus crash the system?
>
> While unlikely root might be persuaded to try to run the trace.

Yes. That's exactly what the code does today, and why it's a bug.

This patch instead uses copy_from_user_nofault/copy_from_kernel_nofault and
copies it into a temp buffer and then compares against that.

If a user passes in a crazy pointer, the copy_from_user/kernel_nofault()
will not read it, and the filter simply fails to match. Nothing bad will
happen.

-- Steve

2022-01-11 20:56:22

by Sven Schnelle

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

David Laight <[email protected]> writes:

> From: Steven Rostedt
>> Sent: 10 January 2022 17:25
> ...
>> > ...
>> > > + if (likely((unsigned long)str >= TASK_SIZE)) {
>> > > + /* For safety, do not trust the string pointer */
>> > > + if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
>> > > + return NULL;
>> > > + } else {
>> > > + /* user space address? */
>> > > + ustr = (char __user *)str;
>> > > + if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
>> > > + return NULL;
>> >
>> > Is that check against TASK_SIZE even correct for all architectures?
>> > copy_to/from_user() uses access_ok() - which is architecture dependant.
>>
>> The problem with access_ok() (which I tried first) is that it can't be used
>> from interrupt context, and this check can happen in interrupt context.
>> Either way, if we pick the wrong one for the arch, the only thing bad that
>> can happen is that it returns "fault" and the filter fails, just like if
>> the pointer was to bad memory.
>
> Isn't there also at least one architecture where you can't differentiate
> between user and kernel pointers by looking at the address?
> (Something like sparc ASI is used for user accesses so both user
> and kernel get the full 4G address range. But it isn't sparc (or pdp/11))
> ISTR it causing issues with the code for kernel_setsockopt() and
> required a separate flag.

On s390 TASK_SIZE is defined as -PAGE_SIZE, so with the patch above the
kernel would always try to fetch it from user space. I think it would be
the same for parisc.

> Put that together with something that needs user_access_begin()
> to bracket user accesses and you probably fail big-time.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2022-01-12 04:13:14

by Pingfan Liu

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

Hi Steven,

Sorry that I am out of office, and not reply in time.

On Mon, Jan 10, 2022 at 12:24:36PM -0500, Steven Rostedt wrote:
> On Mon, 10 Jan 2022 17:11:52 +0000
> David Laight <[email protected]> wrote:
>
> > From: Steven Rostedt
> > > Sent: 10 January 2022 16:56
> > >
> > > From: Steven Rostedt <[email protected]>
> > >
> > > Pingfan reported that the following causes a fault:
> > >
> > > echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
> > > echo 1 > events/syscalls/sys_enter_at/enable
> > >
> > > The reason is that trace event filter treats the user space pointer
> > > defined by "filename" as a normal pointer to compare against the "cpu"
> > > string. If the string is not loaded into memory yet, it will trigger a
> > > fault in kernel space:

For accurate commit log, the swapped-out user page is not the root cause
of this bug is "supervisor read access in kernel mode". And it is trueth
that swapped-out user page can trigger a bug here, but it should be a
different a stack.

> >
> > If a userspace pointer can end up the kernel structure then presumably
> > a 'dodgy' user program can supply an arbitrary kernel address instead?
> > This may give the user the ability to read arbitrary kernel addresses
> > (including ones that are mapped to PCIe IO addresses).
> > Doesn't sound good at all.
>
> Only root has access to the information read here. All tracing requires
> root or those explicitly given access to the tracing data, which pretty
> much allows all access to kernel internals (including all memory). So
> nothing to worry about here ;-)
>

I am not sure about the opposite way. Since kernel is not allowed to
access userspace most of the time, then is it an leakage, which looks
like:
use tracepoint as trampoline to uaccess.
read out user info from ustring_per_cpu

But any kernel code can call copy_from_user() function family freely, so
it is not a problem caused by this patch, right? Or ustring_per_cpu
should be zeroed out.

For V2, feel free to add "Tested-by"


Thanks,

Pingfan
> >
> > ...
> > > + if (likely((unsigned long)str >= TASK_SIZE)) {
> > > + /* For safety, do not trust the string pointer */
> > > + if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
> > > + return NULL;
> > > + } else {
> > > + /* user space address? */
> > > + ustr = (char __user *)str;
> > > + if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
> > > + return NULL;
> >
> > Is that check against TASK_SIZE even correct for all architectures?
> > copy_to/from_user() uses access_ok() - which is architecture dependant.
>
> The problem with access_ok() (which I tried first) is that it can't be used
> from interrupt context, and this check can happen in interrupt context.
> Either way, if we pick the wrong one for the arch, the only thing bad that
> can happen is that it returns "fault" and the filter fails, just like if
> the pointer was to bad memory.
>
> >
> > I think you need to remember where the pointer came from.
> >
>
> Why?
>
> -- Steve

2022-01-13 17:58:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Tue, 11 Jan 2022 21:55:53 +0100
Sven Schnelle <[email protected]> wrote:

> > Isn't there also at least one architecture where you can't differentiate
> > between user and kernel pointers by looking at the address?
> > (Something like sparc ASI is used for user accesses so both user
> > and kernel get the full 4G address range. But it isn't sparc (or pdp/11))
> > ISTR it causing issues with the code for kernel_setsockopt() and
> > required a separate flag.
>
> On s390 TASK_SIZE is defined as -PAGE_SIZE, so with the patch above the
> kernel would always try to fetch it from user space. I think it would be
> the same for parisc.

As a work around for these cases, would something like this work?

-- Steve

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 91352a64be09..06013822764c 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -676,7 +676,15 @@ static __always_inline char *test_string(char *str)
ubuf = this_cpu_ptr(ustring_per_cpu);
kstr = ubuf->buffer;

- if (likely((unsigned long)str >= TASK_SIZE)) {
+ /*
+ * Test the address of ustring_per_cpu against TASK_SIZE, as
+ * comparing TASK_SIZE to determine kernel/user space address
+ * is not enough on some architectures. If the address is less
+ * than TASK_SIZE we know this is the case, in which we should
+ * always use the from_kernel variant.
+ */
+ if ((unsigned long)&ustring_per_cpu < (unsigned long)TASK_SIZE ||
+ likely((unsigned long)str >= TASK_SIZE)) {
/* For safety, do not trust the string pointer */
if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
return NULL;

2022-01-13 18:04:10

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Wed, 12 Jan 2022 12:13:03 +0800
Pingfan Liu <[email protected]> wrote:

> Hi Steven,
>
> Sorry that I am out of office, and not reply in time.

And I've been very sick for the last few days :-p

>
> On Mon, Jan 10, 2022 at 12:24:36PM -0500, Steven Rostedt wrote:
> > On Mon, 10 Jan 2022 17:11:52 +0000
> > David Laight <[email protected]> wrote:
> >
> > > From: Steven Rostedt
> > > > Sent: 10 January 2022 16:56
> > > >
> > > > From: Steven Rostedt <[email protected]>
> > > >
> > > > Pingfan reported that the following causes a fault:
> > > >
> > > > echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
> > > > echo 1 > events/syscalls/sys_enter_at/enable
> > > >
> > > > The reason is that trace event filter treats the user space pointer
> > > > defined by "filename" as a normal pointer to compare against the "cpu"
> > > > string. If the string is not loaded into memory yet, it will trigger a
> > > > fault in kernel space:
>
> For accurate commit log, the swapped-out user page is not the root cause
> of this bug is "supervisor read access in kernel mode". And it is trueth
> that swapped-out user page can trigger a bug here, but it should be a
> different a stack.

Yeah, it's true that the bug that triggered is the supervisor access, but
that's more of a security concern (and an option that triggers this, to
find bugs of this kind :-) To me, the real bug is the random read of a
pointer. I can update the change log to be a bit more specific.

>
> > >
> > > If a userspace pointer can end up the kernel structure then presumably
> > > a 'dodgy' user program can supply an arbitrary kernel address instead?
> > > This may give the user the ability to read arbitrary kernel addresses
> > > (including ones that are mapped to PCIe IO addresses).
> > > Doesn't sound good at all.
> >
> > Only root has access to the information read here. All tracing requires
> > root or those explicitly given access to the tracing data, which pretty
> > much allows all access to kernel internals (including all memory). So
> > nothing to worry about here ;-)
> >
>
> I am not sure about the opposite way. Since kernel is not allowed to
> access userspace most of the time, then is it an leakage, which looks
> like:
> use tracepoint as trampoline to uaccess.
> read out user info from ustring_per_cpu
>
> But any kernel code can call copy_from_user() function family freely, so
> it is not a problem caused by this patch, right? Or ustring_per_cpu
> should be zeroed out.

And you can use tracing to create a kprobe that can read user space at
almost any place in the kernel. Hence, we have API that allows root to do
this, which is why I'm not concerned about it.

>
> For V2, feel free to add "Tested-by"
>
>

Thanks,

-- Steve

2022-01-13 21:28:33

by Sven Schnelle

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

Hi Steve,

Steven Rostedt <[email protected]> writes:

> On Tue, 11 Jan 2022 21:55:53 +0100
> Sven Schnelle <[email protected]> wrote:
>
>> > Isn't there also at least one architecture where you can't differentiate
>> > between user and kernel pointers by looking at the address?
>> > (Something like sparc ASI is used for user accesses so both user
>> > and kernel get the full 4G address range. But it isn't sparc (or pdp/11))
>> > ISTR it causing issues with the code for kernel_setsockopt() and
>> > required a separate flag.
>>
>> On s390 TASK_SIZE is defined as -PAGE_SIZE, so with the patch above the
>> kernel would always try to fetch it from user space. I think it would be
>> the same for parisc.
>
> As a work around for these cases, would something like this work?

Hmm, i don't see how. On s390, TASK_SIZE is -PAGE_SIZE, which means
0xfffffffffffff000 so i think the if() condition below is always true.

Too bad that the __user attribute is stripped during a normal compile.
But couldn't we add the information whether a pointer belongs to user
or kernel space in the trace event definition? For syscall tracing it's
easy, because pointer types in SYSCALL_DEFINE() and friends are always
userspace pointers?

> -- Steve
>
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 91352a64be09..06013822764c 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -676,7 +676,15 @@ static __always_inline char *test_string(char *str)
> ubuf = this_cpu_ptr(ustring_per_cpu);
> kstr = ubuf->buffer;
>
> - if (likely((unsigned long)str >= TASK_SIZE)) {
> + /*
> + * Test the address of ustring_per_cpu against TASK_SIZE, as
> + * comparing TASK_SIZE to determine kernel/user space address
> + * is not enough on some architectures. If the address is less
> + * than TASK_SIZE we know this is the case, in which we should
> + * always use the from_kernel variant.
> + */
> + if ((unsigned long)&ustring_per_cpu < (unsigned long)TASK_SIZE ||
> + likely((unsigned long)str >= TASK_SIZE)) {
> /* For safety, do not trust the string pointer */
> if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
> return NULL;

2022-01-13 21:51:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Thu, 13 Jan 2022 22:28:01 +0100
Sven Schnelle <[email protected]> wrote:
>
> Hmm, i don't see how. On s390, TASK_SIZE is -PAGE_SIZE, which means
> 0xfffffffffffff000 so i think the if() condition below is always true.

Yes, I did that to just use the kernel version and not the user space one.

This is just a workaround for now.

>
> Too bad that the __user attribute is stripped during a normal compile.
> But couldn't we add the information whether a pointer belongs to user
> or kernel space in the trace event definition? For syscall tracing it's
> easy, because pointer types in SYSCALL_DEFINE() and friends are always
> userspace pointers?

We could add something later. As it is currently the merge window, and this
is a real bug, I'm going to just leave it as is, and we can work to fix the
other archs later. I need to get a pull request ready by tomorrow.

-- Steve

2022-01-13 22:02:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Wed, 12 Jan 2022 12:13:03 +0800
Pingfan Liu <[email protected]> wrote:

> > > > The reason is that trace event filter treats the user space pointer
> > > > defined by "filename" as a normal pointer to compare against the "cpu"
> > > > string. If the string is not loaded into memory yet, it will trigger a
> > > > fault in kernel space:
>
> For accurate commit log, the swapped-out user page is not the root cause
> of this bug is "supervisor read access in kernel mode". And it is trueth
> that swapped-out user page can trigger a bug here, but it should be a
> different a stack.

I updated the change log.

>
> For V2, feel free to add "Tested-by"

Thanks.

I also added a comment in the code. Here's v3:

-- Steve

From: Steven Rostedt <[email protected]>
Subject: [PATCH] tracing: Add test for user space strings when filtering on
string pointers

Pingfan reported that the following causes a fault:

echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
echo 1 > events/syscalls/sys_enter_at/enable

The reason is that trace event filter treats the user space pointer
defined by "filename" as a normal pointer to compare against the "cpu"
string. The following bug happened:

kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60
#PF: supervisor read access in kernel mode
#PF: error_code(0x0001) - permissions violation
PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867
Oops: 0001 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:strlen+0x0/0x20
Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11
48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8
48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000
RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60
RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000
R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380
R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0
FS: 00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
filter_pred_pchar+0x18/0x40
filter_match_preds+0x31/0x70
ftrace_syscall_enter+0x27a/0x2c0
syscall_trace_enter.constprop.0+0x1aa/0x1d0
do_syscall_64+0x16/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa861d88664

The above happened because the kernel tried to access user space directly
and triggered a "supervisor read access in kernel mode" fault. Worse yet,
the memory could not even be loaded yet, and a SEGFAULT could happen as
well. This could be true for kernel space accessing as well.

To be even more robust, test both kernel and user space strings. If the
string fails to read, then simply have the filter fail.

Note, TASK_SIZE is used to determine if the pointer is user or kernel space
and the appropriate strncpy_from_kernel/user_nofault() function is used to
copy the memory. For some architectures, the compare to TASK_SIZE may always
pick user space or kernel space. If it gets it wrong, the only thing is that
the filter will fail to match. In the future, this needs to be fixed to have
the event denote which should be used. But failing a filter is much better
than panicing the machine, and that can be solved later.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]

Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Tom Zanussi <[email protected]>
Reported-by: Pingfan Liu <[email protected]>
Tested-by: Pingfan Liu <[email protected]>
Fixes: 87a342f5db69d ("tracing/filters: Support filtering for char * strings")
Signed-off-by: Steven Rostedt <[email protected]>
---
Documentation/trace/events.rst | 10 +++++
kernel/trace/trace_events_filter.c | 66 ++++++++++++++++++++++++++++--
2 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index 8ddb9b09451c..45e66a60a816 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -230,6 +230,16 @@ Currently the caret ('^') for an error always appears at the beginning of
the filter string; the error message should still be useful though
even without more accurate position info.

+5.2.1 Filter limitations
+------------------------
+
+If a filter is placed on a string pointer ``(char *)`` that does not point
+to a string on the ring buffer, but instead points to kernel or user space
+memory, then, for safety reasons, at most 1024 bytes of the content is
+copied onto a temporary buffer to do the compare. If the copy of the memory
+faults (the pointer points to memory that should not be accessed), then the
+string compare will be treated as not matching.
+
5.3 Clearing filters
--------------------

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 996920ed1812..2e9ef64e9ee9 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -5,6 +5,7 @@
* Copyright (C) 2009 Tom Zanussi <[email protected]>
*/

+#include <linux/uaccess.h>
#include <linux/module.h>
#include <linux/ctype.h>
#include <linux/mutex.h>
@@ -654,6 +655,47 @@ DEFINE_EQUALITY_PRED(32);
DEFINE_EQUALITY_PRED(16);
DEFINE_EQUALITY_PRED(8);

+/* user space strings temp buffer */
+#define USTRING_BUF_SIZE 1024
+
+struct ustring_buffer {
+ char buffer[USTRING_BUF_SIZE];
+};
+
+static __percpu struct ustring_buffer *ustring_per_cpu;
+
+static __always_inline char *test_string(char *str)
+{
+ struct ustring_buffer *ubuf;
+ char __user *ustr;
+ char *kstr;
+
+ if (!ustring_per_cpu)
+ return NULL;
+
+ ubuf = this_cpu_ptr(ustring_per_cpu);
+ kstr = ubuf->buffer;
+
+ /*
+ * We use TASK_SIZE to denote user or kernel space, but this will
+ * not work for all architectures. If it picks the wrong one, it may
+ * just fail the filter (but will not bug).
+ *
+ * TODO: Have a way to properly denote which one this is for.
+ */
+ if (likely((unsigned long)str >= TASK_SIZE)) {
+ /* For safety, do not trust the string pointer */
+ if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
+ return NULL;
+ } else {
+ /* user space address? */
+ ustr = (char __user *)str;
+ if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
+ return NULL;
+ }
+ return kstr;
+}
+
/* Filter predicate for fixed sized arrays of characters */
static int filter_pred_string(struct filter_pred *pred, void *event)
{
@@ -671,10 +713,16 @@ static int filter_pred_string(struct filter_pred *pred, void *event)
static int filter_pred_pchar(struct filter_pred *pred, void *event)
{
char **addr = (char **)(event + pred->offset);
+ char *str;
int cmp, match;
- int len = strlen(*addr) + 1; /* including tailing '\0' */
+ int len;

- cmp = pred->regex.match(*addr, &pred->regex, len);
+ str = test_string(*addr);
+ if (!str)
+ return 0;
+
+ len = strlen(str) + 1; /* including tailing '\0' */
+ cmp = pred->regex.match(str, &pred->regex, len);

match = cmp ^ pred->not;

@@ -1348,8 +1396,17 @@ static int parse_pred(const char *str, void *data,
pred->fn = filter_pred_strloc;
} else if (field->filter_type == FILTER_RDYN_STRING)
pred->fn = filter_pred_strrelloc;
- else
+ else {
+
+ if (!ustring_per_cpu) {
+ /* Once allocated, keep it around for good */
+ ustring_per_cpu = alloc_percpu(struct ustring_buffer);
+ if (!ustring_per_cpu)
+ goto err_mem;
+ }
+
pred->fn = filter_pred_pchar;
+ }
/* go past the last quote */
i++;

@@ -1415,6 +1472,9 @@ static int parse_pred(const char *str, void *data,
err_free:
kfree(pred);
return -EINVAL;
+err_mem:
+ kfree(pred);
+ return -ENOMEM;
}

enum {
--
2.33.0


2022-01-13 22:11:45

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

From: Sven Schnelle
> Sent: 13 January 2022 21:28
>
> Hi Steve,
>
> Steven Rostedt <[email protected]> writes:
>
> > On Tue, 11 Jan 2022 21:55:53 +0100
> > Sven Schnelle <[email protected]> wrote:
> >
> >> > Isn't there also at least one architecture where you can't differentiate
> >> > between user and kernel pointers by looking at the address?
> >> > (Something like sparc ASI is used for user accesses so both user
> >> > and kernel get the full 4G address range. But it isn't sparc (or pdp/11))
> >> > ISTR it causing issues with the code for kernel_setsockopt() and
> >> > required a separate flag.
> >>
> >> On s390 TASK_SIZE is defined as -PAGE_SIZE, so with the patch above the
> >> kernel would always try to fetch it from user space. I think it would be
> >> the same for parisc.
> >
> > As a work around for these cases, would something like this work?
>
> Hmm, i don't see how. On s390, TASK_SIZE is -PAGE_SIZE, which means
> 0xfffffffffffff000 so i think the if() condition below is always true.

Isn't TASK_SIZE also dependant on current->xxxx on at least one architecture?
Possibly even x86-64.

> Too bad that the __user attribute is stripped during a normal compile.
> But couldn't we add the information whether a pointer belongs to user
> or kernel space in the trace event definition? For syscall tracing it's
> easy, because pointer types in SYSCALL_DEFINE() and friends are always
> userspace pointers?

Also, when the __user pointer is saved it MUST be checked for being
a valid user pointer (eg with access_ok(ptr, 1).

You really do need to remember whether the pointer is user or kernel
when you save it.

I also suspect that you need to check for contexts where 'current'
isn't really valid (eg any kind on interrupt) and ensure the user
reads aren't even attempted.
The excuse of being 'root' in order to request/read trace isn't
really a very good one.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2022-01-13 22:28:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Thu, 13 Jan 2022 22:11:35 +0000
David Laight <[email protected]> wrote:


> > Too bad that the __user attribute is stripped during a normal compile.
> > But couldn't we add the information whether a pointer belongs to user
> > or kernel space in the trace event definition? For syscall tracing it's
> > easy, because pointer types in SYSCALL_DEFINE() and friends are always
> > userspace pointers?
>
> Also, when the __user pointer is saved it MUST be checked for being
> a valid user pointer (eg with access_ok(ptr, 1).

It's rather hard to even know if a pointer is __user or not. It could be
some random address in any event field.


>
> You really do need to remember whether the pointer is user or kernel
> when you save it.
>
> I also suspect that you need to check for contexts where 'current'
> isn't really valid (eg any kind on interrupt) and ensure the user
> reads aren't even attempted.

It's not going to crash, even if it is required. The
strncpy_from_user/kernel_nofault() should detect any of that, right? Or are
those functions not safe to call?


> The excuse of being 'root' in order to request/read trace isn't
> really a very good one.

Not sure what you are getting at here? If you are worried about tracing
reading anything, then disable it. There's a lockdown on tracing. for those
that are worried.

Heck I could just do:

# echo 'p:random_umem __common_interrupt add=+0(@0x7f073a188000):u64' > /sys/kernel/tracing/kprobe_events
# trace-cmd start -e random_umem
# trace-cmd show
# tracer: nop
#
# entries-in-buffer/entries-written: 15/15 #P:8
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [000] d.h1. 844.051612: random_umem: (__common_interrupt+0x0/0x100) add=7291432837672293
<idle>-0 [000] d.h1. 844.051694: random_umem: (__common_interrupt+0x0/0x100) add=74591483011526501
<idle>-0 [000] d.H1. 844.051743: random_umem: (__common_interrupt+0x0/0x100) add=21474836483
<idle>-0 [000] d.h1. 844.383333: random_umem: (__common_interrupt+0x0/0x100) add=115964116992
<idle>-0 [000] d.h1. 844.383802: random_umem: (__common_interrupt+0x0/0x100) add=1125968626319387
<idle>-0 [000] d.h1. 844.383864: random_umem: (__common_interrupt+0x0/0x100) add=115964116992
<idle>-0 [000] d.h1. 844.533321: random_umem: (__common_interrupt+0x0/0x100) add=4032068056083813
<idle>-0 [000] d.h1. 844.533801: random_umem: (__common_interrupt+0x0/0x100) add=6929886302
kworker/0:1-15 [000] d.h.. 844.900412: random_umem: (__common_interrupt+0x0/0x100) add=17367183
<idle>-0 [000] d.h1. 844.933428: random_umem: (__common_interrupt+0x0/0x100) add=115964116992
<idle>-0 [000] d.h1. 844.933818: random_umem: (__common_interrupt+0x0/0x100) add=17367183
<idle>-0 [000] d.h1. 844.933955: random_umem: (__common_interrupt+0x0/0x100) add=74591483008647173
<idle>-0 [000] d.h1. 845.364181: random_umem: (__common_interrupt+0x0/0x100) add=1125968626319387
bash-1759 [000] d.h.. 845.364541: random_umem: (__common_interrupt+0x0/0x100) add=13912554418208784
bash-1759 [000] d.h.. 845.364592: random_umem: (__common_interrupt+0x0/0x100) add=18446744071906760329

And at every interrupt I get the memory at: 0x7f073a188000 for the current
running task that was running when the interrupt happened.

As I said. It's an API.

-- Steve

2022-01-14 02:15:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

On Thu, 13 Jan 2022 16:51:15 -0500
Steven Rostedt <[email protected]> wrote:

> We could add something later. As it is currently the merge window, and this
> is a real bug, I'm going to just leave it as is, and we can work to fix the
> other archs later. I need to get a pull request ready by tomorrow.

Actually I got this working, and looks like a reasonable answer to the
problem. It basically requires that the user specify that the pointer
points into user space for the kernel to read it.

Thus instead of:

echo 'filename ~ "trace"' > events/syscalls/sys_enter_openat/filter

They must now do:

echo 'filename.ustring ~ "trace"' > events/syscalls/sys_enter_openat/filter

I updated the documentation to reflect this.

-- Steve

From: Steven Rostedt <[email protected]>
Subject: [PATCH] tracing: Add ustring operation to filtering string pointers

Since referencing user space pointers is special, if the user wants to
filter on a field that is a pointer to user space, then they need to
specify it.

Add a ".ustring" attribute to the field name for filters to state that the
field is pointing to user space such that the kernel can take the
appropriate action to read that pointer.

Signed-off-by: Steven Rostedt <[email protected]>
---
Documentation/trace/events.rst | 9 ++++
kernel/trace/trace_events_filter.c | 81 +++++++++++++++++++++---------
2 files changed, 66 insertions(+), 24 deletions(-)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index 45e66a60a816..c47f381d0c00 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -198,6 +198,15 @@ The glob (~) accepts a wild card character (\*,?) and character classes
prev_comm ~ "*sh*"
prev_comm ~ "ba*sh"

+If the field is a pointer that points into user space (for example
+"filename" from sys_enter_openat), then you have to append ".ustring" to the
+field name::
+
+ filename.ustring ~ "password"
+
+As the kernel will have to know how to retrieve the memory that the pointer
+is at from user space.
+
5.2 Setting filters
-------------------

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 2e9ef64e9ee9..b458a9afa2c0 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -665,6 +665,23 @@ struct ustring_buffer {
static __percpu struct ustring_buffer *ustring_per_cpu;

static __always_inline char *test_string(char *str)
+{
+ struct ustring_buffer *ubuf;
+ char *kstr;
+
+ if (!ustring_per_cpu)
+ return NULL;
+
+ ubuf = this_cpu_ptr(ustring_per_cpu);
+ kstr = ubuf->buffer;
+
+ /* For safety, do not trust the string pointer */
+ if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
+ return NULL;
+ return kstr;
+}
+
+static __always_inline char *test_ustring(char *str)
{
struct ustring_buffer *ubuf;
char __user *ustr;
@@ -676,23 +693,11 @@ static __always_inline char *test_string(char *str)
ubuf = this_cpu_ptr(ustring_per_cpu);
kstr = ubuf->buffer;

- /*
- * We use TASK_SIZE to denote user or kernel space, but this will
- * not work for all architectures. If it picks the wrong one, it may
- * just fail the filter (but will not bug).
- *
- * TODO: Have a way to properly denote which one this is for.
- */
- if (likely((unsigned long)str >= TASK_SIZE)) {
- /* For safety, do not trust the string pointer */
- if (!strncpy_from_kernel_nofault(kstr, str, USTRING_BUF_SIZE))
- return NULL;
- } else {
- /* user space address? */
- ustr = (char __user *)str;
- if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
- return NULL;
- }
+ /* user space address? */
+ ustr = (char __user *)str;
+ if (!strncpy_from_user_nofault(kstr, ustr, USTRING_BUF_SIZE))
+ return NULL;
+
return kstr;
}

@@ -709,24 +714,42 @@ static int filter_pred_string(struct filter_pred *pred, void *event)
return match;
}

+static __always_inline int filter_pchar(struct filter_pred *pred, char *str)
+{
+ int cmp, match;
+ int len;
+
+ len = strlen(str) + 1; /* including tailing '\0' */
+ cmp = pred->regex.match(str, &pred->regex, len);
+
+ match = cmp ^ pred->not;
+
+ return match;
+}
/* Filter predicate for char * pointers */
static int filter_pred_pchar(struct filter_pred *pred, void *event)
{
char **addr = (char **)(event + pred->offset);
char *str;
- int cmp, match;
- int len;

str = test_string(*addr);
if (!str)
return 0;

- len = strlen(str) + 1; /* including tailing '\0' */
- cmp = pred->regex.match(str, &pred->regex, len);
+ return filter_pchar(pred, str);
+}

- match = cmp ^ pred->not;
+/* Filter predicate for char * pointers in user space*/
+static int filter_pred_pchar_user(struct filter_pred *pred, void *event)
+{
+ char **addr = (char **)(event + pred->offset);
+ char *str;

- return match;
+ str = test_ustring(*addr);
+ if (!str)
+ return 0;
+
+ return filter_pchar(pred, str);
}

/*
@@ -1232,6 +1255,7 @@ static int parse_pred(const char *str, void *data,
struct filter_pred *pred = NULL;
char num_buf[24]; /* Big enough to hold an address */
char *field_name;
+ bool ustring = false;
char q;
u64 val;
int len;
@@ -1266,6 +1290,12 @@ static int parse_pred(const char *str, void *data,
return -EINVAL;
}

+ /* See if the field is a user space string */
+ if ((len = str_has_prefix(str + i, ".ustring"))) {
+ ustring = true;
+ i += len;
+ }
+
while (isspace(str[i]))
i++;

@@ -1405,7 +1435,10 @@ static int parse_pred(const char *str, void *data,
goto err_mem;
}

- pred->fn = filter_pred_pchar;
+ if (ustring)
+ pred->fn = filter_pred_pchar_user;
+ else
+ pred->fn = filter_pred_pchar;
}
/* go past the last quote */
i++;
--
2.33.0


2022-01-14 07:29:32

by Sven Schnelle

[permalink] [raw]
Subject: Re: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

Steven Rostedt <[email protected]> writes:

> On Thu, 13 Jan 2022 16:51:15 -0500
> Steven Rostedt <[email protected]> wrote:
>
>> We could add something later. As it is currently the merge window, and this
>> is a real bug, I'm going to just leave it as is, and we can work to fix the
>> other archs later. I need to get a pull request ready by tomorrow.
>
> Actually I got this working, and looks like a reasonable answer to the
> problem. It basically requires that the user specify that the pointer
> points into user space for the kernel to read it.
>
> Thus instead of:
>
> echo 'filename ~ "trace"' > events/syscalls/sys_enter_openat/filter
>
> They must now do:
>
> echo 'filename.ustring ~ "trace"' > events/syscalls/sys_enter_openat/filter
>
> I updated the documentation to reflect this.

Looks good and works on s390, feel free to add my tested-by:

Tested-by: Sven Schnelle <[email protected]>

Thanks!

2022-01-14 09:35:14

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2] tracing: Add test for user space strings when filtering on string pointers

From: Steven Rostedt
> Sent: 14 January 2022 02:15
>
> On Thu, 13 Jan 2022 16:51:15 -0500
> Steven Rostedt <[email protected]> wrote:
>
> > We could add something later. As it is currently the merge window, and this
> > is a real bug, I'm going to just leave it as is, and we can work to fix the
> > other archs later. I need to get a pull request ready by tomorrow.
>
> Actually I got this working, and looks like a reasonable answer to the
> problem. It basically requires that the user specify that the pointer
> points into user space for the kernel to read it.

Certainly look better.

I'm not sure about the difference between the _inatomic() and _nocheck()
user access functions though.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)