Programs that utilize user_events today only get the event payloads via
perf or ftrace when writing event data. When BPF programs are attached
to tracepoints created by user_events the BPF programs do not get run
even though the attach succeeds. This causes confusion by the users of
the programs, as they expect the data to be available via BPF programs
they write. We have several projects that have hit this and requested
BPF program support when publishing data via user_events from their
user processes in production.
Swap out perf_trace_buf_submit() for perf_trace_run_bpf_submit() to
ensure BPF programs that are attached are run in addition to writing to
perf or ftrace buffers. This requires no changes to the BPF infrastructure
and only utilizes the GPL exported function that modules and other
components may use for the same purpose. This keeps user_events consistent
with how other kernel, modules, and probes expose tracepoint data to allow
attachment of a BPF program.
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: [email protected]
Signed-off-by: Beau Belgrave <[email protected]>
---
kernel/trace/trace_events_user.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index b1ecd7677642..838fced40a41 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1422,11 +1422,12 @@ static void user_event_ftrace(struct user_event *user, struct iov_iter *i,
static void user_event_perf(struct user_event *user, struct iov_iter *i,
void *tpdata, bool *faulted)
{
+ bool bpf_prog = bpf_prog_array_valid(&user->call);
struct hlist_head *perf_head;
perf_head = this_cpu_ptr(user->call.perf_events);
- if (perf_head && !hlist_empty(perf_head)) {
+ if (perf_head && (!hlist_empty(perf_head) || bpf_prog)) {
struct trace_entry *perf_entry;
struct pt_regs *regs;
size_t size = sizeof(*perf_entry) + i->count;
@@ -1447,9 +1448,9 @@ static void user_event_perf(struct user_event *user, struct iov_iter *i,
unlikely(user_event_validate(user, perf_entry, size)))
goto discard;
- perf_trace_buf_submit(perf_entry, size, context,
- user->call.event.type, 1, regs,
- perf_head, NULL);
+ perf_trace_run_bpf_submit(perf_entry, size, context,
+ &user->call, 1, regs,
+ perf_head, NULL);
return;
discard:
base-commit: 3862f86c1529fa0016de6344eb974877b4cd3838
--
2.25.1
On Mon, May 8, 2023 at 9:38 AM Beau Belgrave <[email protected]> wrote:
>
> Programs that utilize user_events today only get the event payloads via
> perf or ftrace when writing event data. When BPF programs are attached
> to tracepoints created by user_events the BPF programs do not get run
> even though the attach succeeds. This causes confusion by the users of
> the programs, as they expect the data to be available via BPF programs
> they write. We have several projects that have hit this and requested
> BPF program support when publishing data via user_events from their
> user processes in production.
>
> Swap out perf_trace_buf_submit() for perf_trace_run_bpf_submit() to
> ensure BPF programs that are attached are run in addition to writing to
> perf or ftrace buffers. This requires no changes to the BPF infrastructure
> and only utilizes the GPL exported function that modules and other
> components may use for the same purpose. This keeps user_events consistent
> with how other kernel, modules, and probes expose tracepoint data to allow
> attachment of a BPF program.
Sorry, I have to keep my Nack here.
I see no practical use case for bpf progs to be connected to user events.
There must be a different way to solve your user needs
and this is not bpf.
On Tue, 9 May 2023 08:24:29 -0700
Alexei Starovoitov <[email protected]> wrote:
> On Mon, May 8, 2023 at 9:38 AM Beau Belgrave <[email protected]> wrote:
> >
> > Programs that utilize user_events today only get the event payloads via
> > perf or ftrace when writing event data. When BPF programs are attached
> > to tracepoints created by user_events the BPF programs do not get run
> > even though the attach succeeds. This causes confusion by the users of
> > the programs, as they expect the data to be available via BPF programs
> > they write. We have several projects that have hit this and requested
> > BPF program support when publishing data via user_events from their
> > user processes in production.
> >
> > Swap out perf_trace_buf_submit() for perf_trace_run_bpf_submit() to
> > ensure BPF programs that are attached are run in addition to writing to
> > perf or ftrace buffers. This requires no changes to the BPF infrastructure
> > and only utilizes the GPL exported function that modules and other
> > components may use for the same purpose. This keeps user_events consistent
> > with how other kernel, modules, and probes expose tracepoint data to allow
> > attachment of a BPF program.
>
> Sorry, I have to keep my Nack here.
>
> I see no practical use case for bpf progs to be connected to user events.
That's not a technical reason. Obviously they have a use case.
This is only connecting to BPF through the API. It makes no changes to
BPF itself, so I'm not sure your NACK has jurisdiction here. Their
alternative is to to do it with an external module as the only
connections to BPF it uses is via an EXPORT_SYMBOL_GPL() function!
Again, what is your technical reason for nacking this? It's like me
nacking a user of ftrace because I don't see a use case for it. That's
not a valid reason to issue a nack.
>
> There must be a different way to solve your user needs
> and this is not bpf.
Why not use BPF?
-- Steve
On Tue, 9 May 2023 13:01:11 -0400
Steven Rostedt <[email protected]> wrote:
> > I see no practical use case for bpf progs to be connected to user events.
>
> That's not a technical reason. Obviously they have a use case.
Alexei,
It was great having a chat with you during lunch at LSFMM/BPF!
Looking forward to your technical response that I believe are
legitimate requests. I'm replying here, as during our conversation, you
had the misperception that the user events had a system call when the
event was disabled. I told you I will point out the code that shows
that the kernel sets the bit, and that user space does not do a system
call when the event is disable.
From the user space side, which does:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n60
/* Check if anyone is listening */
if (enabled) {
/* Yep, trace out our data */
writev(data_fd, (const struct iovec *)io, 2);
/* Increase the count */
count++;
printf("Something was attached, wrote data\n");
}
Where it told the kernel about that "enabled" variable:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n47
if (event_reg(data_fd, "test u32 count", &write, &enabled) == -1)
return errno;
static int event_reg(int fd, const char *command, int *write, int *enabled)
{
struct user_reg reg = {0};
reg.size = sizeof(reg);
reg.enable_bit = 31;
reg.enable_size = sizeof(*enabled);
reg.enable_addr = (__u64)enabled;
reg.name_args = (__u64)command;
if (ioctl(fd, DIAG_IOCSREG, ®) == -1)
return -1;
*write = reg.write_index;
return 0;
}
The above will add a trace event into tracefs. When someone does:
# echo 1 > /sys/kernel/tracing/user_events/test/enable
The kernel will trigger the class->reg function, defined by:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1804
user->class.reg = user_event_reg;
Which calls:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1555
update_enable_bit_for(user);
Which does:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1465
update_enable_bit_for() {
[..]
user_event_enabler_update(user);
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n451
user_event_enabler_update() {
[..]
user_event_enabler_write(mm, enabler, true, &attempt);
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n385
static int user_event_enabler_write(struct user_event_mm *mm,
struct user_event_enabler *enabler,
bool fixup_fault, int *attempt)
{
unsigned long uaddr = enabler->addr;
unsigned long *ptr;
struct page *page;
void *kaddr;
int ret;
lockdep_assert_held(&event_mutex);
mmap_assert_locked(mm->mm);
*attempt += 1;
/* Ensure MM has tasks, cannot use after exit_mm() */
if (refcount_read(&mm->tasks) == 0)
return -ENOENT;
if (unlikely(test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)) ||
test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
return -EBUSY;
ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
&page, NULL, NULL);
if (unlikely(ret <= 0)) {
if (!fixup_fault)
return -EFAULT;
if (!user_event_enabler_queue_fault(mm, enabler, *attempt))
pr_warn("user_events: Unable to queue fault handler\n");
return -EFAULT;
}
kaddr = kmap_local_page(page);
ptr = kaddr + (uaddr & ~PAGE_MASK);
/* Update bit atomically, user tracers must be atomic as well */
if (enabler->event && enabler->event->status)
set_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
else
clear_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
kunmap_local(kaddr);
unpin_user_pages_dirty_lock(&page, 1, true);
return 0;
}
The above maps the user space address and then sets the bit that was
registered.
That is, it changes "enabled" to true, and the if statement:
if (enabled) {
/* Yep, trace out our data */
writev(data_fd, (const struct iovec *)io, 2);
/* Increase the count */
count++;
printf("Something was attached, wrote data\n");
}
Is now executed.
-- Steve
On Tue, 9 May 2023 16:30:50 -0400
Steven Rostedt <[email protected]> wrote:
> >From the user space side, which does:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n60
>
> /* Check if anyone is listening */
> if (enabled) {
Hmm, looking at this deeper, we should update it to prevent the
compiler from optimizing it, and keeping "enabled" in a register. Which
would not work. Should probably add:
if (*(const volatile int *)&enabled) {
-- Steve
> /* Yep, trace out our data */
> writev(data_fd, (const struct iovec *)io, 2);
>
> /* Increase the count */
> count++;
>
> printf("Something was attached, wrote data\n");
> }
On Tue, May 09, 2023 at 04:30:50PM -0400, Steven Rostedt wrote:
> On Tue, 9 May 2023 13:01:11 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > > I see no practical use case for bpf progs to be connected to user events.
> >
> > That's not a technical reason. Obviously they have a use case.
>
> Alexei,
>
> It was great having a chat with you during lunch at LSFMM/BPF!
Yeah. It was great catching up!
> Looking forward to your technical response that I believe are
> legitimate requests. I'm replying here, as during our conversation, you
> had the misperception that the user events had a system call when the
> event was disabled. I told you I will point out the code that shows
> that the kernel sets the bit, and that user space does not do a system
> call when the event is disable.
Thank you for these details. Answer below...
> From the user space side, which does:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n60
>
> /* Check if anyone is listening */
> if (enabled) {
> /* Yep, trace out our data */
> writev(data_fd, (const struct iovec *)io, 2);
>
> /* Increase the count */
> count++;
>
> printf("Something was attached, wrote data\n");
> }
>
>
> Where it told the kernel about that "enabled" variable:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n47
>
> if (event_reg(data_fd, "test u32 count", &write, &enabled) == -1)
> return errno;
>
> static int event_reg(int fd, const char *command, int *write, int *enabled)
> {
> struct user_reg reg = {0};
>
> reg.size = sizeof(reg);
> reg.enable_bit = 31;
> reg.enable_size = sizeof(*enabled);
> reg.enable_addr = (__u64)enabled;
> reg.name_args = (__u64)command;
>
> if (ioctl(fd, DIAG_IOCSREG, ®) == -1)
> return -1;
>
> *write = reg.write_index;
>
> return 0;
> }
>
> The above will add a trace event into tracefs. When someone does:
>
> # echo 1 > /sys/kernel/tracing/user_events/test/enable
>
> The kernel will trigger the class->reg function, defined by:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1804
>
> user->class.reg = user_event_reg;
>
> Which calls:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1555
>
> update_enable_bit_for(user);
>
> Which does:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1465
>
> update_enable_bit_for() {
> [..]
> user_event_enabler_update(user);
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n451
>
> user_event_enabler_update() {
> [..]
> user_event_enabler_write(mm, enabler, true, &attempt);
Which will do
rcu_read_lock()
and then call user_event_enabler_write() under lock...
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n385
>
> static int user_event_enabler_write(struct user_event_mm *mm,
> struct user_event_enabler *enabler,
> bool fixup_fault, int *attempt)
> {
> unsigned long uaddr = enabler->addr;
> unsigned long *ptr;
> struct page *page;
> void *kaddr;
> int ret;
>
> lockdep_assert_held(&event_mutex);
> mmap_assert_locked(mm->mm);
>
> *attempt += 1;
>
> /* Ensure MM has tasks, cannot use after exit_mm() */
> if (refcount_read(&mm->tasks) == 0)
> return -ENOENT;
>
> if (unlikely(test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)) ||
> test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
> return -EBUSY;
>
> ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> &page, NULL, NULL);
... which will call pin_user_pages_remote() in RCU CS.
This looks buggy, since pin_user_pages_remote() may schedule.
> if (unlikely(ret <= 0)) {
> if (!fixup_fault)
> return -EFAULT;
>
> if (!user_event_enabler_queue_fault(mm, enabler, *attempt))
> pr_warn("user_events: Unable to queue fault handler\n");
This part looks questionable.
The only users of fixup_user_fault() were futex and KVM.
Now user_events are calling it too from user_event_mm_fault_in() where
"bool unlocked;" is uninitialized and state of this flag is not checked
after fixup_user_fault() call.
Not an MM expert, but this is suspicious.
>
> return -EFAULT;
> }
>
> kaddr = kmap_local_page(page);
> ptr = kaddr + (uaddr & ~PAGE_MASK);
>
> /* Update bit atomically, user tracers must be atomic as well */
> if (enabler->event && enabler->event->status)
> set_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
> else
> clear_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
Furthermore.
Here the kernel writes bits in user pages.
It's missing user_access_begin/end.
Early on there was an access_ok() check during user_event registration,
but it's not enough.
I believe user_access_begin() has to be done before the actual access,
since it does __uaccess_begin_nospec().
Another issue is that the user space could have supplied any address as
enabler->addr including addr in a huge page or a file backed mmaped address.
I don't know whether above code can handle it.
I'm not a GUP expert either, but direct use of pin_user_pages_remote() looks
suspicious too.
I think ptrace_may_access() is missing.
I guess it has to be a root user to do
echo 1 > /sys/kernel/tracing/user_events/test/enable
to trigger the kernel writes into various MM of user processes, but still.
There are security/LSM checks in many paths that accesses user memory.
These checks are bypassed here.
> kunmap_local(kaddr);
> unpin_user_pages_dirty_lock(&page, 1, true);
>
> return 0;
> }
>
> The above maps the user space address and then sets the bit that was
> registered.
>
> That is, it changes "enabled" to true, and the if statement:
>
> if (enabled) {
and not just 'volatile' is missing, but this is buggy in general.
The kernel only wrote one bit into 'enabled' variable.
The user space should be checking that one bit only.
Since samples/user_events/example.c registering with reg.enable_bit = 31;
it probably should be
if (READ_ONCE(enabled) & (1u << 31))
> /* Yep, trace out our data */
> writev(data_fd, (const struct iovec *)io, 2);
>
> /* Increase the count */
> count++;
>
> printf("Something was attached, wrote data\n");
Another misleading example. The writev() could have failed,
but the message will say "success".
And it's easy to make mistake here.
The iovec[0] should be write_index that was received by user space
after registration via ioctl.
If my understanding of user_events design is correct, various user
process (all running as root) will open /sys/kernel/tracing/user_events_data
then will do multiple ioctl(fd, DIAG_IOCSREG) for various events and
remember write_index-es and enabled's addresses.
Then in various places in the code they will do
if (READ_ONCE(enabled_X) & (1u << correct_bit)) {
io[0].iov_base = &write_index_X;
io[1].iov_base = data_to_send_to_kernel;
and write_index has to match with the format of data.
During the writev() the kernel will validate user_event_validate(),
but this is expensive.
The design of user events looks fragile to me. One user process can write
into user_event of another process by supplying wrong 'write_index' and the
kernel won't catch it if data formats are compatible.
All such processes have to be root to access /sys/kernel/tracing/user_events_data,
so not a security issue, but use cases for user_events seems to be very limited.
During LSFMMBPF, Steven, you've mentioned that you want to use user_event in chrome.
I think you didn't imply that chrome browser will be running as root.
You probably meant something else.
Now as far as this particular patch.
s/perf_trace_buf_submit/perf_trace_run_bpf_submit/
may look trivial, but there is a lot to unpack here.
How bpf prog was attached to user event?
What is the life time of bpf prog?
What happens when user process crashes?
What happens when user event is unregistered ?
What is bpf prog context? Like, what helpers are allowed to be called?
Does libbpf need updating?
etc etc
No selftests were provided with this patch, so impossible to answer.
In general we don't want bpf to be called in various parts of the kernel
just because bpf was used in similar parts elsewhere.
bpf needs to provide real value for a particular kernel subsystem.
For user events it's still not clear to me what bpf can bring to the table.
The commit log of this proposed patch says:
"When BPF programs are attached to tracepoints created by user_events
the BPF programs do not get run even though the attach succeeds."
It looks to me that it's a bug in attaching.
The kernel shouldn't have allowed attaching bpf prog to user events,
since they cannot be run.
Then the commit log says:
"This keeps user_events consistent
with how other kernel, modules, and probes expose tracepoint data to allow
attachment of a BPF program."
"keep consistent" is not a reason to use bpf with user_events.
Beau,
please provide a detailed explanation of your use case and how bpf helps.
Also please explain why uprobe/USDT and bpf don't achieve your goals.
Various user space applications have USDTs in them.
This is an existing mechanism that was proven to be useful to many projects
including glibc, python, mysql.
Comparing to user_events the USDTs work fine in unprivileged applications
and have zero overhead when not turned on. USDT is a single 'nop' instruction
while user events need if(enabled & bit) check plus iov prep and write.
When enabled the write() is probably faster than USDT trap, but all the extra
overhead in tracepoint and user_event_validate() probably makes it the same speed.
So why not USDT ?
On Mon, 15 May 2023 09:57:07 -0700
Alexei Starovoitov <[email protected]> wrote:
> Thank you for these details. Answer below...
Thanks for this well thought out reply!
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n451
> >
> > user_event_enabler_update() {
> > [..]
> > user_event_enabler_write(mm, enabler, true, &attempt);
>
> Which will do
> rcu_read_lock()
> and then call user_event_enabler_write() under lock...
>
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n385
> >
> > static int user_event_enabler_write(struct user_event_mm *mm,
> > struct user_event_enabler *enabler,
> > bool fixup_fault, int *attempt)
> > {
> > unsigned long uaddr = enabler->addr;
> > unsigned long *ptr;
> > struct page *page;
> > void *kaddr;
> > int ret;
> >
> > lockdep_assert_held(&event_mutex);
> > mmap_assert_locked(mm->mm);
> >
> > *attempt += 1;
> >
> > /* Ensure MM has tasks, cannot use after exit_mm() */
> > if (refcount_read(&mm->tasks) == 0)
> > return -ENOENT;
> >
> > if (unlikely(test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)) ||
> > test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
> > return -EBUSY;
> >
> > ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> > &page, NULL, NULL);
>
> ... which will call pin_user_pages_remote() in RCU CS.
> This looks buggy, since pin_user_pages_remote() may schedule.
Hmm, if that's the case, we should add might_sleep() to that call.
>
> > if (unlikely(ret <= 0)) {
> > if (!fixup_fault)
> > return -EFAULT;
> >
> > if (!user_event_enabler_queue_fault(mm, enabler, *attempt))
> > pr_warn("user_events: Unable to queue fault handler\n");
>
> This part looks questionable.
>
> The only users of fixup_user_fault() were futex and KVM.
> Now user_events are calling it too from user_event_mm_fault_in() where
> "bool unlocked;" is uninitialized and state of this flag is not checked
> after fixup_user_fault() call.
> Not an MM expert, but this is suspicious.
Hmm, yeah, this should be:
static int user_event_mm_fault_in()
{
bool unlocked = false;
[..]
out:
if (!unlocked)
mmap_read_unlock(mm->mm);
}
Good catch!
>
> >
> > return -EFAULT;
> > }
> >
> > kaddr = kmap_local_page(page);
> > ptr = kaddr + (uaddr & ~PAGE_MASK);
> >
> > /* Update bit atomically, user tracers must be atomic as well */
> > if (enabler->event && enabler->event->status)
> > set_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
> > else
> > clear_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
>
> Furthermore.
> Here the kernel writes bits in user pages.
> It's missing user_access_begin/end.
> Early on there was an access_ok() check during user_event registration,
> but it's not enough.
> I believe user_access_begin() has to be done before the actual access,
> since it does __uaccess_begin_nospec().
But it actually mapped the address to kernel. The ptr is pointing to a
kernel page, not the user space page, but the memory is shared between both.
>
> Another issue is that the user space could have supplied any address as
> enabler->addr including addr in a huge page or a file backed mmaped address.
> I don't know whether above code can handle it.
>
> I'm not a GUP expert either, but direct use of pin_user_pages_remote() looks
> suspicious too.
> I think ptrace_may_access() is missing.
> I guess it has to be a root user to do
> echo 1 > /sys/kernel/tracing/user_events/test/enable
>
> to trigger the kernel writes into various MM of user processes, but still.
> There are security/LSM checks in many paths that accesses user memory.
> These checks are bypassed here.
I'm happy to audit this further. I'll just have to add that to my TODO list
:-p
>
> > kunmap_local(kaddr);
> > unpin_user_pages_dirty_lock(&page, 1, true);
> >
> > return 0;
> > }
> >
> > The above maps the user space address and then sets the bit that was
> > registered.
> >
> > That is, it changes "enabled" to true, and the if statement:
> >
> > if (enabled) {
>
> and not just 'volatile' is missing, but this is buggy in general.
> The kernel only wrote one bit into 'enabled' variable.
> The user space should be checking that one bit only.
> Since samples/user_events/example.c registering with reg.enable_bit = 31;
> it probably should be
> if (READ_ONCE(enabled) & (1u << 31))
The other bits are actually for other tracers. Yeah, it's missing the
de-multiplexing below, and the comment should mention that.
That is, what we decided was to have the API keep bit 31 for the kernel,
but other tracers could map other bits, and we would have the tracing logic
in a place that would allow something like LTTng hook into it and call its
code. Say LTTng is bit 1, then it would set it when it wants a trace.
The if statement is still correct, but the calling into the kernel should
only be done if bit 31 is set.
>
> > /* Yep, trace out our data */
> > writev(data_fd, (const struct iovec *)io, 2);
> >
> > /* Increase the count */
> > count++;
> >
> > printf("Something was attached, wrote data\n");
>
> Another misleading example. The writev() could have failed,
> but the message will say "success".
> And it's easy to make mistake here.
> The iovec[0] should be write_index that was received by user space
> after registration via ioctl.
Yeah, that should be cleaned up.
>
> If my understanding of user_events design is correct, various user
> process (all running as root) will open /sys/kernel/tracing/user_events_data
Actually, we can change the permissions of user_events_data to allow any
task. Or set the group permission and only allow certain groups access.
tracefs allows changing of ownerships of the files.
> then will do multiple ioctl(fd, DIAG_IOCSREG) for various events and
> remember write_index-es and enabled's addresses.
> Then in various places in the code they will do
> if (READ_ONCE(enabled_X) & (1u << correct_bit)) {
> io[0].iov_base = &write_index_X;
> io[1].iov_base = data_to_send_to_kernel;
>
> and write_index has to match with the format of data.
> During the writev() the kernel will validate user_event_validate(),
> but this is expensive.
> The design of user events looks fragile to me. One user process can write
> into user_event of another process by supplying wrong 'write_index' and the
> kernel won't catch it if data formats are compatible.
But the kernel tracing also includes the pid, so filtering or analysis
could catch that as well.
>
> All such processes have to be root to access /sys/kernel/tracing/user_events_data,
> so not a security issue, but use cases for user_events seems to be very limited.
> During LSFMMBPF, Steven, you've mentioned that you want to use user_event in chrome.
> I think you didn't imply that chrome browser will be running as root.
> You probably meant something else.
Again, it is easy to change ownership permissions of that file. We can make
allow the chrome group to have write access to it, and everything still
"just works".
>
> Now as far as this particular patch.
>
> s/perf_trace_buf_submit/perf_trace_run_bpf_submit/
>
> may look trivial, but there is a lot to unpack here.
>
> How bpf prog was attached to user event?
> What is the life time of bpf prog?
> What happens when user process crashes?
> What happens when user event is unregistered ?
> What is bpf prog context? Like, what helpers are allowed to be called?
> Does libbpf need updating?
> etc etc
>
> No selftests were provided with this patch, so impossible to answer.
>
> In general we don't want bpf to be called in various parts of the kernel
> just because bpf was used in similar parts elsewhere.
> bpf needs to provide real value for a particular kernel subsystem.
>
> For user events it's still not clear to me what bpf can bring to the table.
>
> The commit log of this proposed patch says:
> "When BPF programs are attached to tracepoints created by user_events
> the BPF programs do not get run even though the attach succeeds."
>
> It looks to me that it's a bug in attaching.
> The kernel shouldn't have allowed attaching bpf prog to user events,
> since they cannot be run.
>
> Then the commit log says:
> "This keeps user_events consistent
> with how other kernel, modules, and probes expose tracepoint data to allow
> attachment of a BPF program."
>
> "keep consistent" is not a reason to use bpf with user_events.
Thank you Alexei for asking these. The above are all valid concerns.
-- Steve
>
> Beau,
> please provide a detailed explanation of your use case and how bpf helps.
>
> Also please explain why uprobe/USDT and bpf don't achieve your goals.
> Various user space applications have USDTs in them.
> This is an existing mechanism that was proven to be useful to many projects
> including glibc, python, mysql.
>
> Comparing to user_events the USDTs work fine in unprivileged applications
> and have zero overhead when not turned on. USDT is a single 'nop' instruction
> while user events need if(enabled & bit) check plus iov prep and write.
>
> When enabled the write() is probably faster than USDT trap, but all the extra
> overhead in tracepoint and user_event_validate() probably makes it the same speed.
> So why not USDT ?
On Mon, May 15, 2023 at 09:57:07AM -0700, Alexei Starovoitov wrote:
> On Tue, May 09, 2023 at 04:30:50PM -0400, Steven Rostedt wrote:
> > On Tue, 9 May 2023 13:01:11 -0400
> > Steven Rostedt <[email protected]> wrote:
> >
> > > > I see no practical use case for bpf progs to be connected to user events.
> > >
> > > That's not a technical reason. Obviously they have a use case.
> >
> > Alexei,
> >
> > It was great having a chat with you during lunch at LSFMM/BPF!
>
> Yeah. It was great catching up!
>
> > Looking forward to your technical response that I believe are
> > legitimate requests. I'm replying here, as during our conversation, you
> > had the misperception that the user events had a system call when the
> > event was disabled. I told you I will point out the code that shows
> > that the kernel sets the bit, and that user space does not do a system
> > call when the event is disable.
>
> Thank you for these details. Answer below...
>
> > From the user space side, which does:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n60
> >
> > /* Check if anyone is listening */
> > if (enabled) {
> > /* Yep, trace out our data */
> > writev(data_fd, (const struct iovec *)io, 2);
> >
> > /* Increase the count */
> > count++;
> >
> > printf("Something was attached, wrote data\n");
> > }
> >
> >
> > Where it told the kernel about that "enabled" variable:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n47
> >
> > if (event_reg(data_fd, "test u32 count", &write, &enabled) == -1)
> > return errno;
> >
> > static int event_reg(int fd, const char *command, int *write, int *enabled)
> > {
> > struct user_reg reg = {0};
> >
> > reg.size = sizeof(reg);
> > reg.enable_bit = 31;
> > reg.enable_size = sizeof(*enabled);
> > reg.enable_addr = (__u64)enabled;
> > reg.name_args = (__u64)command;
> >
> > if (ioctl(fd, DIAG_IOCSREG, ®) == -1)
> > return -1;
> >
> > *write = reg.write_index;
> >
> > return 0;
> > }
> >
> > The above will add a trace event into tracefs. When someone does:
> >
> > # echo 1 > /sys/kernel/tracing/user_events/test/enable
> >
> > The kernel will trigger the class->reg function, defined by:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1804
> >
> > user->class.reg = user_event_reg;
> >
> > Which calls:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1555
> >
> > update_enable_bit_for(user);
> >
> > Which does:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1465
> >
> > update_enable_bit_for() {
> > [..]
> > user_event_enabler_update(user);
> >
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n451
> >
> > user_event_enabler_update() {
> > [..]
> > user_event_enabler_write(mm, enabler, true, &attempt);
>
> Which will do
> rcu_read_lock()
> and then call user_event_enabler_write() under lock...
>
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n385
> >
> > static int user_event_enabler_write(struct user_event_mm *mm,
> > struct user_event_enabler *enabler,
> > bool fixup_fault, int *attempt)
> > {
> > unsigned long uaddr = enabler->addr;
> > unsigned long *ptr;
> > struct page *page;
> > void *kaddr;
> > int ret;
> >
> > lockdep_assert_held(&event_mutex);
> > mmap_assert_locked(mm->mm);
> >
> > *attempt += 1;
> >
> > /* Ensure MM has tasks, cannot use after exit_mm() */
> > if (refcount_read(&mm->tasks) == 0)
> > return -ENOENT;
> >
> > if (unlikely(test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)) ||
> > test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
> > return -EBUSY;
> >
> > ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> > &page, NULL, NULL);
>
> ... which will call pin_user_pages_remote() in RCU CS.
> This looks buggy, since pin_user_pages_remote() may schedule.
>
If it's possible to schedule, I can change this to cache the probe
callbacks under RCU then drop it. However, when would
pin_user_pages_remote() schedule with FOLL_NOFAULT? I couldn't pick up
where it might schedule?
> > if (unlikely(ret <= 0)) {
> > if (!fixup_fault)
> > return -EFAULT;
> >
> > if (!user_event_enabler_queue_fault(mm, enabler, *attempt))
> > pr_warn("user_events: Unable to queue fault handler\n");
>
> This part looks questionable.
>
> The only users of fixup_user_fault() were futex and KVM.
> Now user_events are calling it too from user_event_mm_fault_in() where
> "bool unlocked;" is uninitialized and state of this flag is not checked
> after fixup_user_fault() call.
> Not an MM expert, but this is suspicious.
>
fixup_user_fault() seems like a good function to use for this, especially
with with a pointer to unlocked passed (handles killable process and
does an internal retry). I understand the unlocked value states the
mmap_read_lock() that was taken was unlocked then reacquired. This does
not seem like an issue for our scenario. Would love someone from MM to
chime in (these patches were sent out to MM for comments, but didn't get
any).
> >
> > return -EFAULT;
> > }
> >
> > kaddr = kmap_local_page(page);
> > ptr = kaddr + (uaddr & ~PAGE_MASK);
> >
> > /* Update bit atomically, user tracers must be atomic as well */
> > if (enabler->event && enabler->event->status)
> > set_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
> > else
> > clear_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
>
> Furthermore.
> Here the kernel writes bits in user pages.
> It's missing user_access_begin/end.
> Early on there was an access_ok() check during user_event registration,
> but it's not enough.
> I believe user_access_begin() has to be done before the actual access,
> since it does __uaccess_begin_nospec().
>
> Another issue is that the user space could have supplied any address as
> enabler->addr including addr in a huge page or a file backed mmaped address.
> I don't know whether above code can handle it.
>
> I'm not a GUP expert either, but direct use of pin_user_pages_remote() looks
> suspicious too.
> I think ptrace_may_access() is missing.
> I guess it has to be a root user to do
> echo 1 > /sys/kernel/tracing/user_events/test/enable
>
> to trigger the kernel writes into various MM of user processes, but still.
> There are security/LSM checks in many paths that accesses user memory.
> These checks are bypassed here.
>
I don't see uprobes using ptrace_may_access() either, it replaces a page
under the same situation. I modelled a lot of this based on what both
futex and uprobes do for this.
If I missed something, I'm happy to add it. My understanding is that
tracefs is acting as the security boundary here.
> > kunmap_local(kaddr);
> > unpin_user_pages_dirty_lock(&page, 1, true);
> >
> > return 0;
> > }
> >
> > The above maps the user space address and then sets the bit that was
> > registered.
> >
> > That is, it changes "enabled" to true, and the if statement:
> >
> > if (enabled) {
>
> and not just 'volatile' is missing, but this is buggy in general.
> The kernel only wrote one bit into 'enabled' variable.
> The user space should be checking that one bit only.
> Since samples/user_events/example.c registering with reg.enable_bit = 31;
> it probably should be
> if (READ_ONCE(enabled) & (1u << 31))
>
> > /* Yep, trace out our data */
> > writev(data_fd, (const struct iovec *)io, 2);
> >
> > /* Increase the count */
> > count++;
> >
> > printf("Something was attached, wrote data\n");
>
> Another misleading example. The writev() could have failed,
> but the message will say "success".
> And it's easy to make mistake here.
> The iovec[0] should be write_index that was received by user space
> after registration via ioctl.
>
Yes, it's easy to make mistakes when using the ABI directly. We have a
library [1] to help with this, and libside [2] will also help here.
> If my understanding of user_events design is correct, various user
> process (all running as root) will open /sys/kernel/tracing/user_events_data
> then will do multiple ioctl(fd, DIAG_IOCSREG) for various events and
> remember write_index-es and enabled's addresses.
> Then in various places in the code they will do
> if (READ_ONCE(enabled_X) & (1u << correct_bit)) {
> io[0].iov_base = &write_index_X;
> io[1].iov_base = data_to_send_to_kernel;
>
> and write_index has to match with the format of data.
> During the writev() the kernel will validate user_event_validate(),
> but this is expensive.
Validation is only done for certain types of data, and is required to
ensure the kernel doesn't do bad things when filtering the data.
> The design of user events looks fragile to me. One user process can write
> into user_event of another process by supplying wrong 'write_index' and the
> kernel won't catch it if data formats are compatible.
>
The write_index is a per-process, per-fd index, you cannot do what you state
unless a process shares it's internal FD and goes out of it's way to
achieve that.
Even then the validators WILL catch if any data gets put in that will
try to perform bad accesses, etc. Currently the only check is for
__rel_loc / __data_loc (variable sized payloads) that are of type char.
In those cases we ensure the end of the buffer is 0 to ensure string
compares are null terminated.
> All such processes have to be root to access /sys/kernel/tracing/user_events_data,
> so not a security issue, but use cases for user_events seems to be very limited.
> During LSFMMBPF, Steven, you've mentioned that you want to use user_event in chrome.
> I think you didn't imply that chrome browser will be running as root.
> You probably meant something else.
>
> Now as far as this particular patch.
>
Thanks for the time you took to look at the above code outside this
patch.
> s/perf_trace_buf_submit/perf_trace_run_bpf_submit/
>
> may look trivial, but there is a lot to unpack here.
>
> How bpf prog was attached to user event?
> What is the life time of bpf prog?
> What happens when user process crashes?
> What happens when user event is unregistered ?
> What is bpf prog context? Like, what helpers are allowed to be called?
> Does libbpf need updating?
> etc etc
>
> No selftests were provided with this patch, so impossible to answer.
>
I thought it being a GPL export symbol that this kind of stuff would be
documented somewhere if there are requirements to use the method. As it
stands in the patch, the data that is sent to BPF is from the buffer
returned from perf_trace_buf_alloc() after it has been copied from the
user process.
If the process crashes, that shouldn't affect the actual data. The
tracepoint remains even upon a crash. If you try to unregister the
tracepoint while BPF is attached, it is prevented, as the tracepoints
are ref-counted and cannot be unregistered if anything is using it
(user processes, ftrace, perf/bpf).
We have been using libbpf to attach and monitor user_events with this
patch and haven't hit issues for what we plan to use it for (decode
the payload, aggregate, and track what's happening per-TID/PID). The
data we care about is already in kernel memory via the perf trace
buffer.
> In general we don't want bpf to be called in various parts of the kernel
> just because bpf was used in similar parts elsewhere.
> bpf needs to provide real value for a particular kernel subsystem.
>
For sure. I've had a lot of requests within Microsoft to wire up BPF to
user_events which prompted this patch. I've been in a few conversations
where we start talking about perf_event buffers and teams stop and ask
why it cannot go to BPF directly.
> For user events it's still not clear to me what bpf can bring to the table.
>
> The commit log of this proposed patch says:
> "When BPF programs are attached to tracepoints created by user_events
> the BPF programs do not get run even though the attach succeeds."
>
> It looks to me that it's a bug in attaching.
> The kernel shouldn't have allowed attaching bpf prog to user events,
> since they cannot be run.
>
> Then the commit log says:
> "This keeps user_events consistent
> with how other kernel, modules, and probes expose tracepoint data to allow
> attachment of a BPF program."
>
> "keep consistent" is not a reason to use bpf with user_events.
>
Yeah, keep consistent was more about using the GPL export symbol, which
the kernel tracepoints currently utilize. I wanted to avoid any special
casing BPF needed to add for user_events, and I also expect users would
like one way to write a BPF program for tracepoints/trace_events even
if they are from user processes vs kernel.
> Beau,
> please provide a detailed explanation of your use case and how bpf helps.
>
There are teams that have existing BPF programs that want to also pull
in data from user processes in addition to the data they already collect
from the kernel.
We are also seeing a trend of teams wanting to drop buffering approaches
and move into non-buffered analysis of problems. An example is as soon
as a fault happens in a user-process, they would like the ability to see
what that thread has done, what the kernel did a bit before the error
(or other processes that have swapped in, etc).
We also have needs to aggregate operation duration live, and as soon as
they deviate, trigger corrective actions. BPF is ideal for us to use for
aggregating data cheaply, comparing that to other kernel and user
processes, and then making a decision quickly on how to mitigate or flag
it. We are working with OpenTelemetry teams to make this work via
certain exporters in various languages (C#/C++/Rust).
> Also please explain why uprobe/USDT and bpf don't achieve your goals.
> Various user space applications have USDTs in them.
> This is an existing mechanism that was proven to be useful to many projects
> including glibc, python, mysql.
>
> Comparing to user_events the USDTs work fine in unprivileged applications
> and have zero overhead when not turned on. USDT is a single 'nop' instruction
> while user events need if(enabled & bit) check plus iov prep and write.
>
> When enabled the write() is probably faster than USDT trap, but all the extra
> overhead in tracepoint and user_event_validate() probably makes it the same speed.
> So why not USDT ?
User_events is being used in several production environments. These
environments are heavily locked down for security and have several
policies that prevent any modifications to executable pages (SELinux,
IPE, etc). There are also other scenarios, such as runtime integrity
checked processes, where the running code pages are periodically
compared to known hashes that can be affected by patching as well.
We also have to support languages, like C# and Java, which don't have a
stable location to apply a nop/int 3 patch, even when environments
allow patching.
We've used branch + syscall approaches in Windows for a long time and
have found them to work well in these locked down environments as well
as for JIT'd languages like C#.
In our usages of user_events, this has worked well for us, we imagine it
will work well for other folks in the same situation(s). Some folks
might not want to use it, that's fine, it's an optional KConfig with
default N.
Thanks,
-Beau
[1] https://github.com/microsoft/LinuxTracepoints
[2] https://github.com/compudj/libside
On Mon, May 15, 2023 at 02:33:05PM -0400, Steven Rostedt wrote:
> On Mon, 15 May 2023 09:57:07 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
> > Thank you for these details. Answer below...
>
> Thanks for this well thought out reply!
>
>
[...]
> >
> > > if (unlikely(ret <= 0)) {
> > > if (!fixup_fault)
> > > return -EFAULT;
> > >
> > > if (!user_event_enabler_queue_fault(mm, enabler, *attempt))
> > > pr_warn("user_events: Unable to queue fault handler\n");
> >
> > This part looks questionable.
> >
> > The only users of fixup_user_fault() were futex and KVM.
> > Now user_events are calling it too from user_event_mm_fault_in() where
> > "bool unlocked;" is uninitialized and state of this flag is not checked
> > after fixup_user_fault() call.
> > Not an MM expert, but this is suspicious.
>
> Hmm, yeah, this should be:
>
> static int user_event_mm_fault_in()
> {
> bool unlocked = false;
>
> [..]
>
> out:
> if (!unlocked)
> mmap_read_unlock(mm->mm);
> }
>
> Good catch!
>
I don't believe that's correct. fixup_user_fault() re-acquires the
mmap lock, and when it does, it lets you know via unlocked getting set
to true. IE: Something COULD have changed in the mmap during this call,
but the lock is still held.
See comments here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c#n1287
Thanks,
-Beau
>
> Thank you Alexei for asking these. The above are all valid concerns.
>
> -- Steve
>
On Mon, 15 May 2023 12:35:32 -0700
Beau Belgrave <[email protected]> wrote:
> > static int user_event_mm_fault_in()
> > {
> > bool unlocked = false;
> >
> > [..]
> >
> > out:
> > if (!unlocked)
> > mmap_read_unlock(mm->mm);
> > }
> >
> > Good catch!
> >
>
> I don't believe that's correct. fixup_user_fault() re-acquires the
> mmap lock, and when it does, it lets you know via unlocked getting set
> to true. IE: Something COULD have changed in the mmap during this call,
> but the lock is still held.
>
> See comments here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c#n1287
Ah you're right. I thought it said "mmap_read_unlock()" before returning,
but it's actually doing a "mmap_read_lock()". Note, I just got back home
yesterday, so I blame jetlag ;-)
OK, this is good as-is.
The "unlocked" is only if you want to know if the mm_read_lock() was
unlocked in the function. You need to set it if you want it to retry, but
you don't need to initialize it if you don't care if it was released.
Probably could use a comment there.
-- Steve
On Mon, 15 May 2023 12:24:07 -0700
Beau Belgrave <[email protected]> wrote:
> > Beau,
> > please provide a detailed explanation of your use case and how bpf helps.
> >
>
> There are teams that have existing BPF programs that want to also pull
> in data from user processes in addition to the data they already collect
> from the kernel.
>
> We are also seeing a trend of teams wanting to drop buffering approaches
> and move into non-buffered analysis of problems. An example is as soon
> as a fault happens in a user-process, they would like the ability to see
> what that thread has done, what the kernel did a bit before the error
> (or other processes that have swapped in, etc).
>
> We also have needs to aggregate operation duration live, and as soon as
> they deviate, trigger corrective actions. BPF is ideal for us to use for
> aggregating data cheaply, comparing that to other kernel and user
> processes, and then making a decision quickly on how to mitigate or flag
> it. We are working with OpenTelemetry teams to make this work via
> certain exporters in various languages (C#/C++/Rust).
This is turning into a very productive discussion. Thank you Alexei and
Beau for this.
Beau,
Could you possibly also add (in a separate patch), a simple use case of a
BPF program that would be attached to some user event. Could be contrived.
Perhaps supply a patch to ls.c[1] that adds a user event to where it reads a
file type and the bpf program can do something special if the file belongs
to the user. OK, I'm just pulling crazy ideas out of thin air!
[1] https://github.com/coreutils/coreutils/blob/master/src/ls.c
Could copy the ls with the user event to the samples directory for user
events. It is GPL.
-- Steve
On Mon, May 15, 2023 at 12:24:07PM -0700, Beau Belgrave wrote:
> > >
> > > ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> > > &page, NULL, NULL);
> >
> > ... which will call pin_user_pages_remote() in RCU CS.
> > This looks buggy, since pin_user_pages_remote() may schedule.
> >
>
> If it's possible to schedule, I can change this to cache the probe
> callbacks under RCU then drop it. However, when would
> pin_user_pages_remote() schedule with FOLL_NOFAULT?
Are you saying that passing FOLL_NOFAULT makes it work in atomic context?
Is this documented anywhere?
> I couldn't pick up
> where it might schedule?
I think I see plenty of rw_semaphore access in the guts of GUP.
Have you tested user events with CONFIG_DEBUG_ATOMIC_SLEEP?
>
> I don't see uprobes using ptrace_may_access() either, it replaces a page
> under the same situation. I modelled a lot of this based on what both
> futex and uprobes do for this.
>
> If I missed something, I'm happy to add it. My understanding is that
> tracefs is acting as the security boundary here.
security boundary? but..
> Yes, it's easy to make mistakes when using the ABI directly. We have a
> library [1] to help with this,
quoting [1]
> [1] https://github.com/microsoft/LinuxTracepoints
"
The user that will generate events must have x access to the tracing directory, e.g. chmod a+x /sys/kernel/tracing
The user that will generate events must have rw access to the tracing/user_events_data file, e.g. chmod a+rw /sys/kernel/tracing/user_events_data
"
So any unpriv user can create and operate user events.
Including seeing and enabling other user's user_events with 'ls/echo/cat' in tracefs.
Looks like user events were designed with intention to be unprivileged.
When I looked at kernel/trace/trace_events_user.c I assumed root.
I doubt other people reviewed it from security perspective.
Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
For example, I think the following is possible:
fd = open("/sys/kernel/tracing/user_events_data")
ioclt(fd, DIAG_IOCSDEL)
user_events_ioctl_del
delete_user_event(info->group, name);
'info' is different for every FD, but info->group is the same for all users/processes/fds,
because only one global init_group is created.
So one user can unregister other user event by knowing 'name'.
A security hole, no?
> and libside [2] will also help here.
> [2] https://github.com/compudj/libside
That's an interesting project. It doesn't do any user_events access afaict,
but the concept is nice.
Looks like libside will work without user events just fine.
It's a pure user to user tracing framework similar to lttng.
Why microsoft cannot use just libside without user_events?
> > The design of user events looks fragile to me. One user process can write
> > into user_event of another process by supplying wrong 'write_index' and the
> > kernel won't catch it if data formats are compatible.
> >
>
> The write_index is a per-process, per-fd index, you cannot do what you state
> unless a process shares it's internal FD and goes out of it's way to
> achieve that.
See it now.
struct user_event_file_info is indeed per-file/per-FD.
write_index is isolated enough.
> > s/perf_trace_buf_submit/perf_trace_run_bpf_submit/
> >
> > may look trivial, but there is a lot to unpack here.
> >
> > How bpf prog was attached to user event?
> > What is the life time of bpf prog?
> > What happens when user process crashes?
> > What happens when user event is unregistered ?
> > What is bpf prog context? Like, what helpers are allowed to be called?
> > Does libbpf need updating?
> > etc etc
> >
> > No selftests were provided with this patch, so impossible to answer.
> >
>
> I thought it being a GPL export symbol that this kind of stuff would be
> documented somewhere if there are requirements to use the method. As it
EXPORT_SYMBOL_GPL(perf_trace_run_bpf_submit);
does not mean that any arbitrary code in the kernel or GPL-ed module
is free to call it whichever way they like.
It's an export symbol only because modules expose tracepoints.
It's an implementation detail of DECLARE_EVENT_CLASS macro and
can change at any time including removal of export symbol.
> stands in the patch, the data that is sent to BPF is from the buffer
> returned from perf_trace_buf_alloc() after it has been copied from the
> user process.
>
> If the process crashes, that shouldn't affect the actual data. The
> tracepoint remains even upon a crash. If you try to unregister the
> tracepoint while BPF is attached, it is prevented, as the tracepoints
> are ref-counted and cannot be unregistered if anything is using it
> (user processes, ftrace, perf/bpf).
>
> We have been using libbpf to attach and monitor user_events with this
> patch and haven't hit issues for what we plan to use it for (decode
> the payload, aggregate, and track what's happening per-TID/PID). The
> data we care about is already in kernel memory via the perf trace
> buffer.
What bpf prog type do you use? How does libbpf attach it?
You have to provide a patch for selftest/bpf/ for us to meaningfully review it.
>
> > In general we don't want bpf to be called in various parts of the kernel
> > just because bpf was used in similar parts elsewhere.
> > bpf needs to provide real value for a particular kernel subsystem.
> >
>
> For sure. I've had a lot of requests within Microsoft to wire up BPF to
> user_events which prompted this patch. I've been in a few conversations
> where we start talking about perf_event buffers and teams stop and ask
> why it cannot go to BPF directly.
So you need perf_event buffers or ftrace ring buffer (aka trace_pipe) ?
Which one do you want to use ?
> Yeah, keep consistent was more about using the GPL export symbol, which
> the kernel tracepoints currently utilize. I wanted to avoid any special
> casing BPF needed to add for user_events, and I also expect users would
> like one way to write a BPF program for tracepoints/trace_events even
> if they are from user processes vs kernel.
BPF progs have three ways to access kernel tracepoints:
1. traditional tracepoint
2. raw tracepoint
3. raw tracepoint with BTF
1 was added first and now rarely used (only by old tools), since it's slow.
2 was added later to address performance concerns.
3 was added after BTF was introduced to provide accurate types.
3 is the only one that bpf community recommends and is the one that is used most often.
As far as I know trace_events were never connected to bpf.
Unless somebody sneaked the code in without us seeing it.
I think you're trying to model user_events+bpf as 1.
Which means that you'll be repeating the same mistakes.
>
> > Beau,
> > please provide a detailed explanation of your use case and how bpf helps.
> >
>
> There are teams that have existing BPF programs that want to also pull
> in data from user processes in addition to the data they already collect
> from the kernel.
>
> We are also seeing a trend of teams wanting to drop buffering approaches
> and move into non-buffered analysis of problems. An example is as soon
> as a fault happens in a user-process, they would like the ability to see
> what that thread has done, what the kernel did a bit before the error
> (or other processes that have swapped in, etc).
Sounds like bpf prog would need to access user memory.
What we've learned the hard way that you cannot do it cleanly from the kernel
tracepoint/trace_event/perf_event (and user_event in your case).
The only clean way to do it is from uprobe where it's user context and it is
sleepable and fault-able. That's why we've added 'sleepable bpf uprobes'.
Just going with perf_trace_run_bpf_submit() you'll only have 'best effort' access
to user data. Not recommended.
> We also have needs to aggregate operation duration live, and as soon as
> they deviate, trigger corrective actions. BPF is ideal for us to use for
> aggregating data cheaply, comparing that to other kernel and user
> processes, and then making a decision quickly on how to mitigate or flag
> it. We are working with OpenTelemetry teams to make this work via
> certain exporters in various languages (C#/C++/Rust).
Using bpf for in-kernel aggregation makes sense, of course.
But if you have to instrument your user processes why cannot you do
libside/lttng style aggregation and instrumentation?
You don't need the kernel to be in the path.
User to user can do it faster than going into the kernel.
> > Also please explain why uprobe/USDT and bpf don't achieve your goals.
> > Various user space applications have USDTs in them.
> > This is an existing mechanism that was proven to be useful to many projects
> > including glibc, python, mysql.
> >
> > Comparing to user_events the USDTs work fine in unprivileged applications
> > and have zero overhead when not turned on. USDT is a single 'nop' instruction
> > while user events need if(enabled & bit) check plus iov prep and write.
> >
> > When enabled the write() is probably faster than USDT trap, but all the extra
> > overhead in tracepoint and user_event_validate() probably makes it the same speed.
> > So why not USDT ?
>
> User_events is being used in several production environments. These
> environments are heavily locked down for security and have several
> policies that prevent any modifications to executable pages (SELinux,
> IPE, etc). There are also other scenarios, such as runtime integrity
> checked processes, where the running code pages are periodically
> compared to known hashes that can be affected by patching as well.
"IPE" means microsoft's our-of-tree LSM ?
https://microsoft.github.io/ipe/
Quoting above:
"IPE cannot verify the integrity of anonymous executable memory,
such as the trampolines created by gcc closures and libffi, or JIT'd code"
I think C# does code modifications, so I don't believe your environment
means 'no code mods ever'.
afaik uprobe doesn't affect integrity. uprobe's nop->int3 rewrite doesn't trip IMA.
> We also have to support languages, like C# and Java, which don't have a
> stable location to apply a nop/int 3 patch, even when environments
> allow patching.
That's a fair point.
There were "dynamic USDT" and "Java Statically Defined Tracing probes (JSDT)"
in the past, but I don't know whether anyone ported them to Linux.
> We've used branch + syscall approaches in Windows for a long time and
> have found them to work well in these locked down environments as well
> as for JIT'd languages like C#.
Ok. Looks like we've got to the main reason for user_events.
Re-phrasing above statement. User_events-like facility existed in Windows
and we've decided to implement the same in Linux to have common framework
to monitor applications in both OSes.
Are you planning to extend bpf-for-windows to attach to window's equivalent
of user_events ?
If so, we can allow bpf progs to be attached to user_events in Linux.
Please send a proper patch with [PATCH bpf-next] subject targeting bpf-next
with selftest and clear explanation of the true reason.
Also think hard whether repeating our prior tracepoint+bpf mistakes is
really want you want to do with user_events+bpf.
On Tue, May 16, 2023 at 5:36 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Mon, May 15, 2023 at 12:24:07PM -0700, Beau Belgrave wrote:
> > > >
> > > > ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> > > > &page, NULL, NULL);
> > >
> > > ... which will call pin_user_pages_remote() in RCU CS.
> > > This looks buggy, since pin_user_pages_remote() may schedule.
> > >
> >
> > If it's possible to schedule, I can change this to cache the probe
> > callbacks under RCU then drop it. However, when would
> > pin_user_pages_remote() schedule with FOLL_NOFAULT?
>
> Are you saying that passing FOLL_NOFAULT makes it work in atomic context?
Absolutely not.
It may not fault missing pages in, but that does *not* make it atomic.
That code depends on all the usual MM locking, and it does not work at
all in the same way that "pagefault_disable()" does, for example. That
will fail on any fault and never take locks, and is designed to work
in atomic contexts. Very different.
So no, don't think you can call pin_user_pages_remote() or any other
GUP function from atomic context.
We do have "get_user_page[s]_fast_only()" and that is the only version
of GUP that is actually lock-free.
Also, just FYI, those special gup_user*fast_only()" functions simply
will not work on some architectures at all.
Linus
On Tue, 16 May 2023 17:36:28 -0700
Alexei Starovoitov <[email protected]> wrote:
> "
> The user that will generate events must have x access to the tracing directory, e.g. chmod a+x /sys/kernel/tracing
> The user that will generate events must have rw access to the tracing/user_events_data file, e.g. chmod a+rw /sys/kernel/tracing/user_events_data
> "
> So any unpriv user can create and operate user events.
> Including seeing and enabling other user's user_events with 'ls/echo/cat' in tracefs.
It can see user_events_data, but x only gives you access into the directory.
It does not get you the contents of the files within the directory. The
above only gives access to the user_events_data. Which is to create events.
I recommended using groups and not giving access to all tasks.
>
> Looks like user events were designed with intention to be unprivileged.
> When I looked at kernel/trace/trace_events_user.c I assumed root.
> I doubt other people reviewed it from security perspective.
>
> Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
>
> For example, I think the following is possible:
> fd = open("/sys/kernel/tracing/user_events_data")
> ioclt(fd, DIAG_IOCSDEL)
> user_events_ioctl_del
> delete_user_event(info->group, name);
>
> 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> because only one global init_group is created.
> So one user can unregister other user event by knowing 'name'.
> A security hole, no?
>
> > and libside [2] will also help here.
>
> > [2] https://github.com/compudj/libside
>
> That's an interesting project. It doesn't do any user_events access afaict,
I'll let Beau answer the rest.
-- Steve
On Tue, May 16, 2023 at 5:56 PM Linus Torvalds
<[email protected]> wrote:
>
> That code depends on all the usual MM locking, and it does not work at
> all in the same way that "pagefault_disable()" does, for example.
Note that in interrupt context, the page table locking would also
deadlock, so even though it's using spinlocks and that _could_ be
atomic, it really isn't usable in interrupt (much less NMI) context.
And even if you're in process context, but atomic due to other reasons
(ie spinlocks or RCU), while the page table locking would be ok, the
mm locking isn't.
So FOLL_NOFAULT really is about avoiding faulting in any new pages,
and it's kind of like "GFP_NOIO" in that respect: it's meant for
filesystems etc that can not afford to fault, because they may hold
locks, and a page fault could then recurse back into the filesystem
and deadlock.
It's not about atomicity or anything like that.
Similarly, FOLL_NOWAIT is absolutely not about that - it will actually
start to fault things in, it just won't then wait for the IO to
complete (so think "don't wait for IO" rather than any "avoid
locking").
Anyway, the end result is the one I already mentioned: only
"get_user_page[s]_fast_only()" is about being usable in atomic
context, and that only works on the *current* process.
That one really is designed to be kind of like "pagefault_disable()",
except instead of fetching user data, it fetches the "reference" to
the user datat (ie the pointers to the underlying pages).
In fact, the very reason that *one* GUP function works in atomic
context is pretty much the exact same reason that you can try to fault
in user pages under pagefault_disable(): it doesn't actually use any
of the generic VM data structures, and accesses the page tables
directly. Exactly like the hardware does.
So don't even ask for other GUP functionality, much less the "remote"
kind. Not going to happen. If you think you need access to remote
process memory, you had better do it in process context, or you had
better just think again.
Linus
On Tue, 16 May 2023 18:46:29 -0700
Linus Torvalds <[email protected]> wrote:
> So don't even ask for other GUP functionality, much less the "remote"
> kind. Not going to happen. If you think you need access to remote
> process memory, you had better do it in process context, or you had
> better just think again.
So this code path is very much in user context (called directly by a
write system call). The issue that Alexei had was that it's also in an
rcu_read_lock() section.
I wonder if this all goes away if we switch to SRCU? That is, sleepable RCU.
-- Steve
On Tue, May 16, 2023 at 7:29 PM Steven Rostedt <[email protected]> wrote:
>
> So this code path is very much in user context (called directly by a
> write system call). The issue that Alexei had was that it's also in an
> rcu_read_lock() section.
>
> I wonder if this all goes away if we switch to SRCU?
Yes, SRCU context would work.
That said, how critical is this code? Because honestly, the *sanest*
thing to do is to just hold the lock that actually protects the list,
not try to walk the list in any RCU mode.
And as far as I can tell, that's the 'event_mutex', which is already held.
RCU walking of a list is only meaningful when the walk doesn't need
the lock that guarantees the list integrity.
But *modification* of a RCU-protected list still requires locking, and
from a very cursory look, it really looks like 'event_mutex' is
already the lock that protects the list.
So the whole use of RCU during the list walking there in
user_event_enabler_update() _seems_ pointless. You hold event_mutex -
user_event_enabler_write() that is called in the loop already has a
lockdep assert to that effect.
So what is it that could even race and change the list that is the
cause of that rcu-ness?
Other code in that file happily just does
mutex_lock(&event_mutex);
list_for_each_entry_safe(enabler, next, &mm->enablers, link)
with no RCU anywhere. Why does user_event_enabler_update() not do that?
Oh, and even those other loops are a bit strange. Why do they use the
"_safe" variant, even when they just traverse the list without
changing it? Look at user_event_enabler_exists(), for example.
I must really be missing something. That code is confusing. Or I am
very confused.
Linus
On Tue, May 16, 2023 at 09:26:58PM -0400, Steven Rostedt wrote:
> On Tue, 16 May 2023 17:36:28 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
>
> > "
> > The user that will generate events must have x access to the tracing directory, e.g. chmod a+x /sys/kernel/tracing
> > The user that will generate events must have rw access to the tracing/user_events_data file, e.g. chmod a+rw /sys/kernel/tracing/user_events_data
> > "
> > So any unpriv user can create and operate user events.
> > Including seeing and enabling other user's user_events with 'ls/echo/cat' in tracefs.
>
> It can see user_events_data, but x only gives you access into the directory.
> It does not get you the contents of the files within the directory. The
> above only gives access to the user_events_data. Which is to create events.
>
> I recommended using groups and not giving access to all tasks.
>
> >
> > Looks like user events were designed with intention to be unprivileged.
> > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > I doubt other people reviewed it from security perspective.
> >
> > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> >
> > For example, I think the following is possible:
> > fd = open("/sys/kernel/tracing/user_events_data")
> > ioclt(fd, DIAG_IOCSDEL)
> > user_events_ioctl_del
> > delete_user_event(info->group, name);
> >
> > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > because only one global init_group is created.
> > So one user can unregister other user event by knowing 'name'.
> > A security hole, no?
> >
> > > and libside [2] will also help here.
> >
> > > [2] https://github.com/compudj/libside
> >
> > That's an interesting project. It doesn't do any user_events access afaict,
>
> I'll let Beau answer the rest.
>
> -- Steve
Mathieu and I have talked for the last year to align user_events with
the ability to also run user-space tracers together. I've sent a patch
out to Mathieu to add user_events to libside and was the main reason why
the ABI moved toward remote writes of bits.
Libside uses a binary description of event data that the kernel cannot
handle (yet). We talk about this almost each tracefs meeting, libside
can be used with user_events, however, the kernel side decoding is hard
to describe at the moment. We are working on a way to tell the kernel
about events via a binary format to achieve this.
Regarding deleting events, only users that are given access can delete
events. They must know the event name, just like users with access to
delete files must know a path (and have access to it). Since the
write_index and other details are per-process, unless the user has
access to either /sys/kernel/tracing/events/user_events/* or
/sys/kernel/tracing/user_events_status, they do not know which names are
being used.
If that is not enough, we could require CAP_SYSADMIN to be able to
delete events even when they have access to the file. Users can also
apply SELinux policies per-file to achieve further isolation, if
required.
Thanks,
-Beau
On Tue, May 16, 2023 at 08:03:09PM -0700, Linus Torvalds wrote:
> On Tue, May 16, 2023 at 7:29 PM Steven Rostedt <[email protected]> wrote:
> >
> > So this code path is very much in user context (called directly by a
> > write system call). The issue that Alexei had was that it's also in an
> > rcu_read_lock() section.
> >
> > I wonder if this all goes away if we switch to SRCU?
>
> Yes, SRCU context would work.
>
> That said, how critical is this code? Because honestly, the *sanest*
> thing to do is to just hold the lock that actually protects the list,
> not try to walk the list in any RCU mode.
>
> And as far as I can tell, that's the 'event_mutex', which is already held.
>
> RCU walking of a list is only meaningful when the walk doesn't need
> the lock that guarantees the list integrity.
>
> But *modification* of a RCU-protected list still requires locking, and
> from a very cursory look, it really looks like 'event_mutex' is
> already the lock that protects the list.
>
> So the whole use of RCU during the list walking there in
> user_event_enabler_update() _seems_ pointless. You hold event_mutex -
> user_event_enabler_write() that is called in the loop already has a
> lockdep assert to that effect.
>
> So what is it that could even race and change the list that is the
> cause of that rcu-ness?
>
Processes that fork() with previous user_events need to be duplicated.
The fork() paths do not acquire the event_mutex. In the middle of a fork
an event could become enabled/disabled, which would call this part of
the code, at that time the list is actively being appended to when we
try to update the bits.
> Other code in that file happily just does
>
> mutex_lock(&event_mutex);
>
> list_for_each_entry_safe(enabler, next, &mm->enablers, link)
>
> with no RCU anywhere. Why does user_event_enabler_update() not do that?
>
This is due to the fork() case above without taking the event_mutex. I
really tried to not cause fork() to stall if a process uses user_events.
This required using RCU, maybe there is a simpler approach:
One approach I can think of is that during fork() we don't add the newly
created mm to the global list until we copy all the enablers. The COW
pages should reflect the bits if a timing window occurs there, since I
believe it's impossible for the newly forked() mm to cause a COW during
that time. Then I can drop this RCU on the enablers.
> Oh, and even those other loops are a bit strange. Why do they use the
> "_safe" variant, even when they just traverse the list without
> changing it? Look at user_event_enabler_exists(), for example.
>
The other places in the code that do this either will remove the event
depending on the situation during the for_each, or they only hold the
register lock and don't hold the event_mutex. So the disabler could get
removed out from under it.
IE: user_events_ioctl_reg() -> current_user_event_enabler_exists()
This is a place where we could just simply change to grab the
event_mutex, it's pretty isolated and we take the lock anyway further
down the path.
> I must really be missing something. That code is confusing. Or I am
> very confused.
>
> Linus
Thanks,
-Beau
On Tue, May 16, 2023 at 05:36:28PM -0700, Alexei Starovoitov wrote:
> On Mon, May 15, 2023 at 12:24:07PM -0700, Beau Belgrave wrote:
> > > >
> > > > ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> > > > &page, NULL, NULL);
> > >
> > > ... which will call pin_user_pages_remote() in RCU CS.
> > > This looks buggy, since pin_user_pages_remote() may schedule.
> > >
> >
> > If it's possible to schedule, I can change this to cache the probe
> > callbacks under RCU then drop it. However, when would
> > pin_user_pages_remote() schedule with FOLL_NOFAULT?
>
> Are you saying that passing FOLL_NOFAULT makes it work in atomic context?
> Is this documented anywhere?
>
> > I couldn't pick up
> > where it might schedule?
>
> I think I see plenty of rw_semaphore access in the guts of GUP.
>
> Have you tested user events with CONFIG_DEBUG_ATOMIC_SLEEP?
>
This pops on ATOMIC_SLEEP, thanks for pointing this out. I missed that
the gup retry statement is a fallthrough. PROVE_RCU/LOCKING didn't catch
this, lesson learned.
[...]
> >
> > I thought it being a GPL export symbol that this kind of stuff would be
> > documented somewhere if there are requirements to use the method. As it
>
> EXPORT_SYMBOL_GPL(perf_trace_run_bpf_submit);
> does not mean that any arbitrary code in the kernel or GPL-ed module
> is free to call it whichever way they like.
> It's an export symbol only because modules expose tracepoints.
> It's an implementation detail of DECLARE_EVENT_CLASS macro and
> can change at any time including removal of export symbol.
>
Ok, guess I'm looking for what best to do here that is least likely to
break and also allows potentially the BPF program to grab further user
memory within it (I guess this means using sleepable BPF, should I
follow what uprobes did?).
> > stands in the patch, the data that is sent to BPF is from the buffer
> > returned from perf_trace_buf_alloc() after it has been copied from the
> > user process.
> >
> > If the process crashes, that shouldn't affect the actual data. The
> > tracepoint remains even upon a crash. If you try to unregister the
> > tracepoint while BPF is attached, it is prevented, as the tracepoints
> > are ref-counted and cannot be unregistered if anything is using it
> > (user processes, ftrace, perf/bpf).
> >
> > We have been using libbpf to attach and monitor user_events with this
> > patch and haven't hit issues for what we plan to use it for (decode
> > the payload, aggregate, and track what's happening per-TID/PID). The
> > data we care about is already in kernel memory via the perf trace
> > buffer.
>
> What bpf prog type do you use? How does libbpf attach it?
> You have to provide a patch for selftest/bpf/ for us to meaningfully review it.
>
This is how I wired up libbpf via libbpf-bootstrap for the sample that's
checked in:
struct example {
unsigned long long unused;
int count;
};
SEC("tp/user_events/test")
int handle_tp(struct example *ctx)
{
int pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("BPF triggered from PID %d, count=%d.\n", pid, ctx->count);
return 0;
}
I'm not sure if tp is referencing traditional tracepoint or not
(guessing it is).
> >
> > > In general we don't want bpf to be called in various parts of the kernel
> > > just because bpf was used in similar parts elsewhere.
> > > bpf needs to provide real value for a particular kernel subsystem.
> > >
> >
> > For sure. I've had a lot of requests within Microsoft to wire up BPF to
> > user_events which prompted this patch. I've been in a few conversations
> > where we start talking about perf_event buffers and teams stop and ask
> > why it cannot go to BPF directly.
>
> So you need perf_event buffers or ftrace ring buffer (aka trace_pipe) ?
> Which one do you want to use ?
>
We use both, depending on the situation. Local debugging we typically
use ftrace since it's quite easy to use. In production we use perf_event
buffers mainly.
> > Yeah, keep consistent was more about using the GPL export symbol, which
> > the kernel tracepoints currently utilize. I wanted to avoid any special
> > casing BPF needed to add for user_events, and I also expect users would
> > like one way to write a BPF program for tracepoints/trace_events even
> > if they are from user processes vs kernel.
>
> BPF progs have three ways to access kernel tracepoints:
> 1. traditional tracepoint
> 2. raw tracepoint
> 3. raw tracepoint with BTF
>
> 1 was added first and now rarely used (only by old tools), since it's slow.
> 2 was added later to address performance concerns.
> 3 was added after BTF was introduced to provide accurate types.
>
> 3 is the only one that bpf community recommends and is the one that is used most often.
>
> As far as I know trace_events were never connected to bpf.
> Unless somebody sneaked the code in without us seeing it.
>
> I think you're trying to model user_events+bpf as 1.
> Which means that you'll be repeating the same mistakes.
>
See above, asking for guidance.
> >
> > > Beau,
> > > please provide a detailed explanation of your use case and how bpf helps.
> > >
> >
> > There are teams that have existing BPF programs that want to also pull
> > in data from user processes in addition to the data they already collect
> > from the kernel.
> >
> > We are also seeing a trend of teams wanting to drop buffering approaches
> > and move into non-buffered analysis of problems. An example is as soon
> > as a fault happens in a user-process, they would like the ability to see
> > what that thread has done, what the kernel did a bit before the error
> > (or other processes that have swapped in, etc).
>
> Sounds like bpf prog would need to access user memory.
> What we've learned the hard way that you cannot do it cleanly from the kernel
> tracepoint/trace_event/perf_event (and user_event in your case).
> The only clean way to do it is from uprobe where it's user context and it is
> sleepable and fault-able. That's why we've added 'sleepable bpf uprobes'.
>
> Just going with perf_trace_run_bpf_submit() you'll only have 'best effort' access
> to user data. Not recommended.
>
Good to know, do you recommend then how uprobes did this? These are in
user context via write()/writev(). I don't see why I wouldn't pick
sleepable / faultable if it offers better options to folks.
[...]
> > We've used branch + syscall approaches in Windows for a long time and
> > have found them to work well in these locked down environments as well
> > as for JIT'd languages like C#.
>
> Ok. Looks like we've got to the main reason for user_events.
> Re-phrasing above statement. User_events-like facility existed in Windows
> and we've decided to implement the same in Linux to have common framework
> to monitor applications in both OSes.
> Are you planning to extend bpf-for-windows to attach to window's equivalent
> of user_events ?
> If so, we can allow bpf progs to be attached to user_events in Linux.
> Please send a proper patch with [PATCH bpf-next] subject targeting bpf-next
> with selftest and clear explanation of the true reason.
>
Happy to.
> Also think hard whether repeating our prior tracepoint+bpf mistakes is
> really want you want to do with user_events+bpf.
It sounds like I should look into sleepable BPF, which I will do, but
I'll wait to get you advice on this before sending a patch, etc.
Thanks,
-Beau
On Wed, May 17, 2023 at 10:22 AM Beau Belgrave
<[email protected]> wrote:
>
> On Tue, May 16, 2023 at 08:03:09PM -0700, Linus Torvalds wrote:
> > So what is it that could even race and change the list that is the
> > cause of that rcu-ness?
>
> Processes that fork() with previous user_events need to be duplicated.
BS.
Really. Stop making stuff up.
The above statement is clearly not true - just LOOK AT THE CODE.
Here's the loop in question:
list_for_each_entry_rcu(enabler, &mm->enablers, link) {
if (enabler->event == user) {
attempt = 0;
user_event_enabler_write(mm, enabler,
true, &attempt);
}
}
and AT THE VERY TOP OF user_event_enabler_write() we have this:
lockdep_assert_held(&event_mutex);
so either nobody has ever tested this code with lockdep enabled, or we
hold that lock.
And if nobody has ever tested the code, then it's broken anyway. That
code N#EEDS the mutex lock. It needs to stop thinking it's RCU-safe,
when it clearly isn't.
So I ask again: why is that code using RCU list traversal, when it
already holds the lock that makes the RCU'ness COMPLETELY POINTLESS.
And again, that pointless RCU locking around this all seems to be the
*only* reason for all these issues with pin_user_pages_remote().
So I claim that this code is garbage. Somebody didn't think about locking.
Now, it's true that during fork, we have *another* RCU loop, but that
one is harmless: that's not the one that does all this page pinning.
Now, that one *does* do
list_add_rcu(&enabler->link, &mm->enablers);
without actually holding any locks, but in this case 'mm' is a newly
allocated private thing of a task that hasn't even been exposed to the
world yet, so nobody should be able to even see it. So that code lacks
the proper locking for the new list, but it does so because there is
nothing that can race with the new list (and the old list is
read-only, so RCU traversal of the old list works).
So that "list_add_rcu()" there could probably be just a "list_add()",
with a comment saying "this is new, nobody can see it".
And if something *can* race it it and can see the new list, then it
had damn well needs that mutex lock anyway, because that "something"
could be actually modifying it. But that's separate from the page
pinning situation.
So again, I claim that the RCU'ness of the pin_user_pages part is
broken and should simply not exist.
> > Other code in that file happily just does
> >
> > mutex_lock(&event_mutex);
> >
> > list_for_each_entry_safe(enabler, next, &mm->enablers, link)
> >
> > with no RCU anywhere. Why does user_event_enabler_update() not do that?
>
> This is due to the fork() case above without taking the event_mutex.
See above. Your thinking is confused, and the code is broken.
If somebody can see the new list while it is created during fork(),
then you need the event_mutex to protect the creation of it.
And if nobody can see it, then you don't need any RCU protection against it.
Those are the two choices. You can't have it both ways.
> > Oh, and even those other loops are a bit strange. Why do they use the
> > "_safe" variant, even when they just traverse the list without
> > changing it? Look at user_event_enabler_exists(), for example.
>
> The other places in the code that do this either will remove the event
> depending on the situation during the for_each, or they only hold the
> register lock and don't hold the event_mutex.
So?
That "safe" variant doesn't imply any locking. It does *not* protect
against events being removed. It *purely* protects against the loop
itself removing entries.
So this code:
list_for_each_entry_safe(enabler, next, &mm->enablers, link) {
if (enabler->addr == uaddr &&
(enabler->values & ENABLE_VAL_BIT_MASK) == bit)
return true;
}
is simply nonsensical. There is no reason for the "safe". It does not
make anything safer.
The above loop is only safe under the mutex (it would need to be the
"rcu" variant to be safe to traverse without locking), and since it
isn't modifying the list, there's no reason for the safe.
End result: the "safe" part is simply wrong.
If the intention is "rcu" because of lack of locking, then the code needs to
(a) get the rcu read lock
(b) use the _rcu variant of the list traversal
And if the intention is that it's done under the proper 'event_mutex'
lock, then the "safe" part should simply be dropped.
Linus
On Wed, May 17, 2023 at 11:15:14AM -0700, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 10:22 AM Beau Belgrave
> <[email protected]> wrote:
> >
> > On Tue, May 16, 2023 at 08:03:09PM -0700, Linus Torvalds wrote:
> > > So what is it that could even race and change the list that is the
> > > cause of that rcu-ness?
> >
> > Processes that fork() with previous user_events need to be duplicated.
>
> BS.
>
> Really. Stop making stuff up.
>
> The above statement is clearly not true - just LOOK AT THE CODE.
>
user_event_mm_dup() puts a new mm into the global list before the
enablers list is fully populated. As it stands now, since it's in the
global list, it can get enumerated in a small timing window via the
tracing subsystem register callbacks when someone enables the event via
ftrace/perf.
> Here's the loop in question:
>
> list_for_each_entry_rcu(enabler, &mm->enablers, link) {
> if (enabler->event == user) {
> attempt = 0;
> user_event_enabler_write(mm, enabler,
> true, &attempt);
> }
> }
>
> and AT THE VERY TOP OF user_event_enabler_write() we have this:
>
> lockdep_assert_held(&event_mutex);
>
> so either nobody has ever tested this code with lockdep enabled, or we
> hold that lock.
>
> And if nobody has ever tested the code, then it's broken anyway. That
> code N#EEDS the mutex lock. It needs to stop thinking it's RCU-safe,
> when it clearly isn't.
>
> So I ask again: why is that code using RCU list traversal, when it
> already holds the lock that makes the RCU'ness COMPLETELY POINTLESS.
>
> And again, that pointless RCU locking around this all seems to be the
> *only* reason for all these issues with pin_user_pages_remote().
>
> So I claim that this code is garbage. Somebody didn't think about locking.
>
> Now, it's true that during fork, we have *another* RCU loop, but that
> one is harmless: that's not the one that does all this page pinning.
>
> Now, that one *does* do
>
> list_add_rcu(&enabler->link, &mm->enablers);
>
> without actually holding any locks, but in this case 'mm' is a newly
> allocated private thing of a task that hasn't even been exposed to the
> world yet, so nobody should be able to even see it. So that code lacks
> the proper locking for the new list, but it does so because there is
> nothing that can race with the new list (and the old list is
> read-only, so RCU traversal of the old list works).
>
Well, that's the problem I was trying to point out. The fork path calls
user_event_mm_dup() -> user_event_mm_create(), which DO expose this to
the trace world. I definitely need to fix that, then I can drop these RCU
paths in the enablers. It has been exposed out to the tracing tracepoint
paths, but it has yet to be exposed out to the newly forked process that
could cause data writes, add new events, disable them, etc.
> So that "list_add_rcu()" there could probably be just a "list_add()",
> with a comment saying "this is new, nobody can see it".
>
Yes, after I fix the ordering of the mm add to the tracing global list.
That is clearly something I should have done originally and caused
confusion and extra RCU usage that is unneeded.
> And if something *can* race it it and can see the new list, then it
> had damn well needs that mutex lock anyway, because that "something"
> could be actually modifying it. But that's separate from the page
> pinning situation.
>
> So again, I claim that the RCU'ness of the pin_user_pages part is
> broken and should simply not exist.
>
> > > Other code in that file happily just does
> > >
> > > mutex_lock(&event_mutex);
> > >
> > > list_for_each_entry_safe(enabler, next, &mm->enablers, link)
> > >
> > > with no RCU anywhere. Why does user_event_enabler_update() not do that?
> >
> > This is due to the fork() case above without taking the event_mutex.
>
> See above. Your thinking is confused, and the code is broken.
>
> If somebody can see the new list while it is created during fork(),
> then you need the event_mutex to protect the creation of it.
>
> And if nobody can see it, then you don't need any RCU protection against it.
>
> Those are the two choices. You can't have it both ways.
>
> > > Oh, and even those other loops are a bit strange. Why do they use the
> > > "_safe" variant, even when they just traverse the list without
> > > changing it? Look at user_event_enabler_exists(), for example.
> >
> > The other places in the code that do this either will remove the event
> > depending on the situation during the for_each, or they only hold the
> > register lock and don't hold the event_mutex.
>
> So?
>
> That "safe" variant doesn't imply any locking. It does *not* protect
> against events being removed. It *purely* protects against the loop
> itself removing entries.
>
> So this code:
>
> list_for_each_entry_safe(enabler, next, &mm->enablers, link) {
> if (enabler->addr == uaddr &&
> (enabler->values & ENABLE_VAL_BIT_MASK) == bit)
> return true;
> }
>
> is simply nonsensical. There is no reason for the "safe". It does not
> make anything safer.
>
> The above loop is only safe under the mutex (it would need to be the
> "rcu" variant to be safe to traverse without locking), and since it
> isn't modifying the list, there's no reason for the safe.
>
> End result: the "safe" part is simply wrong.
>
Got it, I was confused.
> If the intention is "rcu" because of lack of locking, then the code needs to
> (a) get the rcu read lock
> (b) use the _rcu variant of the list traversal
>
> And if the intention is that it's done under the proper 'event_mutex'
> lock, then the "safe" part should simply be dropped.
>
> Linus
Thanks,
-Beau
On Wed, May 17, 2023 at 12:08 PM Beau Belgrave
<[email protected]> wrote:
>
> user_event_mm_dup() puts a new mm into the global list before the
> enablers list is fully populated.
Then that simply needs to be fixed.
user_event_mm_dup() should not madd the mm into the global list until
it is *done*.
Because if it makes that list visible to others in a half-way state,
then it needs to use the proper locking and use event_mutex.
You can't say "this is so critical that we can't take a lock" and then
use that as an excuse to simply do buggy code.
Either take the lock in user_event_mm_dup(), or make sure that the
data structures are all completely local so that no lock is necessary.
Here's a COMPLETELY UNTESTED patch that just separates out the notion
of "allocate" and "attach".
NOTE NOTE NOTE! I am *not* claiming this patch works. It builds for
me. It looks right. It seems like it's the right thing to do. But it
might have some issues.
With this, the newly dup'ed list is attached to the process once after
it is done, so nobody can see the list being built up.
Also note that this does NOT fix the incorrect RCU walks.
Linus
On Wed, May 17, 2023 at 12:26 PM Linus Torvalds
<[email protected]> wrote:
>
> Also note that this does NOT fix the incorrect RCU walks.
.. this is the patch that I think should go on top of it to fix the
misleading "safe" and the incorrect RCU walk.
NOTE! This adds that
lockdep_assert_held(&event_mutex);
to user_event_enabler_update() too. It's already there in
user_event_enabler_write(), but I'm not actually convinced this has
gotten enough coverage checking, so I also did it in that caller.
Some callers obviously hold that mutex. Others are much less obvious,
eg that user_event_reg() -> update_enable_bit_for() chain. I *assume*
all the 'class->reg()' callers get the event mutex, but I did not in
any way check that it is true.
So that lockdep annotation should be actually *tested* with lockdep
enabled and somebody doing all these operations.
Final note: I do not know this code *AT*ALL*. I'm literally just going
by "this is the only correct coding pattern to use", not by some
deeper understanding of what the code actually wants to do.
Linus
On Wed, May 17, 2023 at 12:26:44PM -0700, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 12:08 PM Beau Belgrave
> <[email protected]> wrote:
> >
> > user_event_mm_dup() puts a new mm into the global list before the
> > enablers list is fully populated.
>
> Then that simply needs to be fixed.
>
Agreed.
> user_event_mm_dup() should not madd the mm into the global list until
> it is *done*.
>
> Because if it makes that list visible to others in a half-way state,
> then it needs to use the proper locking and use event_mutex.
>
> You can't say "this is so critical that we can't take a lock" and then
> use that as an excuse to simply do buggy code.
>
Didn't mean that, I just have more work to do then just the RCU walks.
I will fix these up.
> Either take the lock in user_event_mm_dup(), or make sure that the
> data structures are all completely local so that no lock is necessary.
>
> Here's a COMPLETELY UNTESTED patch that just separates out the notion
> of "allocate" and "attach".
>
> NOTE NOTE NOTE! I am *not* claiming this patch works. It builds for
> me. It looks right. It seems like it's the right thing to do. But it
> might have some issues.
>
> With this, the newly dup'ed list is attached to the process once after
> it is done, so nobody can see the list being built up.
>
> Also note that this does NOT fix the incorrect RCU walks.
>
> Linus
Thanks for the patch and feedback!
-Beau
> kernel/trace/trace_events_user.c | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index b1ecd7677642..b2aecbfbbd24 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -538,10 +538,9 @@ static struct user_event_mm *user_event_mm_get_all(struct user_event *user)
> return found;
> }
>
> -static struct user_event_mm *user_event_mm_create(struct task_struct *t)
> +static struct user_event_mm *user_event_mm_alloc(struct task_struct *t)
> {
> struct user_event_mm *user_mm;
> - unsigned long flags;
>
> user_mm = kzalloc(sizeof(*user_mm), GFP_KERNEL_ACCOUNT);
>
> @@ -553,12 +552,6 @@ static struct user_event_mm *user_event_mm_create(struct task_struct *t)
> refcount_set(&user_mm->refcnt, 1);
> refcount_set(&user_mm->tasks, 1);
>
> - spin_lock_irqsave(&user_event_mms_lock, flags);
> - list_add_rcu(&user_mm->link, &user_event_mms);
> - spin_unlock_irqrestore(&user_event_mms_lock, flags);
> -
> - t->user_event_mm = user_mm;
> -
> /*
> * The lifetime of the memory descriptor can slightly outlast
> * the task lifetime if a ref to the user_event_mm is taken
> @@ -572,6 +565,17 @@ static struct user_event_mm *user_event_mm_create(struct task_struct *t)
> return user_mm;
> }
>
> +static void user_event_mm_attach(struct user_event_mm *user_mm, struct task_struct *t)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&user_event_mms_lock, flags);
> + list_add_rcu(&user_mm->link, &user_event_mms);
> + spin_unlock_irqrestore(&user_event_mms_lock, flags);
> +
> + t->user_event_mm = user_mm;
> +}
> +
> static struct user_event_mm *current_user_event_mm(void)
> {
> struct user_event_mm *user_mm = current->user_event_mm;
> @@ -579,10 +583,12 @@ static struct user_event_mm *current_user_event_mm(void)
> if (user_mm)
> goto inc;
>
> - user_mm = user_event_mm_create(current);
> + user_mm = user_event_mm_alloc(current);
>
> if (!user_mm)
> goto error;
> +
> + user_event_mm_attach(user_mm, current);
> inc:
> refcount_inc(&user_mm->refcnt);
> error:
> @@ -670,7 +676,7 @@ void user_event_mm_remove(struct task_struct *t)
>
> void user_event_mm_dup(struct task_struct *t, struct user_event_mm *old_mm)
> {
> - struct user_event_mm *mm = user_event_mm_create(t);
> + struct user_event_mm *mm = user_event_mm_alloc(t);
> struct user_event_enabler *enabler;
>
> if (!mm)
> @@ -684,10 +690,11 @@ void user_event_mm_dup(struct task_struct *t, struct user_event_mm *old_mm)
>
> rcu_read_unlock();
>
> + user_event_mm_attach(mm, t);
> return;
> error:
> rcu_read_unlock();
> - user_event_mm_remove(t);
> + user_event_mm_destroy(mm);
> }
>
> static bool current_user_event_enabler_exists(unsigned long uaddr,
On Wed, May 17, 2023 at 12:36 PM Linus Torvalds
<[email protected]> wrote:
>
> .. this is the patch that I think should go on top of it to fix the
> misleading "safe" and the incorrect RCU walk.
Let's actually attach the patch too. Duh.
Linus
On Wed, May 17, 2023 at 12:36 PM Linus Torvalds
<[email protected]> wrote:
>
> .. this is the patch that I think should go on top of it to fix the
> misleading "safe" and the incorrect RCU walk.
>
> NOTE! This adds that
>
> lockdep_assert_held(&event_mutex);
>
> to user_event_enabler_update() too.
One more note: I think it would be really good to use different names
for the "links".
We have "mm->link", that is the list of mm's on the 'user_event_mms'
list, protected by the 'user_event_mms_lock' and RCU-walked.
And then we have 'enabler->link', which is the list of enables on the
'user_mm->enablers' list, protected by event_mutex, and _also_
occasionally RCU-walked.
And then we have 'validator->link', which isn't RCU-walked, and seems
to also be protected by the event_mutex (?).
This is all very confusing when looking at it as an outsider.
Particularly when you sometimes just see
list_del_rcu(&mm->link);
and you have to figure "which 'link' are we talking about again?".
Also, I have to say, I found "mm->next" *really* confusing at first.
It turns out that "mm->next" is this list that is dynamically built up
by user_event_mm_get_all(), and is only a one-shot list that is valid
only as long as 'event_mutex' is held. But the only lock the code
*talks* about is the RCU lock, which does *not* protect that list, and
only exists as a way to walk that user_event_mms list without taking
any locks.
So user_event_enabler_update() actually relies on that 'event_mutex'
lock in another way than the obvious one that is about the
mm->enablers list that it *also* walks.
Again, that all seems to work, but it's *really* confusing how
"mm->next" always exists as a field, but is only usable and valid
while you hold that event_mutex and have called
user_event_mm_get_all() to gather the list.
I think both user_event_enabler_update() and user_event_mm_get_all()
should have a mention of how they require event_mutex and how that
->next list works.
Anyway, I still *think* the two patches I sent out are right, but I'm
just mentioning this confusion I had to deal with when trying to
decode what the rules were. Maybe all the above is obvious to
everybody else, but it took me a while to decipher (and maybe I
misread something).
Linus
On Wed, May 17, 2023 at 12:37:11PM -0700, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 12:36 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > .. this is the patch that I think should go on top of it to fix the
> > misleading "safe" and the incorrect RCU walk.
>
> Let's actually attach the patch too. Duh.
>
> Linus
> kernel/trace/trace_events_user.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index b2aecbfbbd24..054e28cc5ad4 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -439,7 +439,7 @@ static bool user_event_enabler_exists(struct user_event_mm *mm,
> struct user_event_enabler *enabler;
> struct user_event_enabler *next;
>
> - list_for_each_entry_safe(enabler, next, &mm->enablers, link) {
> + list_for_each_entry(enabler, next, &mm->enablers, link) {
> if (enabler->addr == uaddr &&
> (enabler->values & ENABLE_VAL_BIT_MASK) == bit)
> return true;
> @@ -455,19 +455,19 @@ static void user_event_enabler_update(struct user_event *user)
> struct user_event_mm *next;
> int attempt;
>
> + lockdep_assert_held(&event_mutex);
> +
> while (mm) {
> next = mm->next;
> mmap_read_lock(mm->mm);
> - rcu_read_lock();
>
> - list_for_each_entry_rcu(enabler, &mm->enablers, link) {
> + list_for_each_entry(enabler, &mm->enablers, link) {
> if (enabler->event == user) {
> attempt = 0;
> user_event_enabler_write(mm, enabler, true, &attempt);
> }
> }
>
> - rcu_read_unlock();
> mmap_read_unlock(mm->mm);
> user_event_mm_put(mm);
> mm = next;
Do you mind giving me your Signed-off-by for these?
I plan to do a series where I take these patches and then also fix up a
few comments and the link namings as you suggested.
First patch is clean, second patch I made the following changes and
after that passed all the self-tests without bug splats with
CONFIG_PROVE_LOCKING/RCU and ATOMIC_SLEEP:
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index b2aecbfbbd24..2f70dabb0f71 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -437,9 +437,8 @@ static bool user_event_enabler_exists(struct user_event_mm *mm,
unsigned long uaddr, unsigned char bit)
{
struct user_event_enabler *enabler;
- struct user_event_enabler *next;
- list_for_each_entry_safe(enabler, next, &mm->enablers, link) {
+ list_for_each_entry(enabler, &mm->enablers, link) {
if (enabler->addr == uaddr &&
(enabler->values & ENABLE_VAL_BIT_MASK) == bit)
return true;
@@ -495,7 +494,9 @@ static bool user_event_enabler_dup(struct user_event_enabler *orig,
enabler->values = orig->values & ENABLE_VAL_DUP_MASK;
refcount_inc(&enabler->event->refcnt);
- list_add_rcu(&enabler->link, &mm->enablers);
+
+ /* Enablers not exposed yet, RCU not required */
+ list_add(&enabler->link, &mm->enablers);
On Wed, May 17, 2023 at 4:01 PM Beau Belgrave <[email protected]> wrote:
>
> Do you mind giving me your Signed-off-by for these?
Assuming you have some test-cases that you've run them through, then yes:
Signed-off-by: Linus Torvalds <[email protected]>
for both. But since I can't really test them myself, I'd really like
that. I did these patches just by looking at the source code, and
while I think that's an excellent way to do development, I do think
that testing is _also_ required.
And that second patch was obviously not even build-tested, so really
just a "something like this".
> I plan to do a series where I take these patches and then also fix up a
> few comments and the link namings as you suggested.
Sounds good.
> First patch is clean, second patch I made the following changes and
> after that passed all the self-tests without bug splats with
> CONFIG_PROVE_LOCKING/RCU and ATOMIC_SLEEP:
Your suggested version with just "list_add()" and a comment about why
it doesn't need RCU safety looks good. And the build fix obviously
requited ;)
Linus
On Wed, 17 May 2023 16:14:43 -0700
Linus Torvalds <[email protected]> wrote:
> On Wed, May 17, 2023 at 4:01 PM Beau Belgrave <[email protected]> wrote:
> >
> > Do you mind giving me your Signed-off-by for these?
>
> Assuming you have some test-cases that you've run them through, then yes:
Beau,
Can you update the tools/testing/selftests/user_events/ to make sure that
it triggers the lockdep splat without these updates (assuming that it does
trigger without these patches). Then add these patches to make sure the
splat goes away. This will confirm that the patches do what is expected of
them.
I usually run the selftests for tracing and for your user events with
lockdep and prove locking enabled. But it may have triggered on something
else disabling it when I ran my tests, in which case I sometimes disable
that and forget to re-enable it.
-- Steve
On Wed, May 17, 2023 at 07:25:28PM -0400, Steven Rostedt wrote:
> On Wed, 17 May 2023 16:14:43 -0700
> Linus Torvalds <[email protected]> wrote:
>
> > On Wed, May 17, 2023 at 4:01 PM Beau Belgrave <[email protected]> wrote:
> > >
> > > Do you mind giving me your Signed-off-by for these?
> >
> > Assuming you have some test-cases that you've run them through, then yes:
>
> Beau,
>
> Can you update the tools/testing/selftests/user_events/ to make sure that
> it triggers the lockdep splat without these updates (assuming that it does
> trigger without these patches). Then add these patches to make sure the
> splat goes away. This will confirm that the patches do what is expected of
> them.
>
Yes, I have run these through selftests with CONFIG_DEBUG_ATOMIC_SLEEP=y.
I can confirm without the patches it splats with that setting. When
these have been applied, the splat is gone.
> I usually run the selftests for tracing and for your user events with
> lockdep and prove locking enabled. But it may have triggered on something
> else disabling it when I ran my tests, in which case I sometimes disable
> that and forget to re-enable it.
>
Do you run with CONFIG_DEBUG_ATOMIC_SLEEP? It will not splat with just
CONFIG_PROVE_LOCKING and CONFIG_PROVE_RCU, which bit me here. I'm now
running all three now that I know better.
> -- Steve
Thanks,
-Beau
On Wed, May 17, 2023 at 05:10:47PM -0700, Alexei Starovoitov wrote:
> On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> > >
> > > >
> > > > Looks like user events were designed with intention to be unprivileged.
> > > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > > I doubt other people reviewed it from security perspective.
> > > >
> > > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > > >
> > > > For example, I think the following is possible:
> > > > fd = open("/sys/kernel/tracing/user_events_data")
> > > > ioclt(fd, DIAG_IOCSDEL)
> > > > user_events_ioctl_del
> > > > delete_user_event(info->group, name);
> > > >
> > > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > > because only one global init_group is created.
> > > > So one user can unregister other user event by knowing 'name'.
> > > > A security hole, no?
>
> ...
>
> > Regarding deleting events, only users that are given access can delete
> > events. They must know the event name, just like users with access to
> > delete files must know a path (and have access to it). Since the
> > write_index and other details are per-process, unless the user has
> > access to either /sys/kernel/tracing/events/user_events/* or
> > /sys/kernel/tracing/user_events_status, they do not know which names are
> > being used.
> >
> > If that is not enough, we could require CAP_SYSADMIN to be able to
> > delete events even when they have access to the file. Users can also
> > apply SELinux policies per-file to achieve further isolation, if
> > required.
>
> Whether /sys/kernel/tracing/user_events_status gets g+rw
> or it gets a+rw (as your documentation recommends)
> it is still a security issue.
> The "event name" is trivial to find out by looking at the source code
> of the target process or just "string target_binary".
I guess, if they have access to the binary, etc.
So they need both access to the binary and to the tracefs directory.
We would not give them access like this in any normal setup other than a
developer environment.
> Restricting to cap_sysadmin is not the answer, since you want unpriv.
We do not need unpriv to delete events, only to write and create events.
We allow unregistering call-sites, which would still work unpriv with
this requirement.
> SElinux is not the answer either.
> Since it's unpriv, different processes should not be able to mess with
> user events of other processes.
How is this different than uprobes if we give a user access to
/sys/kernel/tracing/dynamic_events? Users can delete those as well. I
don't see a difference here.
In our production environments we are not giving out wide security to
this file.
> It's a fundamental requirement of any kernel api.
> This has to be fixed before any bpf discussion.
> If it means that you need to redesign user_events do it now and
> excuses like "it's uapi now, so we cannot fix it" are not going to fly.
Thanks,
-Beau
On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> >
> > >
> > > Looks like user events were designed with intention to be unprivileged.
> > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > I doubt other people reviewed it from security perspective.
> > >
> > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > >
> > > For example, I think the following is possible:
> > > fd = open("/sys/kernel/tracing/user_events_data")
> > > ioclt(fd, DIAG_IOCSDEL)
> > > user_events_ioctl_del
> > > delete_user_event(info->group, name);
> > >
> > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > because only one global init_group is created.
> > > So one user can unregister other user event by knowing 'name'.
> > > A security hole, no?
...
> Regarding deleting events, only users that are given access can delete
> events. They must know the event name, just like users with access to
> delete files must know a path (and have access to it). Since the
> write_index and other details are per-process, unless the user has
> access to either /sys/kernel/tracing/events/user_events/* or
> /sys/kernel/tracing/user_events_status, they do not know which names are
> being used.
>
> If that is not enough, we could require CAP_SYSADMIN to be able to
> delete events even when they have access to the file. Users can also
> apply SELinux policies per-file to achieve further isolation, if
> required.
Whether /sys/kernel/tracing/user_events_status gets g+rw
or it gets a+rw (as your documentation recommends)
it is still a security issue.
The "event name" is trivial to find out by looking at the source code
of the target process or just "string target_binary".
Restricting to cap_sysadmin is not the answer, since you want unpriv.
SElinux is not the answer either.
Since it's unpriv, different processes should not be able to mess with
user events of other processes.
It's a fundamental requirement of any kernel api.
This has to be fixed before any bpf discussion.
If it means that you need to redesign user_events do it now and
excuses like "it's uapi now, so we cannot fix it" are not going to fly.
On Wed, May 17, 2023 at 5:14 PM Beau Belgrave <[email protected]> wrote:
>
> Do you run with CONFIG_DEBUG_ATOMIC_SLEEP? It will not splat with just
> CONFIG_PROVE_LOCKING and CONFIG_PROVE_RCU, which bit me here. I'm now
> running all three now that I know better.
I wonder if we should just make PROVE_LOCKING select DEBUG_ATOMIC_SLEEP..
PROVE_LOCKING is the expensive and complicated one. In contrast,
DEBUG_ATOMIC_SLEEP is the "we've had this simplistic check for some
very basic requirements for a long time".
So DEBUG_ATOMIC_SLEEP is really just a minimal debugging thing, it
feels a bit silly to have all the expensive "prove locking with
lockdep and all our lock debugging", and then not test the trivial
basics.
Linus
On Wed, May 17, 2023 at 5:19 PM Beau Belgrave <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 05:10:47PM -0700, Alexei Starovoitov wrote:
> > On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> > > >
> > > > >
> > > > > Looks like user events were designed with intention to be unprivileged.
> > > > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > > > I doubt other people reviewed it from security perspective.
> > > > >
> > > > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > > > >
> > > > > For example, I think the following is possible:
> > > > > fd = open("/sys/kernel/tracing/user_events_data")
> > > > > ioclt(fd, DIAG_IOCSDEL)
> > > > > user_events_ioctl_del
> > > > > delete_user_event(info->group, name);
> > > > >
> > > > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > > > because only one global init_group is created.
> > > > > So one user can unregister other user event by knowing 'name'.
> > > > > A security hole, no?
> >
> > ...
> >
> > > Regarding deleting events, only users that are given access can delete
> > > events. They must know the event name, just like users with access to
> > > delete files must know a path (and have access to it). Since the
> > > write_index and other details are per-process, unless the user has
> > > access to either /sys/kernel/tracing/events/user_events/* or
> > > /sys/kernel/tracing/user_events_status, they do not know which names are
> > > being used.
> > >
> > > If that is not enough, we could require CAP_SYSADMIN to be able to
> > > delete events even when they have access to the file. Users can also
> > > apply SELinux policies per-file to achieve further isolation, if
> > > required.
> >
> > Whether /sys/kernel/tracing/user_events_status gets g+rw
> > or it gets a+rw (as your documentation recommends)
> > it is still a security issue.
> > The "event name" is trivial to find out by looking at the source code
> > of the target process or just "string target_binary".
>
> I guess, if they have access to the binary, etc.
> So they need both access to the binary and to the tracefs directory.
> We would not give them access like this in any normal setup other than a
> developer environment.
>
> > Restricting to cap_sysadmin is not the answer, since you want unpriv.
>
> We do not need unpriv to delete events, only to write and create events.
>
> We allow unregistering call-sites, which would still work unpriv with
> this requirement.
>
> > SElinux is not the answer either.
> > Since it's unpriv, different processes should not be able to mess with
> > user events of other processes.
>
> How is this different than uprobes if we give a user access to
> /sys/kernel/tracing/dynamic_events? Users can delete those as well. I
> don't see a difference here.
Because kprobe/uprobe are root only.
No sane person will do chmod a+rw /sys/kernel/tracing/uprobe_events.
It's just like chmod a+rw /etc/passwd
Whereas this is your recommended approach for user_events.
> In our production environments we are not giving out wide security to
> this file.
Fine by me. Keep it insecure and broken. Do not send bpf patches then.
I refuse to have bpf callable from such subsystems.
Somebody will inevitably blame bpf for the insecurity of user_events.
On Wed, May 17, 2023 at 05:56:34PM -0700, Alexei Starovoitov wrote:
> On Wed, May 17, 2023 at 5:19 PM Beau Belgrave <[email protected]> wrote:
> >
> > On Wed, May 17, 2023 at 05:10:47PM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > > Looks like user events were designed with intention to be unprivileged.
> > > > > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > > > > I doubt other people reviewed it from security perspective.
> > > > > >
> > > > > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > > > > >
> > > > > > For example, I think the following is possible:
> > > > > > fd = open("/sys/kernel/tracing/user_events_data")
> > > > > > ioclt(fd, DIAG_IOCSDEL)
> > > > > > user_events_ioctl_del
> > > > > > delete_user_event(info->group, name);
> > > > > >
> > > > > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > > > > because only one global init_group is created.
> > > > > > So one user can unregister other user event by knowing 'name'.
> > > > > > A security hole, no?
> > >
> > > ...
> > >
> > > > Regarding deleting events, only users that are given access can delete
> > > > events. They must know the event name, just like users with access to
> > > > delete files must know a path (and have access to it). Since the
> > > > write_index and other details are per-process, unless the user has
> > > > access to either /sys/kernel/tracing/events/user_events/* or
> > > > /sys/kernel/tracing/user_events_status, they do not know which names are
> > > > being used.
> > > >
> > > > If that is not enough, we could require CAP_SYSADMIN to be able to
> > > > delete events even when they have access to the file. Users can also
> > > > apply SELinux policies per-file to achieve further isolation, if
> > > > required.
> > >
> > > Whether /sys/kernel/tracing/user_events_status gets g+rw
> > > or it gets a+rw (as your documentation recommends)
> > > it is still a security issue.
> > > The "event name" is trivial to find out by looking at the source code
> > > of the target process or just "string target_binary".
> >
> > I guess, if they have access to the binary, etc.
> > So they need both access to the binary and to the tracefs directory.
> > We would not give them access like this in any normal setup other than a
> > developer environment.
> >
> > > Restricting to cap_sysadmin is not the answer, since you want unpriv.
> >
> > We do not need unpriv to delete events, only to write and create events.
> >
> > We allow unregistering call-sites, which would still work unpriv with
> > this requirement.
> >
> > > SElinux is not the answer either.
> > > Since it's unpriv, different processes should not be able to mess with
> > > user events of other processes.
> >
> > How is this different than uprobes if we give a user access to
> > /sys/kernel/tracing/dynamic_events? Users can delete those as well. I
> > don't see a difference here.
>
> Because kprobe/uprobe are root only.
> No sane person will do chmod a+rw /sys/kernel/tracing/uprobe_events.
> It's just like chmod a+rw /etc/passwd
>
> Whereas this is your recommended approach for user_events.
>
I believe those instructions are for development only. I'll get them
changed to a more secure approach. We don't want to folks leaving it
wide open.
We should tell folks to use a group and give access to the group like
Steven said earlier.
> > In our production environments we are not giving out wide security to
> > this file.
>
> Fine by me. Keep it insecure and broken. Do not send bpf patches then.
> I refuse to have bpf callable from such subsystems.
> Somebody will inevitably blame bpf for the insecurity of user_events.
The delete IOCTL is different than reg/unreg. I don't see a problem with
adding a CAP_SYSADMIN check on the delete IOCTL (and other delete paths)
to prevent this. It shouldn't affect anything we are doing to add this
and it makes it so non-admins cannot delete any events if they are given
write access to the user_events_data file.
Thanks,
-Beau
On Wed, 17 May 2023 18:18:14 -0700
Beau Belgrave <[email protected]> wrote:
> We should tell folks to use a group and give access to the group like
> Steven said earlier.
Could we possibly add an "owner" attribute to the event. That is, the
creator of the event is the only one that can write to that event. Or at
least by the user that created it (not actually the process).
So if the user "rostedt" creates an event, only "rostedt" can write to or
delete the event.
Basically like how the /tmp directory works.
I think this would lock down the issue that Alexei is worried about.
Thoughts?
-- Steve
P.S. I'm about to leave on PTO and will be very late in my replies
starting now. (Unless I'm board at the airport, I may then reply).
-- Steve
On Wed, May 17, 2023 at 7:08 PM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 17 May 2023 18:18:14 -0700
> Beau Belgrave <[email protected]> wrote:
>
> > We should tell folks to use a group and give access to the group like
> > Steven said earlier.
>
> Could we possibly add an "owner" attribute to the event. That is, the
> creator of the event is the only one that can write to that event. Or at
> least by the user that created it (not actually the process).
>
> So if the user "rostedt" creates an event, only "rostedt" can write to or
> delete the event.
That can work, but why name is global in the first place?
Just make everything per-fd.
> The delete IOCTL is different than reg/unreg. I don't see a problem with
> adding a CAP_SYSADMIN check on the delete IOCTL (and other delete paths)
> to prevent this. It shouldn't affect anything we are doing to add this
> and it makes it so non-admins cannot delete any events if they are given
> write access to the user_events_data file.
sysadmin for delete is a pointless.
user_events_ioctl_reg() has the same issue.
Two different processes using you fancy TRACELOGGING_DEFINE_PROVIDER()
macro and picking the same name will race.
TRACELOGGING_DEFINE_PROVIDER( // defines the MyProvider symbol
MyProvider, // Name of the provider symbol to define
"MyCompany_MyComponent_MyProvider", // Human-readable provider
name, no ' ' or ':' chars.
// {d5b90669-1aad-5db8-16c9-6286a7fcfe33} // Provider guid
(ignored on Linux)
(0xd5b90669,0x1aad,0x5db8,0x16,0xc9,0x62,0x86,0xa7,0xfc,0xfe,0x33));
I totally get it that Beau is copy pasting these ideas from windows,
but windows is likely similarly broken if it's registering names
globally.
FD should be the isolation boundary.
fd = open("/sys/kernel/tracing/user_event")
and make sure all events are bound to that file.
when file is closed the events _should be auto deleted_.
That's another issue I just spotted.
Looks like user_events_release() is leaking memory.
user_event_refs are just lost.
tbh the more I look into the code the more I want to suggest to mark it
depends on BROKEN
and go back to redesign.
On Wed, 17 May 2023 20:14:31 -0700
Alexei Starovoitov <[email protected]> wrote:
> On Wed, May 17, 2023 at 7:08 PM Steven Rostedt <[email protected]> wrote:
> > The delete IOCTL is different than reg/unreg. I don't see a problem with
> > adding a CAP_SYSADMIN check on the delete IOCTL (and other delete paths)
> > to prevent this. It shouldn't affect anything we are doing to add this
> > and it makes it so non-admins cannot delete any events if they are given
> > write access to the user_events_data file.
>
> sysadmin for delete is a pointless.
> user_events_ioctl_reg() has the same issue.
> Two different processes using you fancy TRACELOGGING_DEFINE_PROVIDER()
> macro and picking the same name will race.
>
> TRACELOGGING_DEFINE_PROVIDER( // defines the MyProvider symbol
> MyProvider, // Name of the provider symbol to define
> "MyCompany_MyComponent_MyProvider", // Human-readable provider
> name, no ' ' or ':' chars.
> // {d5b90669-1aad-5db8-16c9-6286a7fcfe33} // Provider guid
> (ignored on Linux)
> (0xd5b90669,0x1aad,0x5db8,0x16,0xc9,0x62,0x86,0xa7,0xfc,0xfe,0x33));
>
> I totally get it that Beau is copy pasting these ideas from windows,
> but windows is likely similarly broken if it's registering names
> globally.
>
> FD should be the isolation boundary.
> fd = open("/sys/kernel/tracing/user_event")
> and make sure all events are bound to that file.
> when file is closed the events _should be auto deleted_.
>
> That's another issue I just spotted.
> Looks like user_events_release() is leaking memory.
> user_event_refs are just lost.
>
> tbh the more I look into the code the more I want to suggest to mark it
> depends on BROKEN
> and go back to redesign.
I don't think these changes require a redesign. I do like the idea that
the events live with the fd. That is, when the fd dies, so does the event.
Although, we may keep it around for a bit (no new events, but being
able to parse it. That is, the event itself isn't deleted until the fd
is closed, and so is the tracing files being read are closed.
Beau,
How hard is it to give the event an owner, but not for task or user,
but with the fd. That way you don't need to worry about other tasks
deleting the event. And it also automatically cleans itself up. If we
leave it to the sysadmin to clean up, it's going to cause leaks,
because it's not something the sysadmin will want to do, as they will
need to keep track of what events are created.
-- Steve
PS. I missed my connection due to unseasonal freezing temperatures, and
my little airport didn't have a driver for the deicer, making my flight
2 hours delayed (had to wait for the sun to come up and deice the
plane!). Thus, instead of enjoying myself by the pool, I'm in an
airport lounge without much to do.
On Thu, May 18, 2023 at 09:36:00AM -0400, Steven Rostedt wrote:
> On Wed, 17 May 2023 20:14:31 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
> > On Wed, May 17, 2023 at 7:08 PM Steven Rostedt <[email protected]> wrote:
>
> > > The delete IOCTL is different than reg/unreg. I don't see a problem with
> > > adding a CAP_SYSADMIN check on the delete IOCTL (and other delete paths)
> > > to prevent this. It shouldn't affect anything we are doing to add this
> > > and it makes it so non-admins cannot delete any events if they are given
> > > write access to the user_events_data file.
> >
> > sysadmin for delete is a pointless.
> > user_events_ioctl_reg() has the same issue.
> > Two different processes using you fancy TRACELOGGING_DEFINE_PROVIDER()
> > macro and picking the same name will race.
> >
> > TRACELOGGING_DEFINE_PROVIDER( // defines the MyProvider symbol
> > MyProvider, // Name of the provider symbol to define
> > "MyCompany_MyComponent_MyProvider", // Human-readable provider
> > name, no ' ' or ':' chars.
> > // {d5b90669-1aad-5db8-16c9-6286a7fcfe33} // Provider guid
> > (ignored on Linux)
> > (0xd5b90669,0x1aad,0x5db8,0x16,0xc9,0x62,0x86,0xa7,0xfc,0xfe,0x33));
> >
> > I totally get it that Beau is copy pasting these ideas from windows,
> > but windows is likely similarly broken if it's registering names
> > globally.
> >
> > FD should be the isolation boundary.
> > fd = open("/sys/kernel/tracing/user_event")
> > and make sure all events are bound to that file.
> > when file is closed the events _should be auto deleted_.
> >
> > That's another issue I just spotted.
> > Looks like user_events_release() is leaking memory.
> > user_event_refs are just lost.
> >
> > tbh the more I look into the code the more I want to suggest to mark it
> > depends on BROKEN
> > and go back to redesign.
>
> I don't think these changes require a redesign. I do like the idea that
> the events live with the fd. That is, when the fd dies, so does the event.
>
> Although, we may keep it around for a bit (no new events, but being
> able to parse it. That is, the event itself isn't deleted until the fd
> is closed, and so is the tracing files being read are closed.
>
> Beau,
>
> How hard is it to give the event an owner, but not for task or user,
> but with the fd. That way you don't need to worry about other tasks
> deleting the event. And it also automatically cleans itself up. If we
> leave it to the sysadmin to clean up, it's going to cause leaks,
> because it's not something the sysadmin will want to do, as they will
> need to keep track of what events are created.
>
We need to ensure that multiple processes can use the same event name:
Example we have shared libraries that processes use to publish events.
We need to ensure that if the original FD closes and other processes/FDs
are using it, those don't get ripped out from underneath it:
Example we have perf attached and then the process exits.
I think we can accomodate that all neatly if we just make the event
self-delete upon the last ref-count decrement. That way no one needs
the delete IOCTL and we can prevent things piling up.
We have flags in the struct, so we could either make this optional or
default. I like this approach Steven.
Thanks,
-Beau
> -- Steve
>
> PS. I missed my connection due to unseasonal freezing temperatures, and
> my little airport didn't have a driver for the deicer, making my flight
> 2 hours delayed (had to wait for the sun to come up and deice the
> plane!). Thus, instead of enjoying myself by the pool, I'm in an
> airport lounge without much to do.
On Wed, May 17, 2023 at 05:10:47PM -0700, Alexei Starovoitov wrote:
> On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> > >
> > > >
> > > > Looks like user events were designed with intention to be unprivileged.
> > > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > > I doubt other people reviewed it from security perspective.
> > > >
> > > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > > >
> > > > For example, I think the following is possible:
> > > > fd = open("/sys/kernel/tracing/user_events_data")
> > > > ioclt(fd, DIAG_IOCSDEL)
> > > > user_events_ioctl_del
> > > > delete_user_event(info->group, name);
> > > >
> > > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > > because only one global init_group is created.
> > > > So one user can unregister other user event by knowing 'name'.
> > > > A security hole, no?
>
> ...
>
> > Regarding deleting events, only users that are given access can delete
> > events. They must know the event name, just like users with access to
> > delete files must know a path (and have access to it). Since the
> > write_index and other details are per-process, unless the user has
> > access to either /sys/kernel/tracing/events/user_events/* or
> > /sys/kernel/tracing/user_events_status, they do not know which names are
> > being used.
> >
> > If that is not enough, we could require CAP_SYSADMIN to be able to
> > delete events even when they have access to the file. Users can also
> > apply SELinux policies per-file to achieve further isolation, if
> > required.
>
> Whether /sys/kernel/tracing/user_events_status gets g+rw
> or it gets a+rw (as your documentation recommends)
> it is still a security issue.
> The "event name" is trivial to find out by looking at the source code
> of the target process or just "string target_binary".
> Restricting to cap_sysadmin is not the answer, since you want unpriv.
> SElinux is not the answer either.
> Since it's unpriv, different processes should not be able to mess with
> user events of other processes.
> It's a fundamental requirement of any kernel api.
> This has to be fixed before any bpf discussion.
> If it means that you need to redesign user_events do it now and
> excuses like "it's uapi now, so we cannot fix it" are not going to fly.
Looking at this a little because I have a few minutes.
What's all this unused code?
static inline struct user_event_group
*user_event_group_from_user_ns(struct user_namespace *user_ns)
{
if (user_ns == &init_user_ns)
return init_group;
return NULL;
}
static struct user_event_group *current_user_event_group(void)
{
struct user_namespace *user_ns = current_user_ns();
struct user_event_group *group = NULL;
while (user_ns) {
group = user_event_group_from_user_ns(user_ns);
if (group)
break;
user_ns = user_ns->parent;
}
return group;
}
User namespaces form strict hierarchies so you always end up at
init_user_ns no matter where you start from in the hierarchy. Return the
init_group and delete that code above.
static char *user_event_group_system_name(struct user_namespace *user_ns)
{
char *system_name;
int len = sizeof(USER_EVENTS_SYSTEM) + 1;
if (user_ns != &init_user_ns) {
/*
* Unexpected at this point:
* We only currently support init_user_ns.
* When we enable more, this will trigger a failure so log.
*/
pr_warn("user_events: Namespace other than init_user_ns!\n");
return NULL;
}
Your delegation model is premised on file permissions of a single file
in global tracefs. It won't work with user namespaces so let's not give
the false impression that this is on the table.
Plus, all of this is also called in a single place during
trace_events_user_init() which is called from fs_initcall() so you
couldn't even pass a different user namespace if you wanted to because
only init_user_ns exists.
On Thu, Jun 01, 2023 at 11:46:13AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 05:10:47PM -0700, Alexei Starovoitov wrote:
> > On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> > > >
> > > > >
> > > > > Looks like user events were designed with intention to be unprivileged.
> > > > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > > > I doubt other people reviewed it from security perspective.
> > > > >
> > > > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > > > >
> > > > > For example, I think the following is possible:
> > > > > fd = open("/sys/kernel/tracing/user_events_data")
> > > > > ioclt(fd, DIAG_IOCSDEL)
> > > > > user_events_ioctl_del
> > > > > delete_user_event(info->group, name);
> > > > >
> > > > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > > > because only one global init_group is created.
> > > > > So one user can unregister other user event by knowing 'name'.
> > > > > A security hole, no?
> >
> > ...
> >
> > > Regarding deleting events, only users that are given access can delete
> > > events. They must know the event name, just like users with access to
> > > delete files must know a path (and have access to it). Since the
> > > write_index and other details are per-process, unless the user has
> > > access to either /sys/kernel/tracing/events/user_events/* or
> > > /sys/kernel/tracing/user_events_status, they do not know which names are
> > > being used.
> > >
> > > If that is not enough, we could require CAP_SYSADMIN to be able to
> > > delete events even when they have access to the file. Users can also
> > > apply SELinux policies per-file to achieve further isolation, if
> > > required.
> >
> > Whether /sys/kernel/tracing/user_events_status gets g+rw
> > or it gets a+rw (as your documentation recommends)
> > it is still a security issue.
> > The "event name" is trivial to find out by looking at the source code
> > of the target process or just "string target_binary".
> > Restricting to cap_sysadmin is not the answer, since you want unpriv.
> > SElinux is not the answer either.
> > Since it's unpriv, different processes should not be able to mess with
> > user events of other processes.
> > It's a fundamental requirement of any kernel api.
> > This has to be fixed before any bpf discussion.
> > If it means that you need to redesign user_events do it now and
> > excuses like "it's uapi now, so we cannot fix it" are not going to fly.
>
> Looking at this a little because I have a few minutes.
> What's all this unused code?
>
These are stubs to integrate namespace support. I've been working on a
series that adds a tracing namespace support similiar to the IMA
namespace work [1]. That series is ending up taking more time than I
anticipated.
> static inline struct user_event_group
> *user_event_group_from_user_ns(struct user_namespace *user_ns)
> {
> if (user_ns == &init_user_ns)
> return init_group;
>
> return NULL;
> }
>
> static struct user_event_group *current_user_event_group(void)
> {
> struct user_namespace *user_ns = current_user_ns();
> struct user_event_group *group = NULL;
>
> while (user_ns) {
> group = user_event_group_from_user_ns(user_ns);
>
> if (group)
> break;
>
> user_ns = user_ns->parent;
> }
>
> return group;
> }
>
> User namespaces form strict hierarchies so you always end up at
> init_user_ns no matter where you start from in the hierarchy. Return the
> init_group and delete that code above.
>
This is a good point, I'll delete this code and bring it back as part of
the namespace support patch series when appropriate.
> static char *user_event_group_system_name(struct user_namespace *user_ns)
> {
> char *system_name;
> int len = sizeof(USER_EVENTS_SYSTEM) + 1;
>
> if (user_ns != &init_user_ns) {
> /*
> * Unexpected at this point:
> * We only currently support init_user_ns.
> * When we enable more, this will trigger a failure so log.
> */
> pr_warn("user_events: Namespace other than init_user_ns!\n");
> return NULL;
> }
>
> Your delegation model is premised on file permissions of a single file
> in global tracefs. It won't work with user namespaces so let's not give
> the false impression that this is on the table.
>
Users that are given access to the single file still should be able to
be isolated for each other. The series I'm working on does this by
changing the system name of user_events on a per-namespace basis.
This prevents one namespace from messing with another, such as deleting
their events, etc. I'll reach out to you for some time to discuss this
direction to ensure I'm not off base.
> Plus, all of this is also called in a single place during
> trace_events_user_init() which is called from fs_initcall() so you
> couldn't even pass a different user namespace if you wanted to because
> only init_user_ns exists.
In later series this is also called upon file open of user_events_data
to find the right group, etc. I'll get this code removed for now and
bring it back later.
Thanks,
-Beau
1. https://lore.kernel.org/linux-kernel/[email protected]/
On Thu, Jun 01, 2023 at 08:24:14AM -0700, Beau Belgrave wrote:
> On Thu, Jun 01, 2023 at 11:46:13AM +0200, Christian Brauner wrote:
> > On Wed, May 17, 2023 at 05:10:47PM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > > Looks like user events were designed with intention to be unprivileged.
> > > > > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > > > > I doubt other people reviewed it from security perspective.
> > > > > >
> > > > > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > > > > >
> > > > > > For example, I think the following is possible:
> > > > > > fd = open("/sys/kernel/tracing/user_events_data")
> > > > > > ioclt(fd, DIAG_IOCSDEL)
> > > > > > user_events_ioctl_del
> > > > > > delete_user_event(info->group, name);
> > > > > >
> > > > > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > > > > because only one global init_group is created.
> > > > > > So one user can unregister other user event by knowing 'name'.
> > > > > > A security hole, no?
> > >
> > > ...
> > >
> > > > Regarding deleting events, only users that are given access can delete
> > > > events. They must know the event name, just like users with access to
> > > > delete files must know a path (and have access to it). Since the
> > > > write_index and other details are per-process, unless the user has
> > > > access to either /sys/kernel/tracing/events/user_events/* or
> > > > /sys/kernel/tracing/user_events_status, they do not know which names are
> > > > being used.
> > > >
> > > > If that is not enough, we could require CAP_SYSADMIN to be able to
> > > > delete events even when they have access to the file. Users can also
> > > > apply SELinux policies per-file to achieve further isolation, if
> > > > required.
> > >
> > > Whether /sys/kernel/tracing/user_events_status gets g+rw
> > > or it gets a+rw (as your documentation recommends)
> > > it is still a security issue.
> > > The "event name" is trivial to find out by looking at the source code
> > > of the target process or just "string target_binary".
> > > Restricting to cap_sysadmin is not the answer, since you want unpriv.
> > > SElinux is not the answer either.
> > > Since it's unpriv, different processes should not be able to mess with
> > > user events of other processes.
> > > It's a fundamental requirement of any kernel api.
> > > This has to be fixed before any bpf discussion.
> > > If it means that you need to redesign user_events do it now and
> > > excuses like "it's uapi now, so we cannot fix it" are not going to fly.
> >
> > Looking at this a little because I have a few minutes.
> > What's all this unused code?
> >
>
> These are stubs to integrate namespace support. I've been working on a
> series that adds a tracing namespace support similiar to the IMA
> namespace work [1]. That series is ending up taking more time than I
Look, this is all well and nice but you've integrated user events with
tracefs. This is currently a single-instance global filesystem. So what
you're effectively implying is that you're namespacing tracefs by
hanging it off of struct user namespace making it mountable by
unprivileged users. Or what's the plan?
That alone is massive work with _wild_ security implications. My
appetite for exposing more stuff under user namespaces is very low given
the amount of CVEs we've had over the years.
> anticipated.
Yet you were confident enough to leave the namespacing stubs for this
functionality in the code. ;)
What is the overall goal here? Letting arbitrary unprivileged containers
define their own custom user event type by mounting tracefs inside
unprivileged containers? If so, what security story is going to
guarantee that writing arbitrary tracepoints from random unprivileged
containers is safe?
>
> > static inline struct user_event_group
> > *user_event_group_from_user_ns(struct user_namespace *user_ns)
> > {
> > if (user_ns == &init_user_ns)
> > return init_group;
> >
> > return NULL;
> > }
> >
> > static struct user_event_group *current_user_event_group(void)
> > {
> > struct user_namespace *user_ns = current_user_ns();
> > struct user_event_group *group = NULL;
> >
> > while (user_ns) {
> > group = user_event_group_from_user_ns(user_ns);
> >
> > if (group)
> > break;
> >
> > user_ns = user_ns->parent;
> > }
> >
> > return group;
> > }
> >
> > User namespaces form strict hierarchies so you always end up at
> > init_user_ns no matter where you start from in the hierarchy. Return the
> > init_group and delete that code above.
> >
>
> This is a good point, I'll delete this code and bring it back as part of
> the namespace support patch series when appropriate.
>
> > static char *user_event_group_system_name(struct user_namespace *user_ns)
> > {
> > char *system_name;
> > int len = sizeof(USER_EVENTS_SYSTEM) + 1;
> >
> > if (user_ns != &init_user_ns) {
> > /*
> > * Unexpected at this point:
> > * We only currently support init_user_ns.
> > * When we enable more, this will trigger a failure so log.
> > */
> > pr_warn("user_events: Namespace other than init_user_ns!\n");
> > return NULL;
> > }
> >
> > Your delegation model is premised on file permissions of a single file
> > in global tracefs. It won't work with user namespaces so let's not give
> > the false impression that this is on the table.
> >
>
> Users that are given access to the single file still should be able to
> be isolated for each other. The series I'm working on does this by
How? You currently have a single file that will have to be shared across
all unprivileged containers which ultimately can only mean that you need
to either bind-mount tracefs or bind-mount the single file into each
container. If you have 1000 containers each with isolated idmaps from
each other you're going to have a lot of fun trying to ensure that each
container has access rights to that file.
> changing the system name of user_events on a per-namespace basis.
What is the "system name" and how does it protect against namespaces
messing with each other?
On Thu, Jun 01, 2023 at 05:57:22PM +0200, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 08:24:14AM -0700, Beau Belgrave wrote:
> > On Thu, Jun 01, 2023 at 11:46:13AM +0200, Christian Brauner wrote:
> > > On Wed, May 17, 2023 at 05:10:47PM -0700, Alexei Starovoitov wrote:
> > > > On Wed, May 17, 2023 at 9:50 AM Beau Belgrave <[email protected]> wrote:
> > > > > >
> > > > > > >
> > > > > > > Looks like user events were designed with intention to be unprivileged.
> > > > > > > When I looked at kernel/trace/trace_events_user.c I assumed root.
> > > > > > > I doubt other people reviewed it from security perspective.
> > > > > > >
> > > > > > > Recommending "chmod a+rw /sys/kernel/tracing/user_events_data" doesn't sound like a good idea.
> > > > > > >
> > > > > > > For example, I think the following is possible:
> > > > > > > fd = open("/sys/kernel/tracing/user_events_data")
> > > > > > > ioclt(fd, DIAG_IOCSDEL)
> > > > > > > user_events_ioctl_del
> > > > > > > delete_user_event(info->group, name);
> > > > > > >
> > > > > > > 'info' is different for every FD, but info->group is the same for all users/processes/fds,
> > > > > > > because only one global init_group is created.
> > > > > > > So one user can unregister other user event by knowing 'name'.
> > > > > > > A security hole, no?
> > > >
> > > > ...
> > > >
> > > > > Regarding deleting events, only users that are given access can delete
> > > > > events. They must know the event name, just like users with access to
> > > > > delete files must know a path (and have access to it). Since the
> > > > > write_index and other details are per-process, unless the user has
> > > > > access to either /sys/kernel/tracing/events/user_events/* or
> > > > > /sys/kernel/tracing/user_events_status, they do not know which names are
> > > > > being used.
> > > > >
> > > > > If that is not enough, we could require CAP_SYSADMIN to be able to
> > > > > delete events even when they have access to the file. Users can also
> > > > > apply SELinux policies per-file to achieve further isolation, if
> > > > > required.
> > > >
> > > > Whether /sys/kernel/tracing/user_events_status gets g+rw
> > > > or it gets a+rw (as your documentation recommends)
> > > > it is still a security issue.
> > > > The "event name" is trivial to find out by looking at the source code
> > > > of the target process or just "string target_binary".
> > > > Restricting to cap_sysadmin is not the answer, since you want unpriv.
> > > > SElinux is not the answer either.
> > > > Since it's unpriv, different processes should not be able to mess with
> > > > user events of other processes.
> > > > It's a fundamental requirement of any kernel api.
> > > > This has to be fixed before any bpf discussion.
> > > > If it means that you need to redesign user_events do it now and
> > > > excuses like "it's uapi now, so we cannot fix it" are not going to fly.
> > >
> > > Looking at this a little because I have a few minutes.
> > > What's all this unused code?
> > >
> >
> > These are stubs to integrate namespace support. I've been working on a
> > series that adds a tracing namespace support similiar to the IMA
> > namespace work [1]. That series is ending up taking more time than I
>
> Look, this is all well and nice but you've integrated user events with
> tracefs. This is currently a single-instance global filesystem. So what
> you're effectively implying is that you're namespacing tracefs by
> hanging it off of struct user namespace making it mountable by
> unprivileged users. Or what's the plan?
>
We don't have plans for unprivileged users currently. I think that is a
great goal and requires a proper tracing namespace, which we currently
don't have. I've done some thinking on this, but I would like to hear
your thoughts and others on how to do this properly. We do talk about
this in the tracefs meetings (those might be out of your time zone
unfortunately).
> That alone is massive work with _wild_ security implications. My
> appetite for exposing more stuff under user namespaces is very low given
> the amount of CVEs we've had over the years.
>
Ok, I based that approach on the feedback given in LPC 2022 - Containers
and Checkpoint/Retore MC [1]. I believe you gave feedback to use user
namespaces to provide the encapsulation that was required :)
> > anticipated.
>
> Yet you were confident enough to leave the namespacing stubs for this
> functionality in the code. ;)
>
> What is the overall goal here? Letting arbitrary unprivileged containers
> define their own custom user event type by mounting tracefs inside
> unprivileged containers? If so, what security story is going to
> guarantee that writing arbitrary tracepoints from random unprivileged
> containers is safe?
>
Unprivileged containers is not a goal, however, having a per-pod
user_event system name, such as user_event_<pod_name>, would be ideal
for certain diagnostic scenarios, such as monitoring the entire pod.
When you have a lot of containers, you also want to limit how many
tracepoints each container can create, even if they are given access to
the tracefs file. The per-group can limit how many events/tracepoints
that container can go create, since we currently only have 16-bit
identifiers for trace_event's we need to be cautious we don't run out.
user_events in general has tracepoint validators to ensure the payloads
coming in are "safe" from what the kernel might do with them, such as
filtering out data.
> >
> > > static inline struct user_event_group
> > > *user_event_group_from_user_ns(struct user_namespace *user_ns)
> > > {
> > > if (user_ns == &init_user_ns)
> > > return init_group;
> > >
> > > return NULL;
> > > }
> > >
> > > static struct user_event_group *current_user_event_group(void)
> > > {
> > > struct user_namespace *user_ns = current_user_ns();
> > > struct user_event_group *group = NULL;
> > >
> > > while (user_ns) {
> > > group = user_event_group_from_user_ns(user_ns);
> > >
> > > if (group)
> > > break;
> > >
> > > user_ns = user_ns->parent;
> > > }
> > >
> > > return group;
> > > }
> > >
> > > User namespaces form strict hierarchies so you always end up at
> > > init_user_ns no matter where you start from in the hierarchy. Return the
> > > init_group and delete that code above.
> > >
> >
> > This is a good point, I'll delete this code and bring it back as part of
> > the namespace support patch series when appropriate.
> >
> > > static char *user_event_group_system_name(struct user_namespace *user_ns)
> > > {
> > > char *system_name;
> > > int len = sizeof(USER_EVENTS_SYSTEM) + 1;
> > >
> > > if (user_ns != &init_user_ns) {
> > > /*
> > > * Unexpected at this point:
> > > * We only currently support init_user_ns.
> > > * When we enable more, this will trigger a failure so log.
> > > */
> > > pr_warn("user_events: Namespace other than init_user_ns!\n");
> > > return NULL;
> > > }
> > >
> > > Your delegation model is premised on file permissions of a single file
> > > in global tracefs. It won't work with user namespaces so let's not give
> > > the false impression that this is on the table.
> > >
> >
> > Users that are given access to the single file still should be able to
> > be isolated for each other. The series I'm working on does this by
>
> How? You currently have a single file that will have to be shared across
> all unprivileged containers which ultimately can only mean that you need
> to either bind-mount tracefs or bind-mount the single file into each
> container. If you have 1000 containers each with isolated idmaps from
> each other you're going to have a lot of fun trying to ensure that each
> container has access rights to that file.
>
I followed the patch I already stated, there would be a new tracefs file
that only admins have access to. Admins can then create groups, assign
limits, then finally attach them user namespaces once they have been
configured.
I'm sure there are other approaches, see [1] where another approach was
proposed by Mathieu, but then feedback in the crowd was to use user
namespaces instead.
> > changing the system name of user_events on a per-namespace basis.
>
> What is the "system name" and how does it protect against namespaces
> messing with each other?
trace_events in the tracing facility require both a system name and an
event name. IE: sched/sched_waking, sched is the system name,
sched_waking is the event name. For user_events in the root group, the
system name is "user_events". When groups are introduced, the system
name can be "user_events_<GUID>" for example.
The user_events ABI never lets anyone dictate the system name to allow
for this isolation. IE: user_events/myevent vs
user_events<GUID>/myevent are entirely different trace_events on the
system. This is called out as a note in the user_events documentation
today that the system name can and will change to allow for isolation in
the future.
Thanks,
-Beau
1. https://www.youtube.com/watch?v=zai3gvpuEHc&t=4403s
Hi Beau,
On Thu, 1 Jun 2023 09:29:21 -0700
Beau Belgrave <[email protected]> wrote:
> > > These are stubs to integrate namespace support. I've been working on a
> > > series that adds a tracing namespace support similiar to the IMA
> > > namespace work [1]. That series is ending up taking more time than I
> >
> > Look, this is all well and nice but you've integrated user events with
> > tracefs. This is currently a single-instance global filesystem. So what
> > you're effectively implying is that you're namespacing tracefs by
> > hanging it off of struct user namespace making it mountable by
> > unprivileged users. Or what's the plan?
> >
>
> We don't have plans for unprivileged users currently. I think that is a
> great goal and requires a proper tracing namespace, which we currently
> don't have. I've done some thinking on this, but I would like to hear
> your thoughts and others on how to do this properly. We do talk about
> this in the tracefs meetings (those might be out of your time zone
> unfortunately).
>
> > That alone is massive work with _wild_ security implications. My
> > appetite for exposing more stuff under user namespaces is very low given
> > the amount of CVEs we've had over the years.
> >
>
> Ok, I based that approach on the feedback given in LPC 2022 - Containers
> and Checkpoint/Retore MC [1]. I believe you gave feedback to use user
> namespaces to provide the encapsulation that was required :)
Even with the user namespace, I think we still need to provide separate
"eventname-space" for each application, since it may depend on the context
who and where it is launched. I think the easiest solution is (perhaps)
providing a PID-based new groups for each instance (the PID-prefix or
suffix will be hidden from the application).
I think it may not good to allow unprivileged user processes to detect
the registered event name each other by default.
>
> > > anticipated.
> >
> > Yet you were confident enough to leave the namespacing stubs for this
> > functionality in the code. ;)
> >
> > What is the overall goal here? Letting arbitrary unprivileged containers
> > define their own custom user event type by mounting tracefs inside
> > unprivileged containers? If so, what security story is going to
> > guarantee that writing arbitrary tracepoints from random unprivileged
> > containers is safe?
> >
>
> Unprivileged containers is not a goal, however, having a per-pod
> user_event system name, such as user_event_<pod_name>, would be ideal
> for certain diagnostic scenarios, such as monitoring the entire pod.
That can be done in the user-space tools, not in the kernel.
> When you have a lot of containers, you also want to limit how many
> tracepoints each container can create, even if they are given access to
> the tracefs file. The per-group can limit how many events/tracepoints
> that container can go create, since we currently only have 16-bit
> identifiers for trace_event's we need to be cautious we don't run out.
I agree, we need to have a knob to limit it to avoid DoS attack.
> user_events in general has tracepoint validators to ensure the payloads
> coming in are "safe" from what the kernel might do with them, such as
> filtering out data.
[...]
> > > changing the system name of user_events on a per-namespace basis.
> >
> > What is the "system name" and how does it protect against namespaces
> > messing with each other?
>
> trace_events in the tracing facility require both a system name and an
> event name. IE: sched/sched_waking, sched is the system name,
> sched_waking is the event name. For user_events in the root group, the
> system name is "user_events". When groups are introduced, the system
> name can be "user_events_<GUID>" for example.
So my suggestion is using PID in root pid namespace instead of GUID
by default.
Thank you,
--
Masami Hiramatsu (Google) <[email protected]>
Hi,
On Tue, 16 May 2023 17:36:28 -0700
Alexei Starovoitov <[email protected]> wrote:
> BPF progs have three ways to access kernel tracepoints:
> 1. traditional tracepoint
This is the trace_events, which is used by ftrace, right?
> 2. raw tracepoint
> 3. raw tracepoint with BTF
>
> 1 was added first and now rarely used (only by old tools), since it's slow.
> 2 was added later to address performance concerns.
> 3 was added after BTF was introduced to provide accurate types.
>
> 3 is the only one that bpf community recommends and is the one that is used most often.
>
> As far as I know trace_events were never connected to bpf.
> Unless somebody sneaked the code in without us seeing it.
With this design, I understand that you may not want to connect BPF
directly to user_events. It needs a different model.
>
> I think you're trying to model user_events+bpf as 1.
> Which means that you'll be repeating the same mistakes.
The user_events is completely different from the traceppoint and
must have no BTF with it.
Also, all information must be sent in the user-written data packet.
(No data structure, event if there is a structure, it must be fully
contained in the packet.)
For the tracepoint, there is a function call with some values or
pointers of data structure. So it is meaningful to skip using the
traceevent (which converts all pointers to actual field values of
the data structure and store it to ftrace buffer) because most of
the values can be ignored in the BPF prog.
However, for the user_events, the data is just passed from the
user as a data packet, and BPF prog can access to the data packet
(to avoid accessing malicious data, data validator can not be
skipped). So this seems like 1. but actually you can access to
the validated data on perf buffer. Maybe we can allow BPF to
hook the write syscall and access user-space data, but it may
not safe. I think this is the safest way to do that.
Thank you,
--
Masami Hiramatsu (Google) <[email protected]>
On Tue, Jun 06, 2023 at 10:37:52PM +0900, Masami Hiramatsu wrote:
> Hi Beau,
>
> On Thu, 1 Jun 2023 09:29:21 -0700
> Beau Belgrave <[email protected]> wrote:
>
> > > > These are stubs to integrate namespace support. I've been working on a
> > > > series that adds a tracing namespace support similiar to the IMA
> > > > namespace work [1]. That series is ending up taking more time than I
> > >
> > > Look, this is all well and nice but you've integrated user events with
> > > tracefs. This is currently a single-instance global filesystem. So what
> > > you're effectively implying is that you're namespacing tracefs by
> > > hanging it off of struct user namespace making it mountable by
> > > unprivileged users. Or what's the plan?
> > >
> >
> > We don't have plans for unprivileged users currently. I think that is a
> > great goal and requires a proper tracing namespace, which we currently
> > don't have. I've done some thinking on this, but I would like to hear
> > your thoughts and others on how to do this properly. We do talk about
> > this in the tracefs meetings (those might be out of your time zone
> > unfortunately).
> >
> > > That alone is massive work with _wild_ security implications. My
> > > appetite for exposing more stuff under user namespaces is very low given
> > > the amount of CVEs we've had over the years.
> > >
> >
> > Ok, I based that approach on the feedback given in LPC 2022 - Containers
> > and Checkpoint/Retore MC [1]. I believe you gave feedback to use user
> > namespaces to provide the encapsulation that was required :)
>
> Even with the user namespace, I think we still need to provide separate
> "eventname-space" for each application, since it may depend on the context
> who and where it is launched. I think the easiest solution is (perhaps)
> providing a PID-based new groups for each instance (the PID-prefix or
> suffix will be hidden from the application).
> I think it may not good to allow unprivileged user processes to detect
> the registered event name each other by default.
>
Regarding PID, are you referring the PID namespace the application
resides within? Or the actual single PID of the process?
In production we monitor things in sets that encompass more than a
single application. A requirement we need is the ability to group
like-processes together for monitoring purposes.
We really need a way to know these set of events are for this group, the
easiest way to do that is by the system name provided on each event. If
this were to be single PID (and not the PID namespace), then we wouldn't
be able to achieve this requirement. Ideally an admin would be able to
setup the name in some way that means something to them in user-space.
IE: user_events_critical as a system name, vs knowing (user_events_5
or user_events_6 or user_events_8) are "critical".
Another simple example is the same "application" but it gets exec'd more
than once. Each time it execs the system name would change if it was
really by the actual PID vs PID namespace. This would be very hard to
manage on a perf_event or eBPF level for us. It would also vastly
increase the number of trace_events that would get created on the
system.
> >
> > > > anticipated.
> > >
> > > Yet you were confident enough to leave the namespacing stubs for this
> > > functionality in the code. ;)
> > >
> > > What is the overall goal here? Letting arbitrary unprivileged containers
> > > define their own custom user event type by mounting tracefs inside
> > > unprivileged containers? If so, what security story is going to
> > > guarantee that writing arbitrary tracepoints from random unprivileged
> > > containers is safe?
> > >
> >
> > Unprivileged containers is not a goal, however, having a per-pod
> > user_event system name, such as user_event_<pod_name>, would be ideal
> > for certain diagnostic scenarios, such as monitoring the entire pod.
>
> That can be done in the user-space tools, not in the kernel.
>
Right, during k8s pod creation we would create the group and name it
something that makes sense to the operator as an example. I'm sure there
are lots of scenarios user-space can do. However, they almost always
involve more than 1 application together in our scenarios.
> > When you have a lot of containers, you also want to limit how many
> > tracepoints each container can create, even if they are given access to
> > the tracefs file. The per-group can limit how many events/tracepoints
> > that container can go create, since we currently only have 16-bit
> > identifiers for trace_event's we need to be cautious we don't run out.
>
> I agree, we need to have a knob to limit it to avoid DoS attack.
>
> > user_events in general has tracepoint validators to ensure the payloads
> > coming in are "safe" from what the kernel might do with them, such as
> > filtering out data.
>
> [...]
> > > > changing the system name of user_events on a per-namespace basis.
> > >
> > > What is the "system name" and how does it protect against namespaces
> > > messing with each other?
> >
> > trace_events in the tracing facility require both a system name and an
> > event name. IE: sched/sched_waking, sched is the system name,
> > sched_waking is the event name. For user_events in the root group, the
> > system name is "user_events". When groups are introduced, the system
> > name can be "user_events_<GUID>" for example.
>
> So my suggestion is using PID in root pid namespace instead of GUID
> by default.
>
By default this would be fine as long as admins can change this to a larger
group before activation for our purposes. PID however, might be a bit
too granular of an identifier for our scenarios as I've explained above.
I think these logical steps make sense:
1. Create "event namespace" (Default system name suffix, max count)
2. Setup "event namespace" (Change system name suffix, max count)
3. Attach "event namespace"
I'm not sure we know what to attach to in #3 yet, so far both a tracer
namespace and user namespace have been proposed. I think we need to
answer that. Right now everything is in the root "event namespace" and
is simply referred to by default as "user_events" as the system name
without a suffix, and with the boot configured max event count.
Thanks,
-Beau
> Thank you,
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>
On Tue, Jun 6, 2023 at 6:57 AM Masami Hiramatsu <[email protected]> wrote:
>
> Hi,
>
> On Tue, 16 May 2023 17:36:28 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
> > BPF progs have three ways to access kernel tracepoints:
> > 1. traditional tracepoint
>
> This is the trace_events, which is used by ftrace, right?
>
> > 2. raw tracepoint
> > 3. raw tracepoint with BTF
> >
> > 1 was added first and now rarely used (only by old tools), since it's slow.
> > 2 was added later to address performance concerns.
> > 3 was added after BTF was introduced to provide accurate types.
> >
> > 3 is the only one that bpf community recommends and is the one that is used most often.
> >
> > As far as I know trace_events were never connected to bpf.
> > Unless somebody sneaked the code in without us seeing it.
>
> With this design, I understand that you may not want to connect BPF
> directly to user_events. It needs a different model.
>
> >
> > I think you're trying to model user_events+bpf as 1.
> > Which means that you'll be repeating the same mistakes.
>
> The user_events is completely different from the traceppoint and
> must have no BTF with it.
> Also, all information must be sent in the user-written data packet.
> (No data structure, event if there is a structure, it must be fully
> contained in the packet.)
>
> For the tracepoint, there is a function call with some values or
> pointers of data structure. So it is meaningful to skip using the
> traceevent (which converts all pointers to actual field values of
> the data structure and store it to ftrace buffer) because most of
> the values can be ignored in the BPF prog.
>
> However, for the user_events, the data is just passed from the
> user as a data packet, and BPF prog can access to the data packet
> (to avoid accessing malicious data, data validator can not be
> skipped). So this seems like 1. but actually you can access to
> the validated data on perf buffer. Maybe we can allow BPF to
> hook the write syscall and access user-space data, but it may
> not safe. I think this is the safest way to do that.
I'm trying to understand why we need a new kernel concept for all
this. It looks like we are just creating a poor man's
publisher/subscriber solution in the kernel, but mostly intend to use
it from user-space? Why not just use Unix domain sockets for this,
though? Use SOCK_SEQPACKET, put "event data" into a single packet
that's guaranteed to not be broken up. Expose this to other processes
through named pipes, if necessary.
Sorry if it's naive questions, but it's not clear what problem
user_events are solving and why we need a new thing and can't use
existing kernel primitives?
>
> Thank you,
>
> --
> Masami Hiramatsu (Google) <[email protected]>
On Tue, Jun 06, 2023 at 09:57:14AM -0700, Andrii Nakryiko wrote:
> On Tue, Jun 6, 2023 at 6:57 AM Masami Hiramatsu <[email protected]> wrote:
> >
> > Hi,
> >
> > On Tue, 16 May 2023 17:36:28 -0700
> > Alexei Starovoitov <[email protected]> wrote:
> >
> > > BPF progs have three ways to access kernel tracepoints:
> > > 1. traditional tracepoint
> >
> > This is the trace_events, which is used by ftrace, right?
> >
> > > 2. raw tracepoint
> > > 3. raw tracepoint with BTF
> > >
> > > 1 was added first and now rarely used (only by old tools), since it's slow.
> > > 2 was added later to address performance concerns.
> > > 3 was added after BTF was introduced to provide accurate types.
> > >
> > > 3 is the only one that bpf community recommends and is the one that is used most often.
> > >
> > > As far as I know trace_events were never connected to bpf.
> > > Unless somebody sneaked the code in without us seeing it.
> >
> > With this design, I understand that you may not want to connect BPF
> > directly to user_events. It needs a different model.
> >
> > >
> > > I think you're trying to model user_events+bpf as 1.
> > > Which means that you'll be repeating the same mistakes.
> >
> > The user_events is completely different from the traceppoint and
> > must have no BTF with it.
> > Also, all information must be sent in the user-written data packet.
> > (No data structure, event if there is a structure, it must be fully
> > contained in the packet.)
> >
> > For the tracepoint, there is a function call with some values or
> > pointers of data structure. So it is meaningful to skip using the
> > traceevent (which converts all pointers to actual field values of
> > the data structure and store it to ftrace buffer) because most of
> > the values can be ignored in the BPF prog.
> >
> > However, for the user_events, the data is just passed from the
> > user as a data packet, and BPF prog can access to the data packet
> > (to avoid accessing malicious data, data validator can not be
> > skipped). So this seems like 1. but actually you can access to
> > the validated data on perf buffer. Maybe we can allow BPF to
> > hook the write syscall and access user-space data, but it may
> > not safe. I think this is the safest way to do that.
>
> I'm trying to understand why we need a new kernel concept for all
> this. It looks like we are just creating a poor man's
> publisher/subscriber solution in the kernel, but mostly intend to use
> it from user-space? Why not just use Unix domain sockets for this,
> though? Use SOCK_SEQPACKET, put "event data" into a single packet
> that's guaranteed to not be broken up. Expose this to other processes
> through named pipes, if necessary.
>
> Sorry if it's naive questions, but it's not clear what problem
> user_events are solving and why we need a new thing and can't use
> existing kernel primitives?
>
There's a number of reasons why we did not do as you suggest.
The first reason is we want to only take the write() syscall cost when
events are wanting to be monitored. This is done at a per-trace_event
level and is not at a per-process level. user_events gives us the
ability to know cheaply when an event is or is not to be written. It
does this by setting/clearing a bit in each process when the trace_event
classes register function is invoked to attach/detach perf or ftrace.
By using a bit instead of bytes, we also have the ability to share
tracing out to the kernel as well as any local user tracer in the
future, this work was started by Mathieu Desnoyers via libside [1].
The second reason is we have found user based buffers to be unreliable
when either the user process is crashing or has a corruption bug. By
having the data buffers reside within the kernel, we prevent this from
happening. If the kernel panics, we can also pull events out of the
perf_event buffers via GDB to understand what our user processes were
doing before the time of the panic.
The third reason is we want to make use of all the features that perf,
ftrace, and eBPF have. We do not want to have to re-write all of those
features. The main things are being able to filter upon event payloads
and aggregate them together. We also selectively turn on and off stack
walking for some events (not all). Perf lets us selectively do this on a
per-event basis in addition to grabbing raw stack data to enable
unwinding via DWARF instructions. When we monitor events via
perf/ftrace, we can find each offset and type for the fields within the
event. We need to know these to properly decode events and analyze them.
Tracefs gives a us a single place to see all of these events and
efficiently decode them, including a stable event ID. We would have to
replicate all of that work in userspace in addition to the other
features we rely upon.
The fourth reason is related to the third, we have a lot of existing
diagnostics that rely upon and setup perf ring buffers. We want the user
and kernel diagnostics to land in the same buffers with the same
timestamps so we can see a full picture of what is going on.
Thanks,
-Beau
1. https://github.com/compudj/libside
>
> >
> > Thank you,
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>
On Tue, 6 Jun 2023 10:05:49 -0700
Beau Belgrave <[email protected]> wrote:
> On Tue, Jun 06, 2023 at 10:37:52PM +0900, Masami Hiramatsu wrote:
> > Hi Beau,
> >
> > On Thu, 1 Jun 2023 09:29:21 -0700
> > Beau Belgrave <[email protected]> wrote:
> >
> > > > > These are stubs to integrate namespace support. I've been working on a
> > > > > series that adds a tracing namespace support similiar to the IMA
> > > > > namespace work [1]. That series is ending up taking more time than I
> > > >
> > > > Look, this is all well and nice but you've integrated user events with
> > > > tracefs. This is currently a single-instance global filesystem. So what
> > > > you're effectively implying is that you're namespacing tracefs by
> > > > hanging it off of struct user namespace making it mountable by
> > > > unprivileged users. Or what's the plan?
> > > >
> > >
> > > We don't have plans for unprivileged users currently. I think that is a
> > > great goal and requires a proper tracing namespace, which we currently
> > > don't have. I've done some thinking on this, but I would like to hear
> > > your thoughts and others on how to do this properly. We do talk about
> > > this in the tracefs meetings (those might be out of your time zone
> > > unfortunately).
> > >
> > > > That alone is massive work with _wild_ security implications. My
> > > > appetite for exposing more stuff under user namespaces is very low given
> > > > the amount of CVEs we've had over the years.
> > > >
> > >
> > > Ok, I based that approach on the feedback given in LPC 2022 - Containers
> > > and Checkpoint/Retore MC [1]. I believe you gave feedback to use user
> > > namespaces to provide the encapsulation that was required :)
> >
> > Even with the user namespace, I think we still need to provide separate
> > "eventname-space" for each application, since it may depend on the context
> > who and where it is launched. I think the easiest solution is (perhaps)
> > providing a PID-based new groups for each instance (the PID-prefix or
> > suffix will be hidden from the application).
> > I think it may not good to allow unprivileged user processes to detect
> > the registered event name each other by default.
> >
>
> Regarding PID, are you referring the PID namespace the application
> resides within? Or the actual single PID of the process?
I meant the actual single PID of the process. That will be the safest
way by default.
>
> In production we monitor things in sets that encompass more than a
> single application. A requirement we need is the ability to group
> like-processes together for monitoring purposes.
>
> We really need a way to know these set of events are for this group, the
> easiest way to do that is by the system name provided on each event. If
> this were to be single PID (and not the PID namespace), then we wouldn't
> be able to achieve this requirement. Ideally an admin would be able to
> setup the name in some way that means something to them in user-space.
Would you mean using the same events between several different processes?
I think it needs more care about security concerns. More on this later.
If not, I think admin has a way to identify which processes are running in
the same group outside of ftrace, and can set the filter correctly.
>
> IE: user_events_critical as a system name, vs knowing (user_events_5
> or user_events_6 or user_events_8) are "critical".
My thought is the latter. Then the process can not access to the
other process's namespace each other.
>
> Another simple example is the same "application" but it gets exec'd more
> than once. Each time it execs the system name would change if it was
> really by the actual PID vs PID namespace. This would be very hard to
> manage on a perf_event or eBPF level for us. It would also vastly
> increase the number of trace_events that would get created on the
> system.
Indeed. But fundamentally allowing user to create (register) the new
event means such DoS attack can happen. That's why we have a limitation
of the max number of user_events. (BTW, I want to make this number
controllable from sysctl or tracefs. Also, we need something against the
event-id space contamination by this DoS attack.)
I also think it would be better to have some rate-limit about registering
new events.
>
> > >
> > > > > anticipated.
> > > >
> > > > Yet you were confident enough to leave the namespacing stubs for this
> > > > functionality in the code. ;)
> > > >
> > > > What is the overall goal here? Letting arbitrary unprivileged containers
> > > > define their own custom user event type by mounting tracefs inside
> > > > unprivileged containers? If so, what security story is going to
> > > > guarantee that writing arbitrary tracepoints from random unprivileged
> > > > containers is safe?
> > > >
> > >
> > > Unprivileged containers is not a goal, however, having a per-pod
> > > user_event system name, such as user_event_<pod_name>, would be ideal
> > > for certain diagnostic scenarios, such as monitoring the entire pod.
> >
> > That can be done in the user-space tools, not in the kernel.
> >
>
> Right, during k8s pod creation we would create the group and name it
> something that makes sense to the operator as an example. I'm sure there
> are lots of scenarios user-space can do. However, they almost always
> involve more than 1 application together in our scenarios.
Yeah, if it is always used with k8s in the backend servers, it maybe OK.
But if it is used in more unreliable environment, we need to consider
about malicious normal users.
>
> > > When you have a lot of containers, you also want to limit how many
> > > tracepoints each container can create, even if they are given access to
> > > the tracefs file. The per-group can limit how many events/tracepoints
> > > that container can go create, since we currently only have 16-bit
> > > identifiers for trace_event's we need to be cautious we don't run out.
> >
> > I agree, we need to have a knob to limit it to avoid DoS attack.
> >
> > > user_events in general has tracepoint validators to ensure the payloads
> > > coming in are "safe" from what the kernel might do with them, such as
> > > filtering out data.
> >
> > [...]
> > > > > changing the system name of user_events on a per-namespace basis.
> > > >
> > > > What is the "system name" and how does it protect against namespaces
> > > > messing with each other?
> > >
> > > trace_events in the tracing facility require both a system name and an
> > > event name. IE: sched/sched_waking, sched is the system name,
> > > sched_waking is the event name. For user_events in the root group, the
> > > system name is "user_events". When groups are introduced, the system
> > > name can be "user_events_<GUID>" for example.
> >
> > So my suggestion is using PID in root pid namespace instead of GUID
> > by default.
> >
>
> By default this would be fine as long as admins can change this to a larger
> group before activation for our purposes. PID however, might be a bit
> too granular of an identifier for our scenarios as I've explained above.
>
> I think these logical steps make sense:
> 1. Create "event namespace" (Default system name suffix, max count)
> 2. Setup "event namespace" (Change system name suffix, max count)
> 3. Attach "event namespace"
>
> I'm not sure we know what to attach to in #3 yet, so far both a tracer
> namespace and user namespace have been proposed. I think we need to
> answer that. Right now everything is in the root "event namespace" and
> is simply referred to by default as "user_events" as the system name
> without a suffix, and with the boot configured max event count.
OK, so I think we are on the same page :)
I think the user namespace is not enough for protecting events on
multi-user system without containers. So it has less flexibility.
The new tracer namespace may be OK, we still need a helper user
program like 'user_eventd' for managing access based on some policy.
If we have a way to manage it with SELinux etc. it will be the best
I think. (Perhaps using UNIX domain socket will give us such flexibility.)
Thank you,
>
> Thanks,
> -Beau
>
> > Thank you,
> >
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>
--
Masami Hiramatsu (Google) <[email protected]>
On Wed, Jun 07, 2023 at 11:07:02PM +0900, Masami Hiramatsu wrote:
> On Tue, 6 Jun 2023 10:05:49 -0700
> Beau Belgrave <[email protected]> wrote:
>
> > On Tue, Jun 06, 2023 at 10:37:52PM +0900, Masami Hiramatsu wrote:
> > > Hi Beau,
> > >
> > > On Thu, 1 Jun 2023 09:29:21 -0700
> > > Beau Belgrave <[email protected]> wrote:
> > >
> > > > > > These are stubs to integrate namespace support. I've been working on a
> > > > > > series that adds a tracing namespace support similiar to the IMA
> > > > > > namespace work [1]. That series is ending up taking more time than I
> > > > >
> > > > > Look, this is all well and nice but you've integrated user events with
> > > > > tracefs. This is currently a single-instance global filesystem. So what
> > > > > you're effectively implying is that you're namespacing tracefs by
> > > > > hanging it off of struct user namespace making it mountable by
> > > > > unprivileged users. Or what's the plan?
> > > > >
> > > >
> > > > We don't have plans for unprivileged users currently. I think that is a
> > > > great goal and requires a proper tracing namespace, which we currently
> > > > don't have. I've done some thinking on this, but I would like to hear
> > > > your thoughts and others on how to do this properly. We do talk about
> > > > this in the tracefs meetings (those might be out of your time zone
> > > > unfortunately).
> > > >
> > > > > That alone is massive work with _wild_ security implications. My
> > > > > appetite for exposing more stuff under user namespaces is very low given
> > > > > the amount of CVEs we've had over the years.
> > > > >
> > > >
> > > > Ok, I based that approach on the feedback given in LPC 2022 - Containers
> > > > and Checkpoint/Retore MC [1]. I believe you gave feedback to use user
> > > > namespaces to provide the encapsulation that was required :)
> > >
> > > Even with the user namespace, I think we still need to provide separate
> > > "eventname-space" for each application, since it may depend on the context
> > > who and where it is launched. I think the easiest solution is (perhaps)
> > > providing a PID-based new groups for each instance (the PID-prefix or
> > > suffix will be hidden from the application).
> > > I think it may not good to allow unprivileged user processes to detect
> > > the registered event name each other by default.
> > >
> >
> > Regarding PID, are you referring the PID namespace the application
> > resides within? Or the actual single PID of the process?
>
> I meant the actual single PID of the process. That will be the safest
> way by default.
>
How do you feel about instead of single PID using the effective user ID?
That way we wouldn't have so many events on the system, and the user is
controlling what runs and can share events.
I could see a way for admins to also override the user_event suffix on a
per-user basis to allow for broader event name scopes if required (IE:
Our k8s and production scenarios).
> >
> > In production we monitor things in sets that encompass more than a
> > single application. A requirement we need is the ability to group
> > like-processes together for monitoring purposes.
> >
> > We really need a way to know these set of events are for this group, the
> > easiest way to do that is by the system name provided on each event. If
> > this were to be single PID (and not the PID namespace), then we wouldn't
> > be able to achieve this requirement. Ideally an admin would be able to
> > setup the name in some way that means something to them in user-space.
>
> Would you mean using the same events between several different processes?
> I think it needs more care about security concerns. More on this later.
>
> If not, I think admin has a way to identify which processes are running in
> the same group outside of ftrace, and can set the filter correctly.
>
Agree that's possible, but it's going to be a massive amount of events
for both tracefs and perf_event ring buffers to handle (we need a perf
FD per trace_event ID).
> >
> > IE: user_events_critical as a system name, vs knowing (user_events_5
> > or user_events_6 or user_events_8) are "critical".
>
> My thought is the latter. Then the process can not access to the
> other process's namespace each other.
>
> >
> > Another simple example is the same "application" but it gets exec'd more
> > than once. Each time it execs the system name would change if it was
> > really by the actual PID vs PID namespace. This would be very hard to
> > manage on a perf_event or eBPF level for us. It would also vastly
> > increase the number of trace_events that would get created on the
> > system.
>
> Indeed. But fundamentally allowing user to create (register) the new
> event means such DoS attack can happen. That's why we have a limitation
> of the max number of user_events. (BTW, I want to make this number
> controllable from sysctl or tracefs. Also, we need something against the
> event-id space contamination by this DoS attack.)
> I also think it would be better to have some rate-limit about registering
> new events.
>
Totally agree here.
> >
> > > >
> > > > > > anticipated.
> > > > >
> > > > > Yet you were confident enough to leave the namespacing stubs for this
> > > > > functionality in the code. ;)
> > > > >
> > > > > What is the overall goal here? Letting arbitrary unprivileged containers
> > > > > define their own custom user event type by mounting tracefs inside
> > > > > unprivileged containers? If so, what security story is going to
> > > > > guarantee that writing arbitrary tracepoints from random unprivileged
> > > > > containers is safe?
> > > > >
> > > >
> > > > Unprivileged containers is not a goal, however, having a per-pod
> > > > user_event system name, such as user_event_<pod_name>, would be ideal
> > > > for certain diagnostic scenarios, such as monitoring the entire pod.
> > >
> > > That can be done in the user-space tools, not in the kernel.
> > >
> >
> > Right, during k8s pod creation we would create the group and name it
> > something that makes sense to the operator as an example. I'm sure there
> > are lots of scenarios user-space can do. However, they almost always
> > involve more than 1 application together in our scenarios.
>
> Yeah, if it is always used with k8s in the backend servers, it maybe OK.
> But if it is used in more unreliable environment, we need to consider
> about malicious normal users.
>
> >
> > > > When you have a lot of containers, you also want to limit how many
> > > > tracepoints each container can create, even if they are given access to
> > > > the tracefs file. The per-group can limit how many events/tracepoints
> > > > that container can go create, since we currently only have 16-bit
> > > > identifiers for trace_event's we need to be cautious we don't run out.
> > >
> > > I agree, we need to have a knob to limit it to avoid DoS attack.
> > >
> > > > user_events in general has tracepoint validators to ensure the payloads
> > > > coming in are "safe" from what the kernel might do with them, such as
> > > > filtering out data.
> > >
> > > [...]
> > > > > > changing the system name of user_events on a per-namespace basis.
> > > > >
> > > > > What is the "system name" and how does it protect against namespaces
> > > > > messing with each other?
> > > >
> > > > trace_events in the tracing facility require both a system name and an
> > > > event name. IE: sched/sched_waking, sched is the system name,
> > > > sched_waking is the event name. For user_events in the root group, the
> > > > system name is "user_events". When groups are introduced, the system
> > > > name can be "user_events_<GUID>" for example.
> > >
> > > So my suggestion is using PID in root pid namespace instead of GUID
> > > by default.
> > >
> >
> > By default this would be fine as long as admins can change this to a larger
> > group before activation for our purposes. PID however, might be a bit
> > too granular of an identifier for our scenarios as I've explained above.
> >
> > I think these logical steps make sense:
> > 1. Create "event namespace" (Default system name suffix, max count)
> > 2. Setup "event namespace" (Change system name suffix, max count)
> > 3. Attach "event namespace"
> >
> > I'm not sure we know what to attach to in #3 yet, so far both a tracer
> > namespace and user namespace have been proposed. I think we need to
> > answer that. Right now everything is in the root "event namespace" and
> > is simply referred to by default as "user_events" as the system name
> > without a suffix, and with the boot configured max event count.
>
> OK, so I think we are on the same page :)
>
> I think the user namespace is not enough for protecting events on
> multi-user system without containers. So it has less flexibility.
> The new tracer namespace may be OK, we still need a helper user
> program like 'user_eventd' for managing access based on some policy.
> If we have a way to manage it with SELinux etc. it will be the best
> I think. (Perhaps using UNIX domain socket will give us such flexibility.)
>
I'm adding Mathieu to CC since I think he had a few cases where a static
namespace wasn't enough and we might need hierarchy support.
If we don't need hierarchy support, I think it's a lot easier to do. I
like the idea of a per-user event namespace vs a per-PID event namespace
knowing what we have to do to monitor all of this via perf. Like I said
above, that will be a huge amount of events compared to a per-user or
namespace approach.
But I do like where this is headed and glad we are having this
conversation :)
Thanks,
-Beau
On Wed, 7 Jun 2023 12:26:11 -0700
Beau Belgrave <[email protected]> wrote:
> On Wed, Jun 07, 2023 at 11:07:02PM +0900, Masami Hiramatsu wrote:
> > On Tue, 6 Jun 2023 10:05:49 -0700
> > Beau Belgrave <[email protected]> wrote:
> >
> > > On Tue, Jun 06, 2023 at 10:37:52PM +0900, Masami Hiramatsu wrote:
> > > > Hi Beau,
> > > >
> > > > On Thu, 1 Jun 2023 09:29:21 -0700
> > > > Beau Belgrave <[email protected]> wrote:
> > > >
> > > > > > > These are stubs to integrate namespace support. I've been working on a
> > > > > > > series that adds a tracing namespace support similiar to the IMA
> > > > > > > namespace work [1]. That series is ending up taking more time than I
> > > > > >
> > > > > > Look, this is all well and nice but you've integrated user events with
> > > > > > tracefs. This is currently a single-instance global filesystem. So what
> > > > > > you're effectively implying is that you're namespacing tracefs by
> > > > > > hanging it off of struct user namespace making it mountable by
> > > > > > unprivileged users. Or what's the plan?
> > > > > >
> > > > >
> > > > > We don't have plans for unprivileged users currently. I think that is a
> > > > > great goal and requires a proper tracing namespace, which we currently
> > > > > don't have. I've done some thinking on this, but I would like to hear
> > > > > your thoughts and others on how to do this properly. We do talk about
> > > > > this in the tracefs meetings (those might be out of your time zone
> > > > > unfortunately).
> > > > >
> > > > > > That alone is massive work with _wild_ security implications. My
> > > > > > appetite for exposing more stuff under user namespaces is very low given
> > > > > > the amount of CVEs we've had over the years.
> > > > > >
> > > > >
> > > > > Ok, I based that approach on the feedback given in LPC 2022 - Containers
> > > > > and Checkpoint/Retore MC [1]. I believe you gave feedback to use user
> > > > > namespaces to provide the encapsulation that was required :)
> > > >
> > > > Even with the user namespace, I think we still need to provide separate
> > > > "eventname-space" for each application, since it may depend on the context
> > > > who and where it is launched. I think the easiest solution is (perhaps)
> > > > providing a PID-based new groups for each instance (the PID-prefix or
> > > > suffix will be hidden from the application).
> > > > I think it may not good to allow unprivileged user processes to detect
> > > > the registered event name each other by default.
> > > >
> > >
> > > Regarding PID, are you referring the PID namespace the application
> > > resides within? Or the actual single PID of the process?
> >
> > I meant the actual single PID of the process. That will be the safest
> > way by default.
> >
>
> How do you feel about instead of single PID using the effective user ID?
>
> That way we wouldn't have so many events on the system, and the user is
> controlling what runs and can share events.
I think that is another option, and maybe good for some application which
does not use thread but forking worker process for data isolation. And
also, I think that can be covered by the tracer namespace too.
One concern is that if the system uses finer grained access control like
SELinux, it will still be problematic, because one of those processes is
compromised, the event-name is detected.
(In this case, the events still needs to be separated for each process?)
I think there are two points: allowing users to choose the most secure or
relaxed method, and which should be the default behavior.
>
> I could see a way for admins to also override the user_event suffix on a
> per-user basis to allow for broader event name scopes if required (IE:
> Our k8s and production scenarios).
>
> > >
> > > In production we monitor things in sets that encompass more than a
> > > single application. A requirement we need is the ability to group
> > > like-processes together for monitoring purposes.
> > >
> > > We really need a way to know these set of events are for this group, the
> > > easiest way to do that is by the system name provided on each event. If
> > > this were to be single PID (and not the PID namespace), then we wouldn't
> > > be able to achieve this requirement. Ideally an admin would be able to
> > > setup the name in some way that means something to them in user-space.
> >
> > Would you mean using the same events between several different processes?
> > I think it needs more care about security concerns. More on this later.
> >
> > If not, I think admin has a way to identify which processes are running in
> > the same group outside of ftrace, and can set the filter correctly.
> >
>
> Agree that's possible, but it's going to be a massive amount of events
> for both tracefs and perf_event ring buffers to handle (we need a perf
> FD per trace_event ID).
So, for that case, it will need "sharing" events among the different
processes.
>
> > >
> > > IE: user_events_critical as a system name, vs knowing (user_events_5
> > > or user_events_6 or user_events_8) are "critical".
> >
> > My thought is the latter. Then the process can not access to the
> > other process's namespace each other.
> >
> > >
> > > Another simple example is the same "application" but it gets exec'd more
> > > than once. Each time it execs the system name would change if it was
> > > really by the actual PID vs PID namespace. This would be very hard to
> > > manage on a perf_event or eBPF level for us. It would also vastly
> > > increase the number of trace_events that would get created on the
> > > system.
> >
> > Indeed. But fundamentally allowing user to create (register) the new
> > event means such DoS attack can happen. That's why we have a limitation
> > of the max number of user_events. (BTW, I want to make this number
> > controllable from sysctl or tracefs. Also, we need something against the
> > event-id space contamination by this DoS attack.)
> > I also think it would be better to have some rate-limit about registering
> > new events.
> >
>
> Totally agree here.
>
> > >
> > > > >
> > > > > > > anticipated.
> > > > > >
> > > > > > Yet you were confident enough to leave the namespacing stubs for this
> > > > > > functionality in the code. ;)
> > > > > >
> > > > > > What is the overall goal here? Letting arbitrary unprivileged containers
> > > > > > define their own custom user event type by mounting tracefs inside
> > > > > > unprivileged containers? If so, what security story is going to
> > > > > > guarantee that writing arbitrary tracepoints from random unprivileged
> > > > > > containers is safe?
> > > > > >
> > > > >
> > > > > Unprivileged containers is not a goal, however, having a per-pod
> > > > > user_event system name, such as user_event_<pod_name>, would be ideal
> > > > > for certain diagnostic scenarios, such as monitoring the entire pod.
> > > >
> > > > That can be done in the user-space tools, not in the kernel.
> > > >
> > >
> > > Right, during k8s pod creation we would create the group and name it
> > > something that makes sense to the operator as an example. I'm sure there
> > > are lots of scenarios user-space can do. However, they almost always
> > > involve more than 1 application together in our scenarios.
> >
> > Yeah, if it is always used with k8s in the backend servers, it maybe OK.
> > But if it is used in more unreliable environment, we need to consider
> > about malicious normal users.
> >
> > >
> > > > > When you have a lot of containers, you also want to limit how many
> > > > > tracepoints each container can create, even if they are given access to
> > > > > the tracefs file. The per-group can limit how many events/tracepoints
> > > > > that container can go create, since we currently only have 16-bit
> > > > > identifiers for trace_event's we need to be cautious we don't run out.
> > > >
> > > > I agree, we need to have a knob to limit it to avoid DoS attack.
> > > >
> > > > > user_events in general has tracepoint validators to ensure the payloads
> > > > > coming in are "safe" from what the kernel might do with them, such as
> > > > > filtering out data.
> > > >
> > > > [...]
> > > > > > > changing the system name of user_events on a per-namespace basis.
> > > > > >
> > > > > > What is the "system name" and how does it protect against namespaces
> > > > > > messing with each other?
> > > > >
> > > > > trace_events in the tracing facility require both a system name and an
> > > > > event name. IE: sched/sched_waking, sched is the system name,
> > > > > sched_waking is the event name. For user_events in the root group, the
> > > > > system name is "user_events". When groups are introduced, the system
> > > > > name can be "user_events_<GUID>" for example.
> > > >
> > > > So my suggestion is using PID in root pid namespace instead of GUID
> > > > by default.
> > > >
> > >
> > > By default this would be fine as long as admins can change this to a larger
> > > group before activation for our purposes. PID however, might be a bit
> > > too granular of an identifier for our scenarios as I've explained above.
> > >
> > > I think these logical steps make sense:
> > > 1. Create "event namespace" (Default system name suffix, max count)
> > > 2. Setup "event namespace" (Change system name suffix, max count)
> > > 3. Attach "event namespace"
> > >
> > > I'm not sure we know what to attach to in #3 yet, so far both a tracer
> > > namespace and user namespace have been proposed. I think we need to
> > > answer that. Right now everything is in the root "event namespace" and
> > > is simply referred to by default as "user_events" as the system name
> > > without a suffix, and with the boot configured max event count.
> >
> > OK, so I think we are on the same page :)
> >
> > I think the user namespace is not enough for protecting events on
> > multi-user system without containers. So it has less flexibility.
> > The new tracer namespace may be OK, we still need a helper user
> > program like 'user_eventd' for managing access based on some policy.
> > If we have a way to manage it with SELinux etc. it will be the best
> > I think. (Perhaps using UNIX domain socket will give us such flexibility.)
> >
>
> I'm adding Mathieu to CC since I think he had a few cases where a static
> namespace wasn't enough and we might need hierarchy support.
>
> If we don't need hierarchy support, I think it's a lot easier to do. I
> like the idea of a per-user event namespace vs a per-PID event namespace
> knowing what we have to do to monitor all of this via perf. Like I said
> above, that will be a huge amount of events compared to a per-user or
> namespace approach.
I think we can start with non hierarchy, but just grouping it. Adding
hierarchy can be done afterwards.
>
> But I do like where this is headed and glad we are having this
> conversation :)
Yeah, me too :)
Thank you!
>
> Thanks,
> -Beau
--
Masami Hiramatsu (Google) <[email protected]>