Hello,
I'm currently seeing the latest Linus master being somewhat broken w.r.t.
kprobes. In particular, I have the following test case:
#!/bin/bash
mkfs.btrfs -f /dev/vdc &> /dev/null
mount /dev/vdc /media/scratch/
bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
&>bpf-output &
bpf_trace_pid=$!
# force btrfs_sync_file to be called
sleep 2
xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
kill $bpf_trace_pid
sleep 1
grep -q kprobe bpf-output
retval=$?
rm -f bpf-output
umount /media/scratch
exit $retval
It traces btrfs_sync_file, which is called when fsync is executed on a
btrfs file; however, I don't see the stack trace being printed, i.e. the
kprobe doesn't fire at all. The following alternative program:
bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint:
%s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n",
kstack());}'
only prints the stack from the tracepoint and not from the kprobe, even
though the tracepoint is called from within the btrfs_sync_file function.
I started bisecting this and arrived at the following commit:
0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
FWIW the following series is applied on the kernel I was testing:
https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/
but it's still broken.
Hi Nikolay,
On Wed, 27 Jan 2021 15:43:29 +0200
Nikolay Borisov <[email protected]> wrote:
> Hello,
>
> I'm currently seeing the latest Linus master being somewhat broken w.r.t.
> kprobes. In particular, I have the following test case:
>
> #!/bin/bash
>
> mkfs.btrfs -f /dev/vdc &> /dev/null
> mount /dev/vdc /media/scratch/
>
> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> &>bpf-output &
> bpf_trace_pid=$!
>
> # force btrfs_sync_file to be called
> sleep 2
> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
>
> kill $bpf_trace_pid
> sleep 1
>
> grep -q kprobe bpf-output
> retval=$?
> rm -f bpf-output
> umount /media/scratch
>
> exit $retval
>
> It traces btrfs_sync_file, which is called when fsync is executed on a
> btrfs file; however, I don't see the stack trace being printed, i.e. the
> kprobe doesn't fire at all. The following alternative program:
>
> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint:
> %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n",
> kstack());}'
>
> only prints the stack from the tracepoint and not from the kprobe, even
> though the tracepoint is called from within the btrfs_sync_file function.
Thank you for reporting!
If you don't mind, could you confirm it with ftrace (tracefs)?
bpftrace involves too many moving parts; it is better to test this with a
simpler method.
I'm not familiar with bpftrace, but I think you can check it with:
# cd /sys/kernel/debug/tracing
# echo p:myevent btrfs_sync_file >> kprobe_events
# echo stacktrace > events/kprobes/myevent/trigger
(or echo 1 > options/stacktrace, if the trigger file doesn't exist)
Could you also share your kernel config, so that we can reproduce it?
>
> I started bisecting this and arrived at the following commit:
>
> 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
>
> FWIW the following series is applied on the kernel I was testing:
> https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/
>
> but it's still broken.
Peter, would you have any idea?
Thank you,
--
Masami Hiramatsu <[email protected]>
On Thu, 28 Jan 2021 00:13:53 +0900
Masami Hiramatsu <[email protected]> wrote:
> Hi Nikolay,
>
> On Wed, 27 Jan 2021 15:43:29 +0200
> Nikolay Borisov <[email protected]> wrote:
>
> > <snip>
>
> Thank you for reporting!
>
> If you don't mind, could you confirm it with ftrace (tracefs)?
> bpftrace involves too many moving parts; it is better to test this with a
> simpler method.
> I'm not familiar with bpftrace, but I think you can check it with:
>
> # cd /sys/kernel/debug/tracing
> # echo p:myevent btrfs_sync_file >> kprobe_events
> # echo stacktrace > events/kprobes/myevent/trigger
> (or echo 1 > options/stacktrace , if trigger file doesn't exist)
Of course, you also have to enable the event:
# echo 1 > events/kprobes/myevent/enable
And check the results:
# cat trace
> Could you also share your kernel config, so that we can reproduce it?
Thank you,
>
> >
> > I started bisecting this and arrived at the following commit:
> >
> > 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> >
> > FWIW the following series is applied on the kernel I was testing:
> > https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/
> >
> > but it's still broken.
>
> Peter, would you have any idea?
>
> Thank you,
>
> --
> Masami Hiramatsu <[email protected]>
--
Masami Hiramatsu <[email protected]>
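For completeness, the same tracefs steps can also be driven programmatically; a minimal C sketch (assuming tracefs is mounted at /sys/kernel/debug/tracing, CONFIG_KPROBE_EVENTS is enabled, and the program runs as root) might look like:

/*
 * Sketch only: the tracefs steps above, driven from C.
 * Afterwards, read /sys/kernel/debug/tracing/trace to see the events.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *mode, const char *val)
{
	FILE *f = fopen(path, mode);

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	const char *t = "/sys/kernel/debug/tracing";
	char path[256];

	/* p:myevent btrfs_sync_file (append, so existing probes survive) */
	snprintf(path, sizeof(path), "%s/kprobe_events", t);
	write_str(path, "a", "p:myevent btrfs_sync_file\n");

	/* dump a stack trace every time the probe fires */
	snprintf(path, sizeof(path), "%s/events/kprobes/myevent/trigger", t);
	write_str(path, "w", "stacktrace\n");

	/* enable the event; results appear in .../trace */
	snprintf(path, sizeof(path), "%s/events/kprobes/myevent/enable", t);
	write_str(path, "w", "1\n");
	return 0;
}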
On Thu, Jan 28, 2021 at 12:13:53AM +0900, Masami Hiramatsu wrote:
> Peter, would you have any idea?
Without a clear error, I've no clue yet. I'll try and investigate a
little more once we have a little more data.
On 27.01.21 г. 17:24 ч., Masami Hiramatsu wrote:
> On Thu, 28 Jan 2021 00:13:53 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
>> Hi Nikolay,
>>
>> On Wed, 27 Jan 2021 15:43:29 +0200
>> Nikolay Borisov <[email protected]> wrote:
>>
>>> <snip>
>>
>> Thank you for reporting!
>>
>> If you don't mind, could you confirm it with ftrace (tracefs)?
>> bpftrace involves too many moving parts; it is better to test this with a
>> simpler method.
>> I'm not familiar with bpftrace, but I think you can check it with:
>>
>> # cd /sys/kernel/debug/tracing
>> # echo p:myevent btrfs_sync_file >> kprobe_events
>> # echo stacktrace > events/kprobes/myevent/trigger
>> (or echo 1 > options/stacktrace , if trigger file doesn't exist)
>
> Of course, also you have to enable the event.
> # echo 1 > events/kprobes/myevent/enable
>
> And check the results
>
> # cat trace
>
>
>> Could you also share your kernel config, so that we can reproduce it?
>
I've attached the config, and indeed with the scenario you proposed it
seems to work. I see:
xfs_io-20280 [000] d.Z. 9900.748633: myevent:
(btrfs_sync_file+0x0/0x580)
xfs_io-20280 [000] d.Z. 9900.748647: <stack trace>
=> kprobe_trace_func
=> kprobe_dispatcher
=> kprobe_int3_handler
=> exc_int3
=> asm_exc_int3
=> btrfs_sync_file
=> do_fsync
=> __x64_sys_fsync
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<snip>
On Wed, 27 Jan 2021 19:57:56 +0200
Nikolay Borisov <[email protected]> wrote:
>
>
> On 27.01.21 г. 17:24 ч., Masami Hiramatsu wrote:
> > <snip>
>
> I've attached the config, and indeed with the scenario you proposed it
> seems to work. I see:
>
> xfs_io-20280 [000] d.Z. 9900.748633: myevent:
> (btrfs_sync_file+0x0/0x580)
> xfs_io-20280 [000] d.Z. 9900.748647: <stack trace>
> => kprobe_trace_func
> => kprobe_dispatcher
> => kprobe_int3_handler
> => exc_int3
> => asm_exc_int3
> => btrfs_sync_file
> => do_fsync
> => __x64_sys_fsync
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
Hmm, then there might be a problem in bpftrace or eBPF (need more info).
At least kprobes itself isn't broken.
I guess they check "in_nmi()" and skip such events?
Thank you,
--
Masami Hiramatsu <[email protected]>
Hi,
On Thu, 28 Jan 2021 10:34:15 +0900
Masami Hiramatsu <[email protected]> wrote:
> On Wed, 27 Jan 2021 19:57:56 +0200
> Nikolay Borisov <[email protected]> wrote:
>
> > <snip>
> > I've attached the config, and indeed with the scenario you proposed it
> > seems to work. I see:
> >
> > xfs_io-20280 [000] d.Z. 9900.748633: myevent:
> > (btrfs_sync_file+0x0/0x580)
> > xfs_io-20280 [000] d.Z. 9900.748647: <stack trace>
> > => kprobe_trace_func
> > => kprobe_dispatcher
> > => kprobe_int3_handler
> > => exc_int3
> > => asm_exc_int3
> > => btrfs_sync_file
> > => do_fsync
> > => __x64_sys_fsync
> > => do_syscall_64
> > => entry_SYSCALL_64_after_hwframe
>
> Hmm, then there might be a problem in bpftrace or eBPF (need more info).
> At least kprobes itself isn't broken.
> I guess they check "in_nmi()" and skip such events?
Yeah, there is. Nikolay, could you try this tentative patch?
Of course this just drops the NMI check from the handler, so an alternative
check is required; but I'm not sure what the original code was guarding
against. As far as I can see, there is no re-entrance blocking flag, nor any
locking among eBPF programs at runtime.
Alexei, could you tell me what the situation of concern is for bpf?
Thank you,
From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
From: Masami Hiramatsu <[email protected]>
Date: Thu, 28 Jan 2021 12:31:02 +0900
Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
changed the kprobe handler to run in NMI context, in_nmi() always returns
true there. This means the bpf events on kprobes are always skipped.
Signed-off-by: Masami Hiramatsu <[email protected]>
---
kernel/trace/bpf_trace.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..764400260eb6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
-		return 1;
-
 	cant_sleep();
 
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
--
2.25.1
--
Masami Hiramatsu
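Some context on why that in_nmi() test now always fires on this path: exc_int3() brackets the kprobe handler with nmi_enter()/nmi_exit(), which sets the NMI bits in preempt_count. Below is a tiny self-contained model of that; the bit layout mirrors include/linux/preempt.h around this kernel version, but treat the exact constants as an assumption, and note the real nmi_enter() also bumps the hardirq count.

/*
 * Model only, not kernel code: shows why in_nmi() is true inside a kprobe
 * handler once exc_int3() wraps it in nmi_enter()/nmi_exit().
 */
#include <stdio.h>
#include <stdbool.h>

#define NMI_SHIFT   20				/* assumed layout, see preempt.h */
#define NMI_BITS    4
#define NMI_MASK    (((1U << NMI_BITS) - 1) << NMI_SHIFT)
#define NMI_OFFSET  (1U << NMI_SHIFT)

static unsigned int preempt_count;

static void nmi_enter(void) { preempt_count += NMI_OFFSET; }
static void nmi_exit(void)  { preempt_count -= NMI_OFFSET; }
static bool in_nmi(void)    { return preempt_count & NMI_MASK; }

static void kprobe_handler(void)
{
	/* trace_call_bpf() bails out here, so the bpf program never runs */
	printf("in kprobe handler: in_nmi() = %d\n", in_nmi());
}

int main(void)
{
	printf("before int3: in_nmi() = %d\n", in_nmi());
	nmi_enter();		/* what exc_int3() now does around the handler */
	kprobe_handler();
	nmi_exit();
	return 0;
}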
On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,
>
<snip>
>
> Yeah, there is. Nikolay, could you try this tentative patch?
I can confirm that with this patch everything is working. I also applied
the following diff:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..cc5a3a18816d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,8 +96,10 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
+	if (in_nmi()) /* not supported yet */ {
+		trace_dump_stack(0);
 		return 1;
+	}
 
 	cant_sleep();
And I can confirm that the branch is being hit and the following call
trace is produced:
=> __ftrace_trace_stack
=> trace_call_bpf
=> kprobe_perf_func
=> kprobe_int3_handler
=> exc_int3
=> asm_exc_int3
=> btrfs_sync_file
=> do_fsync
=> __x64_sys_fsync
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
>
> Of course this just drops the NMI check from the handler, so alternative
> checker is required. But I'm not sure what the original code concerns.
> As far as I can see, there seems no re-entrant block flag, nor locks
> among ebpf programs in runtime.
>
> Alexei, could you tell me what is the concerning situation for bpf?
>
> Thank you,
>
> From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
> From: Masami Hiramatsu <[email protected]>
> Date: Thu, 28 Jan 2021 12:31:02 +0900
> Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
>
> Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> changed the kprobe handler to run in NMI context, in_nmi() always returns
> true there. This means the bpf events on kprobes are always skipped.
FWIW I'd prefer if, in addition to the original commit, you also mention:
ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
b6be002bcd1d ("x86/entry: Move nmi entry/exit into common code")
Since they changed the way NMI state is managed in exc_int3 and not in
the original do_int3; the latter no longer contains any references to
NMI-related code.
>
> <snip>
On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,
<snip>
>
> Alexei, could you tell me what is the concerning situation for bpf?
Another data point, Masami, is that this affects bpf kprobes which are
entered via int3; if the kprobe is instead entered via
kprobe_ftrace_handler it works as expected. I haven't been able to
determine why a particular bpf probe won't use ftrace's infrastructure
even when it's put at the beginning of the function. An alternative call
chain is (a note on telling the two apart follows the chain):
=> __ftrace_trace_stack
=> trace_call_bpf
=> kprobe_perf_func
=> kprobe_ftrace_handler
=> 0xffffffffc095d0c8
=> btrfs_validate_metadata_buffer
=> end_bio_extent_readpage
=> end_workqueue_fn
=> btrfs_work_helper
=> process_one_work
=> worker_thread
=> kthread
=> ret_from_fork
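One way to tell which mechanism a given kprobe ended up using is to look at /sys/kernel/debug/kprobes/list, where ftrace-based probes are, as far as I know, marked with [FTRACE]. A small sketch of that check (the default symbol is just an example; assumes debugfs is mounted and the program runs as root):

/*
 * Sketch: print the kprobes debugfs listing for a given symbol so one can
 * see whether the probe is ftrace-based ([FTRACE]) or a plain int3 probe.
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *sym = argc > 1 ? argv[1] : "btrfs_sync_file";
	FILE *f = fopen("/sys/kernel/debug/kprobes/list", "r");
	char line[512];

	if (!f) {
		perror("kprobes/list");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (strstr(line, sym))
			fputs(line, stdout);	/* e.g. "... k btrfs_sync_file+0x0 [FTRACE]" */
	fclose(f);
	return 0;
}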
>
> Thank you,
>
> <snip>
>
On 28.01.21 г. 18:12 ч., Nikolay Borisov wrote:
>
>
> On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
>> Hi,
>
> <snip>
>
>>
>> Alexei, could you tell me what is the concerning situation for bpf?
>
> Another data point masami is that this affects bpf kprobes which are
> entered via int3, alternatively if the kprobe is entered via
> kprobe_ftrace_handler it works as expected. I haven't been able to
> determine why a particular bpf probe won't use ftrace's infrastructure
> if it's put at the beginning of the function. An alternative call chain
> is :
>
> => __ftrace_trace_stack
> => trace_call_bpf
> => kprobe_perf_func
> => kprobe_ftrace_handler
> => 0xffffffffc095d0c8
> => btrfs_validate_metadata_buffer
> => end_bio_extent_readpage
> => end_workqueue_fn
> => btrfs_work_helper
> => process_one_work
> => worker_thread
> => kthread
> => ret_from_fork
>
>>
I have a working theory why I'm seeing this. My (broken) kernel was
compiled with retpolines off and with the gcc that comes with Ubuntu
(both 9 and 10):
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
gcc-10 (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
This results in CET instrumentation (-fcf-protection) being enabled, so
functions look like:
0xffffffff81493890 <+0>: endbr64
0xffffffff81493894 <+4>: callq 0xffffffff8104d820 <__fentry__>
i.e. the __fentry__ call is not the first instruction of the function,
hence the probe is not going through the optimized ftrace handler.
Instead it's using int3, which is broken as ascertained.
After testing with my test case I can confirm that with CET off and
__fentry__ being the first instruction, bpf starts working. And indeed,
even with CET turned on, if I use a probe like:
bpftrace -e 'kprobe:btrfs_sync_file+4 {printf("kprobe: %s\n",
kstack());}' &>bpf-output &
it gets placed on the __fentry__ call (and not on endbr64), hence it
works. So perhaps a workaround outside of bpf could essentially detect
this scenario and adjust the probe to be on the __fentry__ call rather
than the preceding instruction, if that is detected to be endbr64?
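Something along these lines, as a rough userspace illustration of the check (the opcode bytes for endbr64 are f3 0f 1e fa; the helper name and the example bytes are made up for the sketch, this is not the kprobes API):

/*
 * Rough sketch of the proposed adjustment: if the byte pattern at the
 * requested probe address is endbr64, shift the probe offset past it so
 * the probe lands on the __fentry__ call. Illustrative only.
 */
#include <stdio.h>
#include <string.h>

#define ENDBR64_LEN 4
static const unsigned char endbr64_insn[ENDBR64_LEN] = { 0xf3, 0x0f, 0x1e, 0xfa };

/* Return the extra offset needed so a probe skips a leading endbr64. */
static unsigned int probe_offset_fixup(const unsigned char *entry, size_t len)
{
	if (len >= ENDBR64_LEN && !memcmp(entry, endbr64_insn, ENDBR64_LEN))
		return ENDBR64_LEN;	/* land on the __fentry__ call instead */
	return 0;			/* __fentry__ is already first */
}

int main(void)
{
	/* endbr64 followed by call __fentry__ (rel32 left as zeros here) */
	const unsigned char entry[] = { 0xf3, 0x0f, 0x1e, 0xfa, 0xe8, 0x00, 0x00, 0x00, 0x00 };

	printf("probe offset fixup: +%u\n", probe_offset_fixup(entry, sizeof(entry)));
	return 0;
}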
<snip>
On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> On 28.01.21 г. 18:12 ч., Nikolay Borisov wrote:
> > <snip>
>
> I have a working theory why I'm seeing this. My kernel (broken) was
> compiled with retpolines off and with the gcc that comes with ubuntu
> (both 9 and 10:
> gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> gcc-10 (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
> )
>
> this results in CFI being enabled so functions look like:
> 0xffffffff81493890 <+0>: endbr64
> 0xffffffff81493894 <+4>: callq 0xffffffff8104d820 <__fentry__>
>
> i.e fentry's thunk is not the first instruction on the function hence
> it's not going through the optimized ftrace handler. Instead it's using
> int3 which is broken as ascertained.
>
> After testing with my testcase I confirm that with cfi off and
> __fentry__ being the first entry bpf starts working. And indeed, even
> with CFI turned on if I use a probe like :
>
> bpftrace -e 'kprobe:btrfs_sync_file+4 {printf("kprobe: %s\n",
> kstack());}' &>bpf-output &
>
>
> it would be placed on the __fentry__ (and not endbr64) hence it works.
> So perhaps a workaround outside of bpf could essentially detect this
> scenario and adjust the probe to be on the __fentry__ and not preceding
> instruction if it's detected to be endbr64 ?
For now (and the foreseeable future), CET isn't enabled in the kernel.
So that endbr64 shouldn't be there in the first place. I can make a
proper patch in a bit.
diff --git a/Makefile b/Makefile
index e0af7a4a5598..5ccc4cdf1fb5 100644
--- a/Makefile
+++ b/Makefile
@@ -948,11 +948,8 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=designated-init)
# change __FILE__ to the relative path from the srctree
KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
-# ensure -fcf-protection is disabled when using retpoline as it is
-# incompatible with -mindirect-branch=thunk-extern
-ifdef CONFIG_RETPOLINE
+# Intel CET isn't enabled in the kernel
KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
-endif
# include additional Makefiles when needed
include-y := scripts/Makefile.extrawarn
On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> it would be placed on the __fentry__ (and not endbr64) hence it works.
> So perhaps a workaround outside of bpf could essentially detect this
> scenario and adjust the probe to be on the __fentry__ and not preceding
> instruction if it's detected to be endbr64 ?
Arguably the fentry handler should also set the NMI context; it can,
after all, interrupt pretty much any other context by construction.
With retpolines disabled, some configurations of GCC will add Intel CET
instrumentation to the kernel by default. That breaks certain tracing
scenarios by adding a superfluous ENDBR64 instruction before the fentry
call, for functions which can be called indirectly.
CET instrumentation isn't currently necessary in the kernel, as CET is
only supported in user space. Disable it unconditionally.
Reported-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
Makefile | 6 ------
arch/x86/Makefile | 3 +++
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/Makefile b/Makefile
index e0af7a4a5598..51c2bf34142d 100644
--- a/Makefile
+++ b/Makefile
@@ -948,12 +948,6 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=designated-init)
# change __FILE__ to the relative path from the srctree
KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
-# ensure -fcf-protection is disabled when using retpoline as it is
-# incompatible with -mindirect-branch=thunk-extern
-ifdef CONFIG_RETPOLINE
-KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
-endif
-
# include additional Makefiles when needed
include-y := scripts/Makefile.extrawarn
include-$(CONFIG_KASAN) += scripts/Makefile.kasan
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 32dcdddc1089..109c7f86483c 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -120,6 +120,9 @@ else
KBUILD_CFLAGS += -mno-red-zone
KBUILD_CFLAGS += -mcmodel=kernel
+
+ # Intel CET isn't enabled in the kernel
+ KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
endif
ifdef CONFIG_X86_X32
--
2.29.2
On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> > it would be placed on the __fentry__ (and not endbr64) hence it works.
> > So perhaps a workaround outside of bpf could essentially detect this
> > scenario and adjust the probe to be on the __fentry__ and not preceding
> > instruction if it's detected to be endbr64 ?
>
> Arguably the fentry handler should also set the nmi context, it can,
> after all, interrupt pretty much any other context by construction.
But that doesn't make it a real NMI.
An NMI can actually interrupt anything, whereas a kprobe via int3 cannot
hit the nokprobe and noinstr sections. The exposure is a lot different.
ftrace is even more contained: it's only at the start of functions, an
even smaller subset of places than kprobes.
So ftrace < kprobe < nmi.
Grouping them all into nmi doesn't make sense to me.
That bpf-breaking change went unnoticed mostly because people use
kprobes at the beginning of functions only, but there are cases
where kprobes are in the middle too. People just didn't upgrade kernels
fast enough to notice.
imo the appropriate solution would be to have some distinction between
ftrace < kprobe_via_int3 < nmi, so that the bpf side can react accordingly.
That code in trace_call_bpf:
if (in_nmi()) /* not supported yet */
was necessary in the past. Now we have all sorts of protections built in,
so it's probably ok to just drop it, but I think adding
called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
is the more appropriate solution long term.
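As a purely illustrative sketch of that distinction (none of these names exist in the kernel; they only stand in for the ftrace < kprobe_via_int3 < nmi idea above):

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical classification; placeholders only. */
enum trace_entry_ctx {
	TRACE_CTX_FTRACE,	/* fentry callback at the start of a function */
	TRACE_CTX_KPROBE_INT3,	/* kprobe via breakpoint, anywhere in the text */
	TRACE_CTX_NMI,		/* genuine asynchronous NMI */
};

/* bpf could then refuse only the genuinely asynchronous case. */
static bool bpf_ctx_supported(enum trace_entry_ctx ctx)
{
	return ctx != TRACE_CTX_NMI;
}

int main(void)
{
	printf("ftrace: %d, kprobe int3: %d, nmi: %d\n",
	       bpf_ctx_supported(TRACE_CTX_FTRACE),
	       bpf_ctx_supported(TRACE_CTX_KPROBE_INT3),
	       bpf_ctx_supported(TRACE_CTX_NMI));
	return 0;
}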
On 28.01.21 г. 23:52 ч., Josh Poimboeuf wrote:
>
> With retpolines disabled, some configurations of GCC will add Intel CET
> instrumentation to the kernel by default. That breaks certain tracing
> scenarios by adding a superfluous ENDBR64 instruction before the fentry
> call, for functions which can be called indirectly.
>
> CET instrumentation isn't currently necessary in the kernel, as CET is
> only supported in user space. Disable it unconditionally.
>
> Reported-by: Nikolay Borisov <[email protected]>
> Signed-off-by: Josh Poimboeuf <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
On 29.01.21 г. 3:34 ч., Alexei Starovoitov wrote:
> On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
>> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
>>> it would be placed on the __fentry__ (and not endbr64) hence it works.
>>> So perhaps a workaround outside of bpf could essentially detect this
>>> scenario and adjust the probe to be on the __fentry__ and not preceding
>>> instruction if it's detected to be endbr64 ?
>>
>> Arguably the fentry handler should also set the nmi context, it can,
>> after all, interrupt pretty much any other context by construction.
>
> But that doesn't make it a real nmi.
> nmi can actually interrupt anything. Whereas kprobe via int3 cannot
> do nokprobe and noinstr sections. The exposure is a lot different.
> ftrace is even more contained. It's only at the start of the functions.
> It's even smaller subset of places than kprobes.
> So ftrace < kprobe < nmi.
> Grouping them all into nmi doesn't make sense to me.
> That bpf breaking change came unnoticed mostly because people use
> kprobes in the beginning of the functions only, but there are cases
> where kprobes are in the middle too. People just didn't upgrade kernels
> fast enough to notice.
nit: slight correction - I observed this while I was putting kprobes at
the beginning of the function, but __fentry__ wasn't the first thing in
the function's code. The reason why people haven't observed it is that
everyone is running with retpolines enabled, which disables CFI/CET.
> imo appropriate solution would be to have some distinction between
> ftrace < kprobe_via_int3 < nmi, so that bpf side can react accordingly.
> That code in trace_call_bpf:
> if (in_nmi()) /* not supported yet */
> was necessary in the past. Now we have all sorts of protections built in.
> So it's probably ok to just drop it, but I think adding
> called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
> is more appropriate solution long term.
>
On Thu, Jan 28, 2021 at 03:52:19PM -0600, Josh Poimboeuf wrote:
>
> With retpolines disabled, some configurations of GCC will add Intel CET
> instrumentation to the kernel by default. That breaks certain tracing
> scenarios by adding a superfluous ENDBR64 instruction before the fentry
> call, for functions which can be called indirectly.
>
> CET instrumentation isn't currently necessary in the kernel, as CET is
> only supported in user space. Disable it unconditionally.
>
> Reported-by: Nikolay Borisov <[email protected]>
> Signed-off-by: Josh Poimboeuf <[email protected]>
> ---
> Makefile | 6 ------
> arch/x86/Makefile | 3 +++
> 2 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index e0af7a4a5598..51c2bf34142d 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -948,12 +948,6 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=designated-init)
> # change __FILE__ to the relative path from the srctree
> KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
>
> -# ensure -fcf-protection is disabled when using retpoline as it is
> -# incompatible with -mindirect-branch=thunk-extern
> -ifdef CONFIG_RETPOLINE
> -KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
> -endif
> -
Why is that even here, in the main Makefile if this cf-protection thing
is x86-specific?
Are we going to move it back there when some other arch gets CET or
CET-like support?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Jan 29, 2021 at 09:10:34AM -0600, Josh Poimboeuf wrote:
> Maybe eventually. But the enablement (actually enabling CET/CFI/etc)
> happens in the arch code anyway, right? So it could be a per-arch
> decision.
Right.
Ok, for this one, what about
Cc: <[email protected]>
?
What are "some configurations of GCC"? If it can be reproduced with
what's released out there, maybe that should go in now, even for 5.11?
Hmm?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Jan 29, 2021 at 10:59:52AM -0500, Steven Rostedt wrote:
> On Fri, 29 Jan 2021 22:40:11 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
> > > So what, they can all happen with random locks held. Marking them as NMI
> > > enables a whole bunch of sanity checks that are entirely appropriate.
> >
> > How about introducing an idea of Asynchronous NMI (ANMI) and Synchronous
> > NMI (SNMI)? kprobes and ftrace are synchronously called and can be controlled
> > (we can expect the context), but an ANMI may be caused by asynchronous
> > hardware events in any context.
> >
> > If we can distinguish those 2 NMIs on preempt count, bpf people can easily
> > avoid the inevitable situation.
>
> I don't like the NMI in the SNMI name, because they are not NMIs. They are
> actually more like kernel exceptions. Even a page fault in the kernel is
> similar to a kprobe breakpoint or ftrace: it can happen anywhere, with any
> lock held. Perhaps we need a kernel exception context, which by definition
> is synchronous?
What problem are you trying to solve? AFAICT all these contexts have the
same restrictions, why try and muck about with different names for the
same thing?
On Fri, Jan 29, 2021 at 05:30:48PM +0100, Borislav Petkov wrote:
> On Fri, Jan 29, 2021 at 09:10:34AM -0600, Josh Poimboeuf wrote:
> > Maybe eventually. But the enablement (actually enabling CET/CFI/etc)
> > happens in the arch code anyway, right? So it could be a per-arch
> > decision.
>
> Right.
>
> Ok, for this one, what about
>
> Cc: <[email protected]>
>
> ?
>
> What are "some configurations of GCC"? If it can be reproduced with
> what's released out there, maybe that should go in now, even for 5.11?
>
> Hmm?
Agreed, stable is a good idea. I think Nikolay saw it with GCC 9.
--
Josh
On 29.01.21 г. 18:49 ч., Josh Poimboeuf wrote:
> Agreed, stable is a good idea. I think Nikolay saw it with GCC 9.
Yes I did, with the default Ubuntu compiler as well as the default gcc-10 compiler:
# gcc -v -Q -O2 --help=target | grep protection
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
/usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v -imultiarch x86_64-linux-gnu help-dummy -dumpbase help-dummy -mtune=generic -march=x86-64 -auxbase help-dummy -O2 -version --help=target -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -o /tmp/ccSecttk.s
GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, MPC version 1.1.0, isl version isl-0.22.1-GMP
It has -fcf-protection turned on by default it seems.
On Fri, Jan 29, 2021 at 06:54:08PM +0200, Nikolay Borisov wrote:
>
>
> On 29.01.21 г. 18:49 ч., Josh Poimboeuf wrote:
> > Agreed, stable is a good idea. I think Nikolay saw it with GCC 9.
>
>
> Yes I did, with the default Ubuntu compiler as well as the default gcc-10 compiler:
>
> # gcc -v -Q -O2 --help=target | grep protection
>
> gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
> COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
> /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v -imultiarch x86_64-linux-gnu help-dummy -dumpbase help-dummy -mtune=generic -march=x86-64 -auxbase help-dummy -O2 -version --help=target -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -o /tmp/ccSecttk.s
> GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
> compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, MPC version 1.1.0, isl version isl-0.22.1-GMP
>
>
> It has -fcf-protection turned on by default it seems.
Yup, explains why I didn't see it:
gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
/usr/libexec/gcc/x86_64-redhat-linux/10/cc1 -v help-dummy -dumpbase help-dummy -mtune=generic -march=x86-64 -auxbase help-dummy -O2 -version --help=target -o /tmp/cclBz55H.s
--
Josh
On Fri, Jan 29, 2021 at 11:03:31AM -0600, Josh Poimboeuf wrote:
> On Fri, Jan 29, 2021 at 06:54:08PM +0200, Nikolay Borisov wrote:
> > <snip>
>
> Yup, explains why I didn't see it:
>
> gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
> COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
> /usr/libexec/gcc/x86_64-redhat-linux/10/cc1 -v help-dummy -dumpbase help-dummy -mtune=generic -march=x86-64 -auxbase help-dummy -O2 -version --help=target -o /tmp/cclBz55H.s
The fact that you triggered it with an Ubuntu gcc explains why the
original patch adding that switch:
29be86d7f9cb ("kbuild: add -fcf-protection=none when using retpoline flags")
came from Canonical.
Adding the author to Cc, FYI.
Seth, you can find this thread starting here:
https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Jan 29, 2021 at 8:24 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jan 29, 2021 at 10:59:52AM -0500, Steven Rostedt wrote:
> > On Fri, 29 Jan 2021 22:40:11 +0900
> > Masami Hiramatsu <[email protected]> wrote:
> >
> > > > So what, they can all happen with random locks held. Marking them as NMI
> > > > enables a whole bunch of sanity checks that are entirely appropriate.
> > >
> > > How about introducing an idea of Asynchronous NMI (ANMI) and Synchronous
> > > NMI (SNMI)? kprobes and ftrace are synchronously called and can be controlled
> > > (we can expect the context), but an ANMI may be caused by asynchronous
> > > hardware events in any context.
> > >
> > > If we can distinguish those 2 NMIs on preempt count, bpf people can easily
> > > avoid the inevitable situation.
> >
> > I don't like the NMI in the SNMI name, because they are not NMIs. They are
> > actually more like kernel exceptions. Even a page fault in the kernel is
> > similar to a kprobe breakpoint or ftrace: it can happen anywhere, with any
> > lock held. Perhaps we need a kernel exception context, which by definition
> > is synchronous?
I like the 'kernel exception' name. SNMI doesn't sound right; there is nothing
'non-maskable' here.
>
> What problem are you trying to solve? AFAICT all these contexts have the
> same restrictions, why try and muck about with different names for the
> same thing?
From the core kernel's perspective the difference between a 'kernel
exception' and a true NMI is huge:
this_cpu vs __this_cpu
static checks vs runtime checks
The same applies on the bpf side. We can statically prove safety for
ftrace and kprobe attaching, whereas to deal with the NMI situation we
have to use run-time checks for recursion prevention, etc.
On Fri, Jan 29, 2021 at 06:07:55PM +0100, Borislav Petkov wrote:
> On Fri, Jan 29, 2021 at 11:03:31AM -0600, Josh Poimboeuf wrote:
> > <snip>
>
> The fact that you triggered it with an Ubuntu gcc explains why the
> original patch adding that switch:
>
> 29be86d7f9cb ("kbuild: add -fcf-protection=none when using retpoline flags")
>
> came from a Canonical.
>
> Adding the author to Cc for FYI.
>
> Seth, you can find this thread starting here:
>
> https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
Thanks for the heads up. This still works fine for our needs.
Acked-by: Seth Forshee <[email protected]>
On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> Same things apply to bpf side. We can statically prove safety for
> ftrace and kprobe attaching whereas to deal with NMI situation we
> have to use run-time checks for recursion prevention, etc.
I have no idea what you're saying. You can attach to functions that are
called with random locks held, you can create kprobes in some very
sensitive places.
What can you statically prove about that?
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 20bf2b378729c4a0366a53e2018a0b70ace94bcd
Gitweb: https://git.kernel.org/tip/20bf2b378729c4a0366a53e2018a0b70ace94bcd
Author: Josh Poimboeuf <[email protected]>
AuthorDate: Thu, 28 Jan 2021 15:52:19 -06:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Fri, 29 Jan 2021 18:41:06 +01:00
x86/build: Disable CET instrumentation in the kernel
With retpolines disabled, some configurations of GCC, and specifically
the GCC versions 9 and 10 in Ubuntu will add Intel CET instrumentation
to the kernel by default. That breaks certain tracing scenarios by
adding a superfluous ENDBR64 instruction before the fentry call, for
functions which can be called indirectly.
CET instrumentation isn't currently necessary in the kernel, as CET is
only supported in user space. Disable it unconditionally and move it
into the x86's Makefile as CET/CFI... enablement should be a per-arch
decision anyway.
[ bp: Massage and extend commit message. ]
Fixes: 29be86d7f9cb ("kbuild: add -fcf-protection=none when using retpoline flags")
Reported-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Cc: <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Link: https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
---
Makefile | 6 ------
arch/x86/Makefile | 3 +++
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/Makefile b/Makefile
index e0af7a4..51c2bf3 100644
--- a/Makefile
+++ b/Makefile
@@ -948,12 +948,6 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=designated-init)
# change __FILE__ to the relative path from the srctree
KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
-# ensure -fcf-protection is disabled when using retpoline as it is
-# incompatible with -mindirect-branch=thunk-extern
-ifdef CONFIG_RETPOLINE
-KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
-endif
-
# include additional Makefiles when needed
include-y := scripts/Makefile.extrawarn
include-$(CONFIG_KASAN) += scripts/Makefile.kasan
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 7116da3..5857917 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -120,6 +120,9 @@ else
KBUILD_CFLAGS += -mno-red-zone
KBUILD_CFLAGS += -mcmodel=kernel
+
+ # Intel CET isn't enabled in the kernel
+ KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
endif
ifdef CONFIG_X86_X32
On Fri, 29 Jan 2021 18:59:43 +0100
Peter Zijlstra <[email protected]> wrote:
> On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > Same things apply to bpf side. We can statically prove safety for
> > ftrace and kprobe attaching whereas to deal with NMI situation we
> > have to use run-time checks for recursion prevention, etc.
>
> I have no idea what you're saying. You can attach to functions that are
> called with random locks held, you can create kprobes in some very
> sensitive places.
>
> What can you statically prove about that?
I think the main difference is, if you attach a kprobe or ftrace function,
you can theoretically analyze the location before you do the attachment.
Does the NMI context mean "in_nmi()" returns true? Because there are cases
in ftrace callbacks where that is checked (like the stack tracer). And
having ftrace return true for "in_nmi()" will break a lot of existing
utilities.
-- Steve
On Fri, Jan 29, 2021 at 02:01:03PM -0500, Steven Rostedt wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> I think the main difference is, if you attach a kprobe or ftrace function,
> you can theoretically analyze the location before you do the attachment.
Exactly.
When we're writing bpf helpers we need to carefully think about reentrance and NMI.
If the helper looks like:
int nokprobe notrace bpf_something(...)
{
// access variables A and B
}
The implementation can rely on the fact that even if the helper is reentrant,
the state of A and B will be consistent: either both got touched or none.
The only condition we have to worry about is NMI, because there A could be
modified without touching B.
If we write it as
int nokprobe bpf_something(...) { ... }
that would be another case.
Here we need to consider the case where a bpf prog can be attached to it via
the fentry nop, but there is no need to worry about instructions being split
in the middle because of a kprobe via int3.
Since we have the big "if (in_nmi()) goto unsupported;" check at the beginning,
we only need to worry about combinations of a kprobe at the start of the func,
a kprobe anywhere inside the func via int3, and ftrace at the start.
Not having to think of NMI helps a ton.
My earlier this_cpu vs __this_cpu comment is an example of that.
If in_nmi is filtered early it's one implementation; if NMI has to be handled
it's a completely different algorithm.
Now you've broken all this logic by making int3 be marked as 'in_nmi', and
bpf attached to a kprobe in the middle of a func is now broken.
Do people use that? Yeah, they do.
We have to fix it.
What were your reasons to make int3 in_nmi?
I've read the commit log, but I don't see in it the actual motivation
for int3 other than "it looks like NMI to me. Let's make it so".
The commit logs talk about cpu exceptions. I agree that #DB and #MC do behave
like NMI, but #BP does not really. My understanding is that it's used by
kprobes and text_poke_bp only.
If the motivation was to close some issue with text_poke_bp then, sure,
let's make the handling of text_poke_bp be treated as NMI.
But kprobe is not that.
I'm thinking either of the following solutions would be generic:
- introduce another state in the preempt flags, like "kernel exception"
- remove kprobe's int3 from in_nmi
As a bpf-specific alternative we can do:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..37cc549ad52e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,7 +96,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
+	if (in_nmi() && !kprobe_running()) /* not supported yet */
 		return 1;
but imo it's less clean than the first two options.
On Fri, 29 Jan 2021 14:01:03 -0500
Steven Rostedt <[email protected]> wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> I think the main difference is, if you attach a kprobe or ftrace function,
> you can theoretically analyze the location before you do the attachment.
>
> Does the NMI context mean "in_nmi()" returns true? Because there are cases
> in ftrace callbacks where that is checked (like the stack tracer). And
> having ftrace return true for "in_nmi()" will break a lot of existing
> utilities.
Specifically, kprobe and ftrace callbacks may have this:
	if (in_nmi())
		return;

	raw_spin_lock_irqsave(&lock, flags);
	[..]
	raw_spin_unlock_irqrestore(&lock, flags);
Which is totally fine to have, but the above only works if "in_nmi()"
returns true only if you are in a real NMI.
The stack tracer code does exactly the above.
-- Steve
On Fri, 29 Jan 2021 13:05:33 -0800
Alexei Starovoitov <[email protected]> wrote:
> On Fri, Jan 29, 2021 at 02:01:03PM -0500, Steven Rostedt wrote:
> > On Fri, 29 Jan 2021 18:59:43 +0100
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > > Same things apply to bpf side. We can statically prove safety for
> > > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > > have to use run-time checks for recursion prevention, etc.
> > >
> > > I have no idea what you're saying. You can attach to functions that are
> > > called with random locks held, you can create kprobes in some very
> > > sensitive places.
> > >
> > > What can you statically prove about that?
> >
> > I think the main difference is, if you attach a kprobe or ftrace function,
> > you can theoretically analyze the location before you do the attachment.
>
> Exactly.
> When we're writing bpf helpers we need to carefully think about reentrance and NMI.
> If the helper looks like:
> int nokprobe notrace bpf_something(...)
> {
> // access variables A and B
> }
>
> The implementation can rely on the fact that even if the helper is reentrant
> the state of A and B will be consistent. Either both got touched or none.
> Only NMI condition we have to worry about, because A could be modified
> without touching B.
> If we write it as
> int nokprobe bpf_something(...) { ... }
> that would be another case.
> Here we need to consider the case that bpf prog can be attached to it via fentry nop.
> But no need to worry about instructions split in the middle because of kprobe via int3.
> Since we have big "if (in_nmi()) goto unsupported;" check in the beginning we
> only need to worry about combinations of kprobe at the start of the func,
> kprobe anywhere inside the func via int3, and ftrace at the start.
> Not having to think of NMI helps a ton.
> My earlier this_cpu vs __this_cpu comment is an example of that.
> If in_nmi is filtered early it's one implementation. If nmi has to be handled
> it's completely different algorithm.
> Now you've broke all this logic by making int3 to be marked as 'in_nmi' and
> bpf in kprobe in the middle of the func are now broken.
> Do people use that? Yeah they do.
> We have to fix it.
> What were your reasons to make int3 in_nmi?
> I've read the commit log, but I don't see in it the actual motivation
> for int3 other than "it looks like NMI to me. Let's make it so".
> The commit logs talk about cpu exceptions. I agree that #DB and #MC do behave like NMI.
> But #BP is not really. My understanding it's used by kprobes and text_poke_bp only.
> If the motivation was to close some issue with text_poke_bp then, sure,
> let's make handling of text_poke_bp to be treated as nmi.
> But kprobe is not that.
> I'm thinking either of the following solutions would be generic:
> - introduce another state to preempt flags like "kernel exception"
I like this solution. Or, at the least, there should be a way to tell
whether the probed context is NMI or not.
(BTW, does the NMI have a specific stack area? If so, nmi_context(regs)
could be implemented.)
> - remove kprobe's int3 from in_nmi
> As bpf specific alternative we can do:
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..37cc549ad52e 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,7 +96,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
> {
> unsigned int ret;
>
> - if (in_nmi()) /* not supported yet */
> + if (in_nmi() && !kprobe_running()) /* not supported yet */
This doesn't make sense, because kprobe_running() is always true in the kprobe handler.
The problem is that in_nmi() checks whether the current context is NMI context,
but what you want to know is whether the context in which the kprobe was hit is
NMI context or not.
Thank you,
--
Masami Hiramatsu <[email protected]>
On Fri, 29 Jan 2021 18:59:43 +0100
Peter Zijlstra <[email protected]> wrote:
> On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > Same things apply to bpf side. We can statically prove safety for
> > ftrace and kprobe attaching whereas to deal with NMI situation we
> > have to use run-time checks for recursion prevention, etc.
>
> I have no idea what you're saying. You can attach to functions that are
> called with random locks held, you can create kprobes in some very
> sensitive places.
>
> What can you statically prove about that?
For bpf and the kprobe tracer, if a probe hits in NMI context,
the handler can be called while another handler is already processing events.
kprobes carefully avoids the deadlock by checking recursion
with a per-cpu variable. But if the handler is shared with other events,
like tracepoints, it needs its own recursion checker too.
So, Alexei, maybe you need something like this instead of the in_nmi() check:
DEFINE_PER_CPU(bool, under_running_bpf);

common_handler()
{
	if (__this_cpu_read(under_running_bpf))
		return;
	__this_cpu_write(under_running_bpf, true);
	/* execute bpf prog */
	__this_cpu_write(under_running_bpf, false);
}
Does this work for you?
Thank you,
--
Masami Hiramatsu <[email protected]>
On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> Specifically, kprobe and ftrace callbacks may have this:
>
> if (in_nmi())
> return;
>
> raw_spin_lock_irqsave(&lock, flags);
> [..]
> raw_spin_unlock_irqrestore(&lock, flags);
>
> Which is totally fine to have,
Why? There's a distinct lack of explaining here.
Note that we ripped out all such dodgy locking from kretprobes.
On Sat, Jan 30, 2021 at 11:02:49AM +0900, Masami Hiramatsu wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you staticlly prove about that?
>
> For the bpf and the kprobe tracer, if a probe hits in the NMI context,
> it can call the handler with another handler processing events.
>
> kprobes is carefully avoiding the deadlock by checking recursion
> with per-cpu variable. But if the handler is shared with the other events
> like tracepoints, it needs to its own recursion cheker too.
>
> So, Alexei, maybe you need something like this instead of in_nmi() check.
>
> DEFINE_PER_CPU(bool, under_running_bpf);
>
> common_handler()
> {
> if (__this_cpu_read(under_running_bpf))
> return;
> __this_cpu_write(under_running_bpf, true);
> /* execute bpf prog */
> __this_cpu_write(under_running_bpf, false);
> }
>
> Does this work for you?
This exact check is already in trace_call_bpf,
right after if (in_nmi()).
See bpf_prog_active. It serves a different purpose though.
Simply removing if (in_nmi()) from trace_call_bpf is a bit scary.
I need to analyze all code paths first.
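For reference, the guard in trace_call_bpf() currently looks roughly like
this (paraphrased from kernel/trace/bpf_trace.c, details elided):

	unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
	{
		unsigned int ret;

		if (in_nmi()) /* not supported yet */
			return 1;

		if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
			/*
			 * A bpf prog attached via this hook is already running
			 * on this CPU; bail out instead of recursing.
			 */
			ret = 0;
			goto out;
		}

		/* ... run the attached progs under rcu_read_lock() ... */

	 out:
		__this_cpu_dec(bpf_prog_active);
		return ret;
	}

So removing the in_nmi() check alone would still leave bpf_prog_active as a
per-cpu recursion guard, but the two checks protect against different things.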
On Fri, 29 Jan 2021 19:08:40 -0800
Alexei Starovoitov <[email protected]> wrote:
> On Sat, Jan 30, 2021 at 11:02:49AM +0900, Masami Hiramatsu wrote:
> > On Fri, 29 Jan 2021 18:59:43 +0100
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > > Same things apply to bpf side. We can statically prove safety for
> > > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > > have to use run-time checks for recursion prevention, etc.
> > >
> > > I have no idea what you're saying. You can attach to functions that are
> > > called with random locks held, you can create kprobes in some very
> > > sensitive places.
> > >
> > > What can you staticlly prove about that?
> >
> > For the bpf and the kprobe tracer, if a probe hits in the NMI context,
> > it can call the handler with another handler processing events.
> >
> > kprobes is carefully avoiding the deadlock by checking recursion
> > with per-cpu variable. But if the handler is shared with the other events
> > like tracepoints, it needs to its own recursion cheker too.
> >
> > So, Alexei, maybe you need something like this instead of in_nmi() check.
> >
> > DEFINE_PER_CPU(bool, under_running_bpf);
> >
> > common_handler()
> > {
> > if (__this_cpu_read(under_running_bpf))
> > return;
> > __this_cpu_write(under_running_bpf, true);
> > /* execute bpf prog */
> > __this_cpu_write(under_running_bpf, false);
> > }
> >
> > Does this work for you?
>
> This exactly check is already in trace_call_bpf.
> Right after if (in_nmi()).
> See bpf_prog_active. It serves different purpose though.
> Simply removing if (in_nmi()) from trace_call_bpf is a bit scary.
> I need to analyze all code paths first.
OK, if bpf already avoids its own recursion, the other case to consider is
that some resources are shared between the bpf_prog and other parts of the
kernel. Since an asynchronous NMI can occur anywhere, such resource usage can
conflict with the bpf_prog.
Kprobes had a similar issue, so I set a dummy kprobe in the per-cpu
current_kprobe to protect such critical sections.
See kprobe_busy_begin()/end() and where those are used.
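Roughly, those helpers look like this (simplified from kernel/kprobes.c; the
dummy kprobe makes a re-entering kprobe look like recursion and skip its
handler):

	static struct kprobe kprobe_busy = {
		.addr = (void *) get_kprobe,
	};

	void kprobe_busy_begin(void)
	{
		struct kprobe_ctlblk *kcb;

		preempt_disable();
		__this_cpu_write(current_kprobe, &kprobe_busy);
		kcb = get_kprobe_ctlblk();
		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
	}

	void kprobe_busy_end(void)
	{
		__this_cpu_write(current_kprobe, NULL);
		preempt_enable();
	}

and the critical section is just bracketed with kprobe_busy_begin() /
kprobe_busy_end().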
Thank you,
--
Masami Hiramatsu <[email protected]>
On Sat, 30 Jan 2021 09:28:32 +0100
Peter Zijlstra <[email protected]> wrote:
> On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> > Specifically, kprobe and ftrace callbacks may have this:
> >
> > if (in_nmi())
> > return;
> >
> > raw_spin_lock_irqsave(&lock, flags);
> > [..]
> > raw_spin_unlock_irqrestore(&lock, flags);
> >
> > Which is totally fine to have,
>
> Why? There's a distinct lack of explaining here.
>
> Note that we ripped out all such dodgy locking from kretprobes.
Actually, I think you helped explain the distinction. You mention
"kretprobes"; do you mean the infrastructure of kretprobes or all its
users?
The infrastructure of ftrace and kprobes can work in any context, but that
does not mean that the callbacks must. Again, these are more like
exceptions. Why have "in_nmi()" at all? Anything that can be called by an
NMI should just work, right? That's basically your argument for having
ftrace and kprobes set "in_nmi".
You can have locking in NMIs if the locking is *only* in NMI handlers,
right? If that's the case, then the same should hold for ftrace and kprobe
callbacks.
The stack tracer checks the size of the stack, compares it to the
largest recorded size, and if it's bigger, it will save the stack. But
if this happens on two CPUs at the same time, only one can do the
recording at a time. To synchronize this, a spin lock must be
taken, similar to spin locks in an NMI.
But the problem here is that the callbacks can also be called from NMI
context, so if we are in an NMI we don't want to take any locks, and we
simply don't record the stack traces from NMIs.
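(Paraphrasing the guard in kernel/trace/trace_stack.c, it is roughly:

	static void check_stack(unsigned long ip, unsigned long *stack)
	{
		...
		/* Can't do this from NMI context (can cause deadlocks) */
		if (in_nmi())
			return;

		local_irq_save(flags);
		arch_spin_lock(&stack_trace_max_lock);

		/* compare against, and record, the deepest stack seen so far */

		arch_spin_unlock(&stack_trace_max_lock);
		local_irq_restore(flags);
	}

so the lock is only ever taken with interrupts off and never from NMI.)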
The more I think about it, the more I hate the idea that ftrace
callbacks and kprobes are considered NMIs. Simply because they are not!
-- Steve
On Sat, Jan 30, 2021 at 07:44:10AM -0500, Steven Rostedt wrote:
> On Sat, 30 Jan 2021 09:28:32 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> > > Specifically, kprobe and ftrace callbacks may have this:
> > >
> > > if (in_nmi())
> > > return;
> > >
> > > raw_spin_lock_irqsave(&lock, flags);
> > > [..]
> > > raw_spin_unlock_irqrestore(&lock, flags);
> > >
> > > Which is totally fine to have,
> >
> > Why? There's a distinct lack of explaining here.
> >
> > Note that we ripped out all such dodgy locking from kretprobes.
>
> Actually, I think you helped explain the distinction. You mention
> "kretpobes" do you mean the infrastructure of kretprobes or all its
> users?
kretprobe infra.
> The infrastructure of ftrace and kprobes can work in any context, it
> does not mean that the callbacks must. Again, these are more like
> exceptions. Why have "in_nmi()"? If anything that can be called by an
> NMI should just work, right? That's basically your argument for having
> ftrace and kprobes set "in_nmi".
>
> You can have locking in NMIs if the locking is *only* in NMI handlers,
> right? If that's the case, then so should ftrace and kprobe callbacks.
Which is still dodgy as heck. NMIs _can_ nest. Mostly it doesn't
happen, and the few sites that do use spinlocks from NMI context are
careful to use them from a specific NMI 'handler' context which typically
doesn't nest.
But I still utterly detest the idea of using spinlocks from NMI. It's
inherently fragile.
> The stack tracer checks the size of the stack, compares it to the
> largest recorded size, and if it's bigger, it will save the stack. But
> if this happens on two CPUs at the same time, only one can do the
> recording at the same time. To synchronize this, a spin lock must be
> taken. Similar to spin locks in an NMI.
That sounds like something cmpxchg() should be able to do.
Have a per-cpu stack trace buffer and a global max one, when cpu local
exceeds previous max, cmpxchg the buffer.
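A minimal sketch of that idea (hypothetical code, names made up):

	/* global max stack size seen so far, updated locklessly */
	static unsigned long stack_trace_max_size;

	static void maybe_record(unsigned long size)
	{
		unsigned long old = READ_ONCE(stack_trace_max_size);

		while (size > old) {
			unsigned long prev;

			prev = cmpxchg(&stack_trace_max_size, old, size);
			if (prev == old) {
				/*
				 * We won the race: this CPU owns the new max
				 * and can publish its snapshot buffer here,
				 * e.g. by xchg()ing a pointer.
				 */
				break;
			}
			old = prev;	/* lost the race; re-check */
		}
	}

No locks, so it nests fine, including from NMI.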
> But the problem here is, the callbacks can also be done from an NMI
> context, so if we are in NMI, we don't want to take any locks, and
> simply don't record the stack traces from NMIs.
Which is obviously shit :-) The NMI might have interesting stack usage.
> The more I think about it, the more I hate the idea that ftrace
> callbacks and kprobes are considered NMIs. Simply because they are not!
Yet they happen when IRQs are off, so they are ;-)
Also, given how everything can nest, it had better all be lockless
anyway. You can get your regular function trace interrupted, which can
hit a #DB, which can function trace, which can hit a #BP, which can function
trace again, which can get an #NMI, etc. Many wonderful nestings possible.
And god knows what these handlers end up calling.
The only sane approach is treating it all as NMI and having it all
lockless.
On Tue, 2 Feb 2021 11:45:41 +0100
Peter Zijlstra <[email protected]> wrote:
> > The stack tracer checks the size of the stack, compares it to the
> > largest recorded size, and if it's bigger, it will save the stack. But
> > if this happens on two CPUs at the same time, only one can do the
> > recording at the same time. To synchronize this, a spin lock must be
> > taken. Similar to spin locks in an NMI.
>
> That sounds like something cmpxchg() should be able to do.
>
> Have a per-cpu stack trace buffer and a global max one, when cpu local
> exceeds previous max, cmpxchg the buffer.
>
> > But the problem here is, the callbacks can also be done from an NMI
> > context, so if we are in NMI, we don't want to take any locks, and
> > simply don't record the stack traces from NMIs.
>
> Which is obviously shit :-) The NMI might have interesting stack usage.
Actually, it only checks task stacks. It doesn't check IRQ stacks if
they are different from the task stack, because to do that properly it
must know the size of the stack. The tracer currently masks the stack
pointer with THREAD_SIZE to find the top of the stack.
As other stacks may not be THREAD_SIZE, that won't work. It has been on
my TODO list (for a long time) to add an arch-specific way to quickly find
the top of the stack.
>
> > The more I think about it, the more I hate the idea that ftrace
> > callbacks and kprobes are considered NMIs. Simply because they are not!
>
> Yet they happen when IRQs are off, so they are ;-)
But from a handler, you could do:

	if (in_nmi())
		return;

	local_irq_save(flags);
	/* Now you are safe from being re-entrant. */

Whereas there's no equivalent in an NMI handler. That's what makes
kprobe/ftrace handlers different from NMI handlers.
>
> Also, given how everything can nest, it had better all be lockless
> anyway. You can get your regular function trace interrupted, which can
> hit a #DB, which can function trace, which can #BP which can function
> trace again which can get #NMI etc.. Many wonderfun nestings possible.
I would call #DB and #BP handlers very special.
Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
has infrastructure to prevent recursion in the same context. That is, if a
ftrace handler calls something that gets traced, the recursion protection
will detect that and prevent the handler from being called again. But the
recursion protection is interrupt-context aware and lets the handler get
called again if the recursion happens from a different context:
func:
    call ftrace_caller

ftrace_caller:
    call ftrace_handler

ftrace_handler() {
    if (recursion_test())  <- false
        return;

    some_traced_func() {
        call ftrace_caller
            call ftrace_handler
                ftrace_handler() {
                    if (recursion_test())  <- true
                        return
                }
        <interrupt>
            func
                call ftrace_caller
                    call ftrace_handler
                        ftrace_handler() {
                            if (recursion_test())  <- false
                                return;
                            /* continue */
If #DB and #BP do not change the in_interrupt() context, then the above
still will protect the ftrace handlers from recursion due to them.
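A simplified sketch of that kind of context-aware recursion test (not the
actual ftrace code, which tracks more state, but the same idea):

	enum { CTX_NORMAL, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI };

	static DEFINE_PER_CPU(unsigned long, recursion_bits);

	static int current_ctx(void)
	{
		unsigned long pc = preempt_count();

		if (pc & NMI_MASK)
			return CTX_NMI;
		if (pc & HARDIRQ_MASK)
			return CTX_IRQ;
		if (pc & SOFTIRQ_OFFSET)
			return CTX_SOFTIRQ;
		return CTX_NORMAL;
	}

	static bool recursion_test(void)
	{
		int bit = current_ctx();

		if (__this_cpu_read(recursion_bits) & (1UL << bit))
			return true;	/* already in a handler in this context */
		__this_cpu_or(recursion_bits, 1UL << bit);
		return false;
	}

(the handler clears its bit again on the way out with __this_cpu_and()).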
>
> And god knows what these handlers end up calling.
>
> The only sane approach is treating it all as NMI and having it all
> lockless.
That would require refactoring all the code that's been around since 2008.
Worse yet, lockless is much more complex to get right, so this refactoring
will likely cause more bugs than it solves.
-- Steve
On Tue, Feb 02, 2021 at 09:52:49AM -0500, Steven Rostedt wrote:
> But from a handler, you could do:
>
> if (in_nmi())
> return;
> local_irq_save(flags);
> /* Now you are safe from being re-entrant. */
But that's an utter crap thing to do. That's like saying I don't care
about my events, at which point you might as well not bother at all.
And you can still do that, you just get less coverage today than you
used to. You used to throw things under the bus, now you throw more
under the bus. If you didn't care, I can't seem to find myself caring
either.
> Where as there's no equivalent in a NMI handler. That's what makes
> kprobe/ftrace handlers different than NMI handlers.
I don't see how.
> > Also, given how everything can nest, it had better all be lockless
> > anyway. You can get your regular function trace interrupted, which can
> > hit a #DB, which can function trace, which can #BP which can function
> > trace again which can get #NMI etc.. Many wonderfun nestings possible.
>
> I would call #DB an #BP handlers very special.
They are, just like NMI is special, which is why they're classed
together.
> Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
> has infrastructure to prevent recursion in the same context.
Sure we _could_ do that, but then we get into the 'fun' problem of
getting a breakpoint/int3 at random places and calling random code and
having deadlocks because they take the same lock.
There was very little that stopped that from happening.
> That is, a
> ftrace handler calls something that gets traced, the recursion protection
> will detect that and prevent the handler from being called again. But the
> recursion protection is interrupt context aware and lets the handler get
> called again if the recursion happens from a different context:
> If #DB and #BP do not change the in_interrupt() context, then the above
> still will protect the ftrace handlers from recursion due to them.
But it doesn't help with:
	spin_lock_irq(&foo);		// task context
	  #DB
	    spin_lock_irq(&foo);	// interrupt context per your above

All you need to do is put a breakpoint on a piece of code that holds a
spinlock and a handler that takes the same spinlock.
There was very little stopping that.
> That would require refactoring all the code that's been around since 2008.
Because I couldn't tell why/if any of that was correct at all. #DB/#BP
don't play by the normal rules. They're _far_ more NMI-like than they're
IRQ-like due to ignoring IF.
On Tue, 2 Feb 2021 17:45:47 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, Feb 02, 2021 at 09:52:49AM -0500, Steven Rostedt wrote:
>
> > But from a handler, you could do:
> >
> > if (in_nmi())
> > return;
> > local_irq_save(flags);
> > /* Now you are safe from being re-entrant. */
>
> But that's an utter crap thing to do. That's like saying I don't care
> about my events, at which point you might as well not bother at all.
>
> And you can still do that, you just get less coverage today than you
> used to. You used to throw things under the bus, now you throw more
> under the bus. If you didn't care, I can't seem to find myself caring
> either.
NMIs are special, and they always have been. They shouldn't be doing much
anyway. If they are, then that's a problem.
But if you want to make the stack tracer work in all contexts, I'm happy to
take patches. I don't have time to work on it today.
>
> > Where as there's no equivalent in a NMI handler. That's what makes
> > kprobe/ftrace handlers different than NMI handlers.
>
> I don't see how.
>
> > > Also, given how everything can nest, it had better all be lockless
> > > anyway. You can get your regular function trace interrupted, which can
> > > hit a #DB, which can function trace, which can #BP which can function
> > > trace again which can get #NMI etc.. Many wonderfun nestings possible.
> >
> > I would call #DB an #BP handlers very special.
>
> They are, just like NMI is special, which is why they're classed
> together.
>
> > Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
> > has infrastructure to prevent recursion in the same context.
>
> Sure we _could_ do that, but then we get into the 'fun' problem of
> getting a breakpoint/int3 at random places and calling random code and
> having deadlocks because they take the same lock.
My question wasn't to have them do it; I was simply asking whether they do. I
was assuming that they do not.
>
> There was very little that stopped that from happening.
>
> > That is, a
> > ftrace handler calls something that gets traced, the recursion protection
> > will detect that and prevent the handler from being called again. But the
> > recursion protection is interrupt context aware and lets the handler get
> > called again if the recursion happens from a different context:
>
> > If #DB and #BP do not change the in_interrupt() context, then the above
> > still will protect the ftrace handlers from recursion due to them.
>
> But it doesn't help with:
>
> spin_lock_irq(&foo); // task context
> #DB
> spin_lock_irq(&foo); // interrupt context per your above
The statement above said:
"If #DB and #BP do not change the in_interrupt() context"
which would make the above be in the same context, and the handler would
not be called for the #DB case.
>
> All you need to do is put a breakpoint on a piece of code that holds a
> spinlock and a handler that takes the same spinlock.
>
> There was very little from stopping that.
>
> > That would require refactoring all the code that's been around since 2008.
>
> Because I couldn't tell why/if any of that was correct at all. #DB/#BP
> don't play by the normal rules. They're _far_ more NMI-like than they're
> IRQ-like due to ignoring IF.
I'm fine with #DB and #BP being "in_nmi()", as they are probably even
more special than NMIs.
-- Steve
On Tue, Feb 02, 2021 at 11:56:23AM -0500, Steven Rostedt wrote:
> NMIs are special, and they always have been. They shouldn't be doing much
> anyway. If they are, then that's a problem.
There is a fair amount of NMI level code these days, and it seems to be
ever increasing...
> My question wasn't to have them do it, I was simply asking if they do. I
> was assuming that they do not.
per nmi_enter() we do:
__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \
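and the context checks key off those same bits; expanding the helpers in
include/linux/preempt.h, roughly:

	#define in_nmi()	(preempt_count() & NMI_MASK)
	#define in_interrupt()	(preempt_count() & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK))

so after nmi_enter() both in_nmi() and in_interrupt() are true.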
> > But it doesn't help with:
> >
> > spin_lock_irq(&foo); // task context
> > #DB
> > spin_lock_irq(&foo); // interrupt context per your above
>
> The statement above said:
>
> "If #DB and #BP do not change the in_interrupt() context"
>
> Which would make the above be in the same context and the handler would
> not be called for the #DB case.
But then replace the above #DB with __fentry__ and explain how it is
fundamentally different? And consider that direct-call-into-god-knows-what-code
option you have. That doesn't go past any recursion checks IIRC.
> I'm fine with #DB and #BP being a "in_nmi()", as they are probably even
> more special than NMIs.
That does mean that kprobes are then fundamentally running from
in_nmi(), which is what started all this.
Sure, the opt-probes and ftrace-probes don't actually have in_nmi() set
today (because they don't trigger an exception), but given that that is
all optional, any kprobe handler had better be in_nmi() clean.
On Tue, 2 Feb 2021 19:30:34 +0100
Peter Zijlstra <[email protected]> wrote:
> That does mean that kprobes are then fundamentally running from
> in_nmi(), which is what started all this.
I just thought about the fact that tracing records the context of the
function it is called in. If you set "in_nmi()" for all ftrace handlers,
then all functions will look like they are in an NMI context during tracing.
That is, the preempt count is checked to fill in the flags in the ring
buffer that denote what context the event (in this case the function) was
called in.
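(Roughly, paraphrasing how those entry flags get filled in from the preempt
count in the tracing code:

	entry->flags =
		((pc & NMI_MASK)       ? TRACE_FLAG_NMI     : 0) |
		((pc & HARDIRQ_MASK)   ? TRACE_FLAG_HARDIRQ : 0) |
		((pc & SOFTIRQ_OFFSET) ? TRACE_FLAG_SOFTIRQ : 0) |
		...;

so if the handler itself bumps the NMI bits, every event looks like it came
from NMI context.)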
-- Steve
On Tue, 2 Feb 2021 16:05:13 -0500
Steven Rostedt <[email protected]> wrote:
> On Tue, 2 Feb 2021 19:30:34 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > That does mean that kprobes are then fundamentally running from
> > in_nmi(), which is what started all this.
>
> I just thought about the fact that tracing records the context of the
> function it is called in. If you set "in_nmi()" for all ftrace handlers,
> then all functions will look like they are in an NMI context during tracing.
>
> That is, the preempt count is checked to fill in the flags in the ring
> buffer that denotes what context the event (in this case the function) was
> called in.
Ah, that is what I worried about. ftrace and kprobes handlers usually want to
know "what is the actual status of the system where the probe hits".
If a new kernel_exception_enter() for ftrace/kprobes or any other kernel
instrumentation did

	__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);

then we could distinguish the KEX from a real NMI, and still get the original
status of the context, e.g. something like the sketch below.
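(Completely hypothetical sketch; KEX_OFFSET / KEX_MASK do not exist today,
the names are made up for illustration:)

	#define kernel_exception_enter()				\
	do {								\
		__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET); \
	} while (0)

	#define kernel_exception_exit()					\
	do {								\
		__preempt_count_sub(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET); \
	} while (0)

	#define in_kernel_exception()	(preempt_count() & KEX_MASK)

A handler could then tell "probe hit in a real NMI" from "NMI-ness only added
by the instrumentation itself" by checking KEX_MASK before trusting NMI_MASK.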
What would you think about?
Thank you,
--
Masami Hiramatsu <[email protected]>
On Wed, 3 Feb 2021 22:33:28 +0900
Masami Hiramatsu <[email protected]> wrote:
> Ah, that is what I worried about. ftrace and kprobes handler usually want to
> know "what is the actual status of the system where the probe hits".
>
> If the new kernel_exception_enter() for ftrace/kprobes or any other kernel
> instrumention does
>
> __preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);
>
> And we can distinguish the KEX from NMI, and get the original status of the context.
> What would you think about?
Oh, that reminds me of the obvious difference between an NMI and a
ftrace handler. A ftrace handler doesn't disable interrupts or
preemption. Thus, if you set "in_nmi" for all ftrace handlers, and an
interrupt (or NMI) comes in, then any ftrace handlers called from that
interrupt / NMI will be ignored, since they will think they are recursing
from NMI context.
-- Steve