2020-02-25 23:26:28

by Thomas Gleixner

Subject: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

Now that the C entry points are safe, move the irq flags tracing code into
the entry helper.

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/entry/common.c | 5 +++++
arch/x86/entry/entry_32.S | 12 ------------
arch/x86/entry/entry_64.S | 2 --
arch/x86/entry/entry_64_compat.S | 18 ------------------
4 files changed, 5 insertions(+), 32 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -57,6 +57,11 @@ static inline void enter_from_user_mode(
*/
static __always_inline void syscall_entry_fixups(void)
{
+ /*
+ * Usermode is traced as interrupts enabled, but the syscall entry
+ * mechanisms disable interrupts. Tell the tracer.
+ */
+ trace_hardirqs_off();
enter_from_user_mode();
local_irq_enable();
}
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -960,12 +960,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
jnz .Lsysenter_fix_flags
.Lsysenter_flags_fixed:

- /*
- * User mode is traced as though IRQs are on, and SYSENTER
- * turned them off.
- */
- TRACE_IRQS_OFF
-
movl %esp, %eax
call do_fast_syscall_32
/* XEN PV guests always use IRET path */
@@ -1075,12 +1069,6 @@ SYM_FUNC_START(entry_INT80_32)

SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1 /* save rest */

- /*
- * User mode is traced as though IRQs are on, and the interrupt gate
- * turned them off.
- */
- TRACE_IRQS_OFF
-
movl %esp, %eax
call do_int80_syscall_32
.Lsyscall_32_done:
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -167,8 +167,6 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_h

PUSH_AND_CLEAR_REGS rax=$-ENOSYS

- TRACE_IRQS_OFF
-
/* IRQs are off. */
movq %rax, %rdi
movq %rsp, %rsi
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -129,12 +129,6 @@ SYM_FUNC_START(entry_SYSENTER_compat)
jnz .Lsysenter_fix_flags
.Lsysenter_flags_fixed:

- /*
- * User mode is traced as though IRQs are on, and SYSENTER
- * turned them off.
- */
- TRACE_IRQS_OFF
-
movq %rsp, %rdi
call do_fast_syscall_32
/* XEN PV guests always use IRET path */
@@ -247,12 +241,6 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_aft
pushq $0 /* pt_regs->r15 = 0 */
xorl %r15d, %r15d /* nospec r15 */

- /*
- * User mode is traced as though IRQs are on, and SYSENTER
- * turned them off.
- */
- TRACE_IRQS_OFF
-
movq %rsp, %rdi
call do_fast_syscall_32
/* XEN PV guests always use IRET path */
@@ -403,12 +391,6 @@ SYM_CODE_START(entry_INT80_compat)
xorl %r15d, %r15d /* nospec r15 */
cld

- /*
- * User mode is traced as though IRQs are on, and the interrupt
- * gate turned them off.
- */
- TRACE_IRQS_OFF
-
movq %rsp, %rdi
call do_int80_syscall_32
.Lsyscall_32_done:


2020-02-26 05:44:13

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On 2/25/20 2:08 PM, Thomas Gleixner wrote:
> Now that the C entry points are safe, move the irq flags tracing code into
> the entry helper.

I'm so confused.

>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> arch/x86/entry/common.c | 5 +++++
> arch/x86/entry/entry_32.S | 12 ------------
> arch/x86/entry/entry_64.S | 2 --
> arch/x86/entry/entry_64_compat.S | 18 ------------------
> 4 files changed, 5 insertions(+), 32 deletions(-)
>
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -57,6 +57,11 @@ static inline void enter_from_user_mode(
> */
> static __always_inline void syscall_entry_fixups(void)
> {
> + /*
> + * Usermode is traced as interrupts enabled, but the syscall entry
> + * mechanisms disable interrupts. Tell the tracer.
> + */
> + trace_hardirqs_off();

Your earlier patches suggest quite strongly that tracing isn't safe
until enter_from_user_mode(). But trace_hardirqs_off() calls
trace_irq_disable_rcuidle(), which looks [0] like a tracepoint.

Did you perhaps mean to do this *after* enter_from_user_mode()?

(My tracepoint-fu is pretty bad, and I can't actually find where it's
defined as a tracepoint. But then it does nothing at all, so it must be
a tracepoint, right?)

> enter_from_user_mode();
> local_irq_enable();
> }

2020-02-26 08:18:46

by Peter Zijlstra

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Tue, Feb 25, 2020 at 09:43:46PM -0800, Andy Lutomirski wrote:
> On 2/25/20 2:08 PM, Thomas Gleixner wrote:
> > Now that the C entry points are safe, move the irq flags tracing code into
> > the entry helper.
>
> I'm so confused.
>
> >
> > Signed-off-by: Thomas Gleixner <[email protected]>
> > ---
> > arch/x86/entry/common.c | 5 +++++
> > arch/x86/entry/entry_32.S | 12 ------------
> > arch/x86/entry/entry_64.S | 2 --
> > arch/x86/entry/entry_64_compat.S | 18 ------------------
> > 4 files changed, 5 insertions(+), 32 deletions(-)
> >
> > --- a/arch/x86/entry/common.c
> > +++ b/arch/x86/entry/common.c
> > @@ -57,6 +57,11 @@ static inline void enter_from_user_mode(
> > */
> > static __always_inline void syscall_entry_fixups(void)
> > {
> > + /*
> > + * Usermode is traced as interrupts enabled, but the syscall entry
> > + * mechanisms disable interrupts. Tell the tracer.
> > + */
> > + trace_hardirqs_off();
>
> Your earlier patches suggest quite strongly that tracing isn't safe
> until enter_from_user_mode(). But trace_hardirqs_off() calls
> trace_irq_disable_rcuidle(), which looks [0] like a tracepoint.
>
> Did you perhaps mean to do this *after* enter_from_user_mode()?

aside from the fact that enter_from_user_mode() itself also has a
tracepoint, the crucial detail is that we must not trace/kprobe the
function calling this.

Specifically for #PF, because we need read_cr2() before this. See later
patches.

2020-02-26 11:20:52

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code



> On Feb 26, 2020, at 12:17 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Feb 25, 2020 at 09:43:46PM -0800, Andy Lutomirski wrote:
>>> On 2/25/20 2:08 PM, Thomas Gleixner wrote:
>>> Now that the C entry points are safe, move the irq flags tracing code into
>>> the entry helper.
>>
>> I'm so confused.
>>
>>>
>>> Signed-off-by: Thomas Gleixner <[email protected]>
>>> ---
>>> arch/x86/entry/common.c | 5 +++++
>>> arch/x86/entry/entry_32.S | 12 ------------
>>> arch/x86/entry/entry_64.S | 2 --
>>> arch/x86/entry/entry_64_compat.S | 18 ------------------
>>> 4 files changed, 5 insertions(+), 32 deletions(-)
>>>
>>> --- a/arch/x86/entry/common.c
>>> +++ b/arch/x86/entry/common.c
>>> @@ -57,6 +57,11 @@ static inline void enter_from_user_mode(
>>> */
>>> static __always_inline void syscall_entry_fixups(void)
>>> {
>>> + /*
>>> + * Usermode is traced as interrupts enabled, but the syscall entry
>>> + * mechanisms disable interrupts. Tell the tracer.
>>> + */
>>> + trace_hardirqs_off();
>>
>> Your earlier patches suggest quite strongly that tracing isn't safe
>> until enter_from_user_mode(). But trace_hardirqs_off() calls
>> trace_irq_disable_rcuidle(), which looks [0] like a tracepoint.
>>
>> Did you perhaps mean to do this *after* enter_from_user_mode()?
>
> aside from the fact that enter_from_user_mode() itself also has a
> tracepoint, the crucial detail is that we must not trace/kprobe the
> function calling this.
>
> Specifically for #PF, because we need read_cr2() before this. See later
> patches.

Indeed. I’m fine with this patch, but I still don’t understand what the changelog is about. And I’m still rather baffled by most of the notrace annotations in the series.

2020-02-26 19:53:12

by Thomas Gleixner

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

Andy Lutomirski <[email protected]> writes:
>> On Feb 26, 2020, at 12:17 AM, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Feb 25, 2020 at 09:43:46PM -0800, Andy Lutomirski wrote:
>>> Your earlier patches suggest quite strongly that tracing isn't safe
>>> until enter_from_user_mode(). But trace_hardirqs_off() calls
>>> trace_irq_disable_rcuidle(), which looks [0] like a tracepoint.
>>>
>>> Did you perhaps mean to do this *after* enter_from_user_mode()?
>>
>> aside from the fact that enter_from_user_mode() itself also has a
>> tracepoint, the crucial detail is that we must not trace/kprobe the
>> function calling this.
>>
>> Specifically for #PF, because we need read_cr2() before this. See later
>> patches.
>
> Indeed. I’m fine with this patch, but I still don’t understand what
> the changelog is about.

Yeah, the changelog is not really helpful. Let me fix that.

> And I’m still rather baffled by most of the notrace annotations in the
> series.

As discussed on IRC, this might be too broad, but I would rather have the
actual C entry points neither traceable nor probeable in general, and then
relax this by having them call functions which can be traced and probed.

My rationale for this decision was that enter_from_user_mode() is marked
notrace/noprobe as well, so I kept the protection scope the same as we
had in the ASM maze which is marked noprobe already.

Thanks,

tglx

2020-02-27 23:12:18

by Frederic Weisbecker

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Tue, Feb 25, 2020 at 11:08:05PM +0100, Thomas Gleixner wrote:
> Now that the C entry points are safe, move the irq flags tracing code into
> the entry helper.
>
> Signed-off-by: Thomas Gleixner <[email protected]>

Reviewed-by: Frederic Weisbecker <[email protected]>

2020-02-28 09:02:30

by Alexandre Chartre

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code


On 2/25/20 11:08 PM, Thomas Gleixner wrote:
> Now that the C entry points are safe, move the irq flags tracing code into
> the entry helper.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> arch/x86/entry/common.c | 5 +++++
> arch/x86/entry/entry_32.S | 12 ------------
> arch/x86/entry/entry_64.S | 2 --
> arch/x86/entry/entry_64_compat.S | 18 ------------------
> 4 files changed, 5 insertions(+), 32 deletions(-)
>

Reviewed-by: Alexandre Chartre <[email protected]>

alex.

2020-02-29 14:45:19

by Thomas Gleixner

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

Thomas Gleixner <[email protected]> writes:
> Andy Lutomirski <[email protected]> writes:
>>> On Feb 26, 2020, at 12:17 AM, Peter Zijlstra <[email protected]> wrote:
>>> On Tue, Feb 25, 2020 at 09:43:46PM -0800, Andy Lutomirski wrote:
>>>> Your earlier patches suggest quite strongly that tracing isn't safe
>>>> until enter_from_user_mode(). But trace_hardirqs_off() calls
>>>> trace_irq_disable_rcuidle(), which looks [0] like a tracepoint.
>>>>
>>>> Did you perhaps mean to do this *after* enter_from_user_mode()?
>>>
>>> aside from the fact that enter_from_user_mode() itself also has a
>>> tracepoint, the crucial detail is that we must not trace/kprobe the
>>> function calling this.
>>>
>>> Specifically for #PF, because we need read_cr2() before this. See later
>>> patches.
>>
>> Indeed. I’m fine with this patch, but I still don’t understand what
>> the changelog is about.
>
> Yeah, the changelog is not really helpful. Let me fix that.
>
>> And I’m still rather baffled by most of the notrace annotations in the
>> series.
>
> As discussed on IRC, this might be too broad, but then I rather have the
> actual C-entry points neither traceable nor probable in general and
> relax this by calling functions which can be traced and probed.
>
> My rationale for this decision was that enter_from_user_mode() is marked
> notrace/noprobe as well, so I kept the protection scope the same as we
> had in the ASM maze which is marked noprobe already.

I have second thoughts vs. tracing in this context.

While the tracer itself seems to handle this correctly, what about
things like BPF programs which can be attached to tracepoints and
function trace entries?

Is that really safe _before_ context tracking has updated RCU state?

Thanks,

tglx


2020-02-29 19:35:49

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code



> On Feb 29, 2020, at 6:44 AM, Thomas Gleixner <[email protected]> wrote:
>
> Thomas Gleixner <[email protected]> writes:
>> Andy Lutomirski <[email protected]> writes:
>>>>> On Feb 26, 2020, at 12:17 AM, Peter Zijlstra <[email protected]> wrote:
>>>>> On Tue, Feb 25, 2020 at 09:43:46PM -0800, Andy Lutomirski wrote:
>>>>>> Your earlier patches suggest quite strongly that tracing isn't safe
>>>>>> until enter_from_user_mode(). But trace_hardirqs_off() calls
>>>>>> trace_irq_disable_rcuidle(), which looks [0] like a tracepoint.
>>>>>>
>>>>>> Did you perhaps mean to do this *after* enter_from_user_mode()?
>>>>>
>>>>> aside from the fact that enter_from_user_mode() itself also has a
>>>>> tracepoint, the crucial detail is that we must not trace/kprobe the
>>>>> function calling this.
>>>>>
>>>>> Specifically for #PF, because we need read_cr2() before this. See later
>>>>> patches.
>>>
>>> Indeed. I’m fine with this patch, but I still don’t understand what
>>> the changelog is about.
>>
>> Yeah, the changelog is not really helpful. Let me fix that.
>>
>>> And I’m still rather baffled by most of the notrace annotations in the
>>> series.
>>
>> As discussed on IRC, this might be too broad, but then I rather have the
>> actual C-entry points neither traceable nor probable in general and
>> relax this by calling functions which can be traced and probed.
>>
>> My rationale for this decision was that enter_from_user_mode() is marked
>> notrace/noprobe as well, so I kept the protection scope the same as we
>> had in the ASM maze which is marked noprobe already.
>
> I have second thoughts vs. tracing in this context.
>
> While the tracer itself seems to handle this correctly, what about
> things like BPF programs which can be attached to tracepoints and
> function trace entries?

I think that everything using the tracing code, including BPF, should either do its own rcuidle stuff or explicitly not execute if we’re not in CONTEXT_KERNEL. That is, we probably need to patch BPF.

>
> Is that really safe _before_ context tracking has updated RCU state?
>
> Thanks,
>
> tglx
>
>

2020-02-29 23:59:50

by Steven Rostedt

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sat, 29 Feb 2020 11:25:24 -0800
Andy Lutomirski <[email protected]> wrote:

> > While the tracer itself seems to handle this correctly, what about
> > things like BPF programs which can be attached to tracepoints and
> > function trace entries?
>
> I think that everything using the tracing code, including BPF, should
> either do its own rcuidle stuff or explicitly not execute if we’re
> not in CONTEXT_KERNEL. That is, we probably need to patch BPF.

That's basically the route we are taking.

-- Steve

2020-03-01 10:17:09

by Thomas Gleixner

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

Steven Rostedt <[email protected]> writes:

> On Sat, 29 Feb 2020 11:25:24 -0800
> Andy Lutomirski <[email protected]> wrote:
>
>> > While the tracer itself seems to handle this correctly, what about
>> > things like BPF programs which can be attached to tracepoints and
>> > function trace entries?
>>
>> I think that everything using the tracing code, including BPF, should
>> either do its own rcuidle stuff or explicitly not execute if we’re
>> not in CONTEXT_KERNEL. That is, we probably need to patch BPF.
>
> That's basically the route we are taking.

Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
except trace_hardirqs_off/on(), as those trace functions do not allow
attaching anything AFAICT.

Thanks,

tglx

2020-03-01 14:39:10

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code



> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
>
> Steven Rostedt <[email protected]> writes:
>
>> On Sat, 29 Feb 2020 11:25:24 -0800
>> Andy Lutomirski <[email protected]> wrote:
>>
>>>> While the tracer itself seems to handle this correctly, what about
>>>> things like BPF programs which can be attached to tracepoints and
>>>> function trace entries?
>>>
>>> I think that everything using the tracing code, including BPF, should
>>> either do its own rcuidle stuff or explicitly not execute if we’re
>>> not in CONTEXT_KERNEL. That is, we probably need to patch BPF.
>>
>> That's basically the route we are taking.
>
> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> except trace_hardirq_off/on() as those trace functions do not allow to
> attach anything AFAICT.

Can you point to whatever makes those particular functions special? I failed to follow the macro maze.

>
> Thanks,
>
> tglx

2020-03-01 15:22:01

by Thomas Gleixner

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

Andy Lutomirski <[email protected]> writes:
>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
>> except trace_hardirq_off/on() as those trace functions do not allow to
>> attach anything AFAICT.
>
> Can you point to whatever makes those particular functions special? I
> failed to follow the macro maze.

Those are not tracepoints and not going through the macro maze. See
kernel/trace/trace_preemptirq.c

Thanks,

tglx

2020-03-01 16:01:37

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
>
> Andy Lutomirski <[email protected]> writes:
> >> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> >> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> >> except trace_hardirq_off/on() as those trace functions do not allow to
> >> attach anything AFAICT.
> >
> > Can you point to whatever makes those particular functions special? I
> > failed to follow the macro maze.
>
> Those are not tracepoints and not going through the macro maze. See
> kernel/trace/trace_preemptirq.c

That has:

void trace_hardirqs_on(void)
{
	if (this_cpu_read(tracing_irq_cpu)) {
		if (!in_nmi())
			trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
		tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
		this_cpu_write(tracing_irq_cpu, 0);
	}

	lockdep_hardirqs_on(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_on);
NOKPROBE_SYMBOL(trace_hardirqs_on);

But this calls trace_irq_enable_rcuidle(), and that's the part of the
macro maze I got lost in. I found:

#ifdef CONFIG_TRACE_IRQFLAGS
DEFINE_EVENT(preemptirq_template, irq_disable,
	     TP_PROTO(unsigned long ip, unsigned long parent_ip),
	     TP_ARGS(ip, parent_ip));

DEFINE_EVENT(preemptirq_template, irq_enable,
	     TP_PROTO(unsigned long ip, unsigned long parent_ip),
	     TP_ARGS(ip, parent_ip));
#else
#define trace_irq_enable(...)
#define trace_irq_disable(...)
#define trace_irq_enable_rcuidle(...)
#define trace_irq_disable_rcuidle(...)
#endif

But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
where I got lost in the macro maze. I looked at the gcc asm output,
and there is, indeed:

# ./include/trace/events/preemptirq.h:40:
DEFINE_EVENT(preemptirq_template, irq_enable,

with a bunch of asm magic that looks like it's probably a tracepoint.
I still don't quite see where the "_rcuidle" went.

But I also don't see why this is any different from any other tracepoint.

--Andy

2020-03-01 18:13:55

by Thomas Gleixner

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

Andy Lutomirski <[email protected]> writes:
> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>> >> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
>> >> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
>> >> except trace_hardirq_off/on() as those trace functions do not allow to
>> >> attach anything AFAICT.
>> >
>> > Can you point to whatever makes those particular functions special? I
>> > failed to follow the macro maze.
>>
>> Those are not tracepoints and not going through the macro maze. See
>> kernel/trace/trace_preemptirq.c
>
> That has:
>
> void trace_hardirqs_on(void)
> {
> if (this_cpu_read(tracing_irq_cpu)) {
> if (!in_nmi())
> trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> this_cpu_write(tracing_irq_cpu, 0);
> }
>
> lockdep_hardirqs_on(CALLER_ADDR0);
> }
> EXPORT_SYMBOL(trace_hardirqs_on);
> NOKPROBE_SYMBOL(trace_hardirqs_on);
>
> But this calls trace_irq_enable_rcuidle(), and that's the part of the
> macro maze I got lost in. I found:
>
> #ifdef CONFIG_TRACE_IRQFLAGS
> DEFINE_EVENT(preemptirq_template, irq_disable,
> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> TP_ARGS(ip, parent_ip));
>
> DEFINE_EVENT(preemptirq_template, irq_enable,
> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> TP_ARGS(ip, parent_ip));
> #else
> #define trace_irq_enable(...)
> #define trace_irq_disable(...)
> #define trace_irq_enable_rcuidle(...)
> #define trace_irq_disable_rcuidle(...)
> #endif
>
> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> where I got lost in the macro maze. I looked at the gcc asm output,
> and there is, indeed:

DEFINE_EVENT
DECLARE_TRACE
__DECLARE_TRACE
__DECLARE_TRACE_RCU
static inline void trace_##name##_rcuidle(proto)
__DO_TRACE
if (rcuidle)
....

> But I also don't see why this is any different from any other tracepoint.

Indeed. I took a wrong turn at some point in the macro jungle :)

So tracing itself is fine, but if you have probes or bpf programs
attached to a tracepoint, these use rcu_read_lock()/unlock(), which is
obviously wrong in rcuidle context.

Thanks,

tglx

2020-03-01 18:21:05

by Steven Rostedt

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, 01 Mar 2020 16:21:15 +0100
Thomas Gleixner <[email protected]> wrote:

> Andy Lutomirski <[email protected]> writes:
> >> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> >> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> >> except trace_hardirq_off/on() as those trace functions do not allow to
> >> attach anything AFAICT.
> >
> > Can you point to whatever makes those particular functions special? I
> > failed to follow the macro maze.
>
> Those are not tracepoints and not going through the macro maze. See
> kernel/trace/trace_preemptirq.c

For the latency tracing, they do call into the tracing infrastructure,
not just lockdep. And Joel Fernandez did try to make these into trace
events as well.

-- Steve

2020-03-01 18:24:30

by Steven Rostedt

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, 1 Mar 2020 08:00:01 -0800
Andy Lutomirski <[email protected]> wrote:

> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> where I got lost in the macro maze. I looked at the gcc asm output,
> and there is, indeed:
>
> # ./include/trace/events/preemptirq.h:40:
> DEFINE_EVENT(preemptirq_template, irq_enable,
>
> with a bunch of asm magic that looks like it's probably a tracepoint.
> I still don't quite see where the "_rcuidle" went.

See the code in include/linux/tracepoint.h and search for "rcuidle"
there. All defined trace events get a "_rcuidle" version, but since they
are declared static inline, they only show up if they are used.

-- Steve

2020-03-01 18:26:59

by Paul E. McKenney

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> Andy Lutomirski <[email protected]> writes:
> > On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> >> Andy Lutomirski <[email protected]> writes:
> >> >> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> >> >> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> >> >> except trace_hardirq_off/on() as those trace functions do not allow to
> >> >> attach anything AFAICT.
> >> >
> >> > Can you point to whatever makes those particular functions special? I
> >> > failed to follow the macro maze.
> >>
> >> Those are not tracepoints and not going through the macro maze. See
> >> kernel/trace/trace_preemptirq.c
> >
> > That has:
> >
> > void trace_hardirqs_on(void)
> > {
> > if (this_cpu_read(tracing_irq_cpu)) {
> > if (!in_nmi())
> > trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> > tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> > this_cpu_write(tracing_irq_cpu, 0);
> > }
> >
> > lockdep_hardirqs_on(CALLER_ADDR0);
> > }
> > EXPORT_SYMBOL(trace_hardirqs_on);
> > NOKPROBE_SYMBOL(trace_hardirqs_on);
> >
> > But this calls trace_irq_enable_rcuidle(), and that's the part of the
> > macro maze I got lost in. I found:
> >
> > #ifdef CONFIG_TRACE_IRQFLAGS
> > DEFINE_EVENT(preemptirq_template, irq_disable,
> > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > TP_ARGS(ip, parent_ip));
> >
> > DEFINE_EVENT(preemptirq_template, irq_enable,
> > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > TP_ARGS(ip, parent_ip));
> > #else
> > #define trace_irq_enable(...)
> > #define trace_irq_disable(...)
> > #define trace_irq_enable_rcuidle(...)
> > #define trace_irq_disable_rcuidle(...)
> > #endif
> >
> > But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> > where I got lost in the macro maze. I looked at the gcc asm output,
> > and there is, indeed:
>
> DEFINE_EVENT
> DECLARE_TRACE
> __DECLARE_TRACE
> __DECLARE_TRACE_RCU
> static inline void trace_##name##_rcuidle(proto)
> __DO_TRACE
> if (rcuidle)
> ....
>
> > But I also don't see why this is any different from any other tracepoint.
>
> Indeed. I took a wrong turn at some point in the macro jungle :)
>
> So tracing itself is fine, but then if you have probes or bpf programs
> attached to a tracepoint these use rcu_read_lock()/unlock() which is
> obviously wrong in rcuidle context.

Definitely, any such code needs to use tricks similar to that of the
tracing code. Or instead use something like SRCU, which is OK with
readers from idle. Or use something like Steve Rostedt's workqueue-based
approach, though please be very careful with this latter, lest the
battery-powered embedded guys come after you for waking up idle CPUs
too often. ;-)

Thanx, Paul

2020-03-01 18:56:08

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
>
> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> > Andy Lutomirski <[email protected]> writes:
> > > On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> > >> Andy Lutomirski <[email protected]> writes:
> > >> >> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> > >> >> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> > >> >> except trace_hardirq_off/on() as those trace functions do not allow to
> > >> >> attach anything AFAICT.
> > >> >
> > >> > Can you point to whatever makes those particular functions special? I
> > >> > failed to follow the macro maze.
> > >>
> > >> Those are not tracepoints and not going through the macro maze. See
> > >> kernel/trace/trace_preemptirq.c
> > >
> > > That has:
> > >
> > > void trace_hardirqs_on(void)
> > > {
> > > if (this_cpu_read(tracing_irq_cpu)) {
> > > if (!in_nmi())
> > > trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> > > tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> > > this_cpu_write(tracing_irq_cpu, 0);
> > > }
> > >
> > > lockdep_hardirqs_on(CALLER_ADDR0);
> > > }
> > > EXPORT_SYMBOL(trace_hardirqs_on);
> > > NOKPROBE_SYMBOL(trace_hardirqs_on);
> > >
> > > But this calls trace_irq_enable_rcuidle(), and that's the part of the
> > > macro maze I got lost in. I found:
> > >
> > > #ifdef CONFIG_TRACE_IRQFLAGS
> > > DEFINE_EVENT(preemptirq_template, irq_disable,
> > > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > > TP_ARGS(ip, parent_ip));
> > >
> > > DEFINE_EVENT(preemptirq_template, irq_enable,
> > > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > > TP_ARGS(ip, parent_ip));
> > > #else
> > > #define trace_irq_enable(...)
> > > #define trace_irq_disable(...)
> > > #define trace_irq_enable_rcuidle(...)
> > > #define trace_irq_disable_rcuidle(...)
> > > #endif
> > >
> > > But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> > > where I got lost in the macro maze. I looked at the gcc asm output,
> > > and there is, indeed:
> >
> > DEFINE_EVENT
> > DECLARE_TRACE
> > __DECLARE_TRACE
> > __DECLARE_TRACE_RCU
> > static inline void trace_##name##_rcuidle(proto)
> > __DO_TRACE
> > if (rcuidle)
> > ....
> >
> > > But I also don't see why this is any different from any other tracepoint.
> >
> > Indeed. I took a wrong turn at some point in the macro jungle :)
> >
> > So tracing itself is fine, but then if you have probes or bpf programs
> > attached to a tracepoint these use rcu_read_lock()/unlock() which is
> > obviously wrong in rcuidle context.
>
> Definitely, any such code needs to use tricks similar to that of the
> tracing code. Or instead use something like SRCU, which is OK with
> readers from idle. Or use something like Steve Rostedt's workqueue-based
> approach, though please be very careful with this latter, lest the
> battery-powered embedded guys come after you for waking up idle CPUs
> too often. ;-)
>

Are we okay if we somehow ensure that all the entry code before
enter_from_user_mode() only does rcuidle tracing variants and has
kprobes off? Including for BPF use cases?

It would be *really* nice if we could statically verify this, as has
been mentioned elsewhere in the thread. It would also probably be
good enough if we could do it at runtime. Maybe with lockdep on, we
verify rcu state in tracepoints even if the tracepoint isn't active?
And we could plausibly have some widget that could inject something
into *every* kprobeable function to check rcu state.

--Andy

2020-03-01 19:35:32

by Paul E. McKenney

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> > > Andy Lutomirski <[email protected]> writes:
> > > > On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> > > >> Andy Lutomirski <[email protected]> writes:
> > > >> >> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> > > >> >> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> > > >> >> except trace_hardirq_off/on() as those trace functions do not allow to
> > > >> >> attach anything AFAICT.
> > > >> >
> > > >> > Can you point to whatever makes those particular functions special? I
> > > >> > failed to follow the macro maze.
> > > >>
> > > >> Those are not tracepoints and not going through the macro maze. See
> > > >> kernel/trace/trace_preemptirq.c
> > > >
> > > > That has:
> > > >
> > > > void trace_hardirqs_on(void)
> > > > {
> > > > if (this_cpu_read(tracing_irq_cpu)) {
> > > > if (!in_nmi())
> > > > trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> > > > tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> > > > this_cpu_write(tracing_irq_cpu, 0);
> > > > }
> > > >
> > > > lockdep_hardirqs_on(CALLER_ADDR0);
> > > > }
> > > > EXPORT_SYMBOL(trace_hardirqs_on);
> > > > NOKPROBE_SYMBOL(trace_hardirqs_on);
> > > >
> > > > But this calls trace_irq_enable_rcuidle(), and that's the part of the
> > > > macro maze I got lost in. I found:
> > > >
> > > > #ifdef CONFIG_TRACE_IRQFLAGS
> > > > DEFINE_EVENT(preemptirq_template, irq_disable,
> > > > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > > > TP_ARGS(ip, parent_ip));
> > > >
> > > > DEFINE_EVENT(preemptirq_template, irq_enable,
> > > > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > > > TP_ARGS(ip, parent_ip));
> > > > #else
> > > > #define trace_irq_enable(...)
> > > > #define trace_irq_disable(...)
> > > > #define trace_irq_enable_rcuidle(...)
> > > > #define trace_irq_disable_rcuidle(...)
> > > > #endif
> > > >
> > > > But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> > > > where I got lost in the macro maze. I looked at the gcc asm output,
> > > > and there is, indeed:
> > >
> > > DEFINE_EVENT
> > >   DECLARE_TRACE
> > >     __DECLARE_TRACE
> > >       __DECLARE_TRACE_RCU
> > >         static inline void trace_##name##_rcuidle(proto)
> > >           __DO_TRACE
> > >             if (rcuidle)
> > >               ....
> > >
> > > > But I also don't see why this is any different from any other tracepoint.
> > >
> > > Indeed. I took a wrong turn at some point in the macro jungle :)
> > >
> > > So tracing itself is fine, but then if you have probes or bpf programs
> > > attached to a tracepoint these use rcu_read_lock()/unlock() which is
> > > obviously wrong in rcuidle context.
> >
> > Definitely, any such code needs to use tricks similar to that of the
> > tracing code. Or instead use something like SRCU, which is OK with
> > readers from idle. Or use something like Steve Rostedt's workqueue-based
> > approach, though please be very careful with this latter, lest the
> > battery-powered embedded guys come after you for waking up idle CPUs
> > too often. ;-)
>
> Are we okay if we somehow ensure that all the entry code before
> enter_from_user_mode() only does rcuidle tracing variants and has
> kprobes off? Including for BPF use cases?

That would work, though if BPF used SRCU instead of RCU, this would
be unnecessary. Sadly, SRCU has full memory barriers in each of
srcu_read_lock() and srcu_read_unlock(), but we are working on it.
(As always, no promises!)

> It would be *really* nice if we could statically verify this, as has
> been mentioned elsewhere in the thread. It would also probably be
> good enough if we could do it at runtime. Maybe with lockdep on, we
> verify rcu state in tracepoints even if the tracepoint isn't active?
> And we could plausibly have some widget that could inject something
> into *every* kprobeable function to check rcu state.

Or just have at least one testing step that activates all tracepoints,
but with lockdep enabled?

Thanx, Paul
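
[Editorial sketch] The SRCU read-side cost mentioned above can be seen in a toy userspace counter scheme. This is illustrative only and far from real SRCU (which needs a two-phase index flip to close races); the point is that each read-side operation is a full-barrier atomic RMW, unlike rcu_read_lock().

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy SRCU-like read-side; names and structure are invented. */
static atomic_int srcu_ctr[2];
static atomic_int srcu_idx;

static int toy_srcu_read_lock(void)
{
        int i = atomic_load(&srcu_idx) & 1;
        /* fetch_add is a full-barrier RMW: the cost Paul mentions */
        atomic_fetch_add(&srcu_ctr[i], 1);
        return i;
}

static void toy_srcu_read_unlock(int i)
{
        atomic_fetch_sub(&srcu_ctr[i], 1);   /* full barrier again */
}

static int toy_srcu_readers(int i)
{
        return atomic_load(&srcu_ctr[i]);
}
```

A grace period would flip srcu_idx and wait for the old counter to drain, which is why readers in idle context are fine: no per-CPU dynticks state is consulted on the read side.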

2020-03-01 19:40:08

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code



> On Mar 1, 2020, at 11:30 AM, Paul E. McKenney <[email protected]> wrote:
>
> On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
>>> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
>>>
>>> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
>>>> Andy Lutomirski <[email protected]> writes:
>>>>> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
>>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>>>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
>>>>>>>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
>>>>>>>> except trace_hardirq_off/on() as those trace functions do not allow to
>>>>>>>> attach anything AFAICT.
>>>>>>>
>>>>>>> Can you point to whatever makes those particular functions special? I
>>>>>>> failed to follow the macro maze.
>>>>>>
>>>>>> Those are not tracepoints and not going through the macro maze. See
>>>>>> kernel/trace/trace_preemptirq.c
>>>>>
>>>>> That has:
>>>>>
>>>>> void trace_hardirqs_on(void)
>>>>> {
>>>>>         if (this_cpu_read(tracing_irq_cpu)) {
>>>>>                 if (!in_nmi())
>>>>>                         trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
>>>>>                 tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
>>>>>                 this_cpu_write(tracing_irq_cpu, 0);
>>>>>         }
>>>>>
>>>>>         lockdep_hardirqs_on(CALLER_ADDR0);
>>>>> }
>>>>> EXPORT_SYMBOL(trace_hardirqs_on);
>>>>> NOKPROBE_SYMBOL(trace_hardirqs_on);
>>>>>
>>>>> But this calls trace_irq_enable_rcuidle(), and that's the part of the
>>>>> macro maze I got lost in. I found:
>>>>>
>>>>> #ifdef CONFIG_TRACE_IRQFLAGS
>>>>> DEFINE_EVENT(preemptirq_template, irq_disable,
>>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
>>>>> TP_ARGS(ip, parent_ip));
>>>>>
>>>>> DEFINE_EVENT(preemptirq_template, irq_enable,
>>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
>>>>> TP_ARGS(ip, parent_ip));
>>>>> #else
>>>>> #define trace_irq_enable(...)
>>>>> #define trace_irq_disable(...)
>>>>> #define trace_irq_enable_rcuidle(...)
>>>>> #define trace_irq_disable_rcuidle(...)
>>>>> #endif
>>>>>
>>>>> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
>>>>> where I got lost in the macro maze. I looked at the gcc asm output,
>>>>> and there is, indeed:
>>>>
>>>> DEFINE_EVENT
>>>>   DECLARE_TRACE
>>>>     __DECLARE_TRACE
>>>>       __DECLARE_TRACE_RCU
>>>>         static inline void trace_##name##_rcuidle(proto)
>>>>           __DO_TRACE
>>>>             if (rcuidle)
>>>>               ....
>>>>
>>>>> But I also don't see why this is any different from any other tracepoint.
>>>>
>>>> Indeed. I took a wrong turn at some point in the macro jungle :)
>>>>
>>>> So tracing itself is fine, but then if you have probes or bpf programs
>>>> attached to a tracepoint these use rcu_read_lock()/unlock() which is
>>>> obviously wrong in rcuidle context.
>>>
>>> Definitely, any such code needs to use tricks similar to that of the
>>> tracing code. Or instead use something like SRCU, which is OK with
>>> readers from idle. Or use something like Steve Rostedt's workqueue-based
>>> approach, though please be very careful with this latter, lest the
>>> battery-powered embedded guys come after you for waking up idle CPUs
>>> too often. ;-)
>>
>> Are we okay if we somehow ensure that all the entry code before
>> enter_from_user_mode() only does rcuidle tracing variants and has
>> kprobes off? Including for BPF use cases?
>
> That would work, though if BPF used SRCU instead of RCU, this would
> be unnecessary. Sadly, SRCU has full memory barriers in each of
> srcu_read_lock() and srcu_read_unlock(), but we are working on it.
> (As always, no promises!)
>
>> It would be *really* nice if we could statically verify this, as has
>> been mentioned elsewhere in the thread. It would also probably be
>> good enough if we could do it at runtime. Maybe with lockdep on, we
>> verify rcu state in tracepoints even if the tracepoint isn't active?
>> And we could plausibly have some widget that could inject something
>> into *every* kprobeable function to check rcu state.
>
> Or just have at least one testing step that activates all tracepoints,
> but with lockdep enabled?

Also kprobe.

I don’t suppose we could make notrace imply nokprobe. Then all kprobeable functions would also have entry/exit tracepoints, right?
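
[Editorial sketch] The implication proposed above can be modeled as a toy flag rule (flag names invented): if a function is marked notrace, treat it as nokprobe too, so everything kprobeable remains ftrace-visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical annotation flags, not the kernel's real ones. */
enum { FL_NOTRACE = 1 << 0, FL_NOKPROBE = 1 << 1 };

/* Proposed rule: a notrace function is also not kprobeable. */
static int effective_flags(int f)
{
        if (f & FL_NOTRACE)
                f |= FL_NOKPROBE;
        return f;
}

static bool kprobe_allowed(int f)
{
        return !(effective_flags(f) & FL_NOKPROBE);
}
```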

2020-03-01 20:18:47

by Paul E. McKenney

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, Mar 01, 2020 at 11:39:42AM -0800, Andy Lutomirski wrote:
>
>
> > On Mar 1, 2020, at 11:30 AM, Paul E. McKenney <[email protected]> wrote:
> >
> > On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
> >>> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
> >>>
> >>> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> >>>> Andy Lutomirski <[email protected]> writes:
> >>>>> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> >>>>>> Andy Lutomirski <[email protected]> writes:
> >>>>>>>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> >>>>>>>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> >>>>>>>> except trace_hardirq_off/on() as those trace functions do not allow to
> >>>>>>>> attach anything AFAICT.
> >>>>>>>
> >>>>>>> Can you point to whatever makes those particular functions special? I
> >>>>>>> failed to follow the macro maze.
> >>>>>>
> >>>>>> Those are not tracepoints and not going through the macro maze. See
> >>>>>> kernel/trace/trace_preemptirq.c
> >>>>>
> >>>>> That has:
> >>>>>
> >>>>> void trace_hardirqs_on(void)
> >>>>> {
> >>>>>         if (this_cpu_read(tracing_irq_cpu)) {
> >>>>>                 if (!in_nmi())
> >>>>>                         trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> >>>>>                 tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> >>>>>                 this_cpu_write(tracing_irq_cpu, 0);
> >>>>>         }
> >>>>>
> >>>>>         lockdep_hardirqs_on(CALLER_ADDR0);
> >>>>> }
> >>>>> EXPORT_SYMBOL(trace_hardirqs_on);
> >>>>> NOKPROBE_SYMBOL(trace_hardirqs_on);
> >>>>>
> >>>>> But this calls trace_irq_enable_rcuidle(), and that's the part of the
> >>>>> macro maze I got lost in. I found:
> >>>>>
> >>>>> #ifdef CONFIG_TRACE_IRQFLAGS
> >>>>> DEFINE_EVENT(preemptirq_template, irq_disable,
> >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> >>>>> TP_ARGS(ip, parent_ip));
> >>>>>
> >>>>> DEFINE_EVENT(preemptirq_template, irq_enable,
> >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> >>>>> TP_ARGS(ip, parent_ip));
> >>>>> #else
> >>>>> #define trace_irq_enable(...)
> >>>>> #define trace_irq_disable(...)
> >>>>> #define trace_irq_enable_rcuidle(...)
> >>>>> #define trace_irq_disable_rcuidle(...)
> >>>>> #endif
> >>>>>
> >>>>> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> >>>>> where I got lost in the macro maze. I looked at the gcc asm output,
> >>>>> and there is, indeed:
> >>>>
> >>>> DEFINE_EVENT
> >>>>   DECLARE_TRACE
> >>>>     __DECLARE_TRACE
> >>>>       __DECLARE_TRACE_RCU
> >>>>         static inline void trace_##name##_rcuidle(proto)
> >>>>           __DO_TRACE
> >>>>             if (rcuidle)
> >>>>               ....
> >>>>
> >>>>> But I also don't see why this is any different from any other tracepoint.
> >>>>
> >>>> Indeed. I took a wrong turn at some point in the macro jungle :)
> >>>>
> >>>> So tracing itself is fine, but then if you have probes or bpf programs
> >>>> attached to a tracepoint these use rcu_read_lock()/unlock() which is
> >>>> obviously wrong in rcuidle context.
> >>>
> >>> Definitely, any such code needs to use tricks similar to that of the
> >>> tracing code. Or instead use something like SRCU, which is OK with
> >>> readers from idle. Or use something like Steve Rostedt's workqueue-based
> >>> approach, though please be very careful with this latter, lest the
> >>> battery-powered embedded guys come after you for waking up idle CPUs
> >>> too often. ;-)
> >>
> >> Are we okay if we somehow ensure that all the entry code before
> >> enter_from_user_mode() only does rcuidle tracing variants and has
> >> kprobes off? Including for BPF use cases?
> >
> > That would work, though if BPF used SRCU instead of RCU, this would
> > be unnecessary. Sadly, SRCU has full memory barriers in each of
> > srcu_read_lock() and srcu_read_unlock(), but we are working on it.
> > (As always, no promises!)
> >
> >> It would be *really* nice if we could statically verify this, as has
> >> been mentioned elsewhere in the thread. It would also probably be
> >> good enough if we could do it at runtime. Maybe with lockdep on, we
> >> verify rcu state in tracepoints even if the tracepoint isn't active?
> >> And we could plausibly have some widget that could inject something
> >> into *every* kprobeable function to check rcu state.
> >
> > Or just have at least one testing step that activates all tracepoints,
> > but with lockdep enabled?
>
> Also kprobe.

I didn't realize we could test all kprobe points easily, but yes,
that would also be good.

> I don’t suppose we could make notrace imply nokprobe. Then all kprobeable functions would also have entry/exit tracepoints, right?

On this, I must defer to the tracing and kprobes people.

Thanx, Paul

2020-03-02 00:35:52

by Steven Rostedt

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, 1 Mar 2020 11:39:42 -0800
Andy Lutomirski <[email protected]> wrote:

> > On Mar 1, 2020, at 11:30 AM, Paul E. McKenney <[email protected]> wrote:
> >
> > On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
> >>> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
> >>>
> >>> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> >>>> Andy Lutomirski <[email protected]> writes:
> >>>>> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> >>>>>> Andy Lutomirski <[email protected]> writes:
> >>>>>>>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> >>>>>>>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> >>>>>>>> except trace_hardirq_off/on() as those trace functions do not allow to
> >>>>>>>> attach anything AFAICT.
> >>>>>>>
> >>>>>>> Can you point to whatever makes those particular functions special? I
> >>>>>>> failed to follow the macro maze.
> >>>>>>
> >>>>>> Those are not tracepoints and not going through the macro maze. See
> >>>>>> kernel/trace/trace_preemptirq.c
> >>>>>
> >>>>> That has:
> >>>>>
> >>>>> void trace_hardirqs_on(void)
> >>>>> {
> >>>>>         if (this_cpu_read(tracing_irq_cpu)) {
> >>>>>                 if (!in_nmi())
> >>>>>                         trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> >>>>>                 tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> >>>>>                 this_cpu_write(tracing_irq_cpu, 0);
> >>>>>         }
> >>>>>
> >>>>>         lockdep_hardirqs_on(CALLER_ADDR0);
> >>>>> }
> >>>>> EXPORT_SYMBOL(trace_hardirqs_on);
> >>>>> NOKPROBE_SYMBOL(trace_hardirqs_on);
> >>>>>
> >>>>> But this calls trace_irq_enable_rcuidle(), and that's the part of the
> >>>>> macro maze I got lost in. I found:
> >>>>>
> >>>>> #ifdef CONFIG_TRACE_IRQFLAGS
> >>>>> DEFINE_EVENT(preemptirq_template, irq_disable,
> >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> >>>>> TP_ARGS(ip, parent_ip));
> >>>>>
> >>>>> DEFINE_EVENT(preemptirq_template, irq_enable,
> >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> >>>>> TP_ARGS(ip, parent_ip));
> >>>>> #else
> >>>>> #define trace_irq_enable(...)
> >>>>> #define trace_irq_disable(...)
> >>>>> #define trace_irq_enable_rcuidle(...)
> >>>>> #define trace_irq_disable_rcuidle(...)
> >>>>> #endif
> >>>>>
> >>>>> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> >>>>> where I got lost in the macro maze. I looked at the gcc asm output,
> >>>>> and there is, indeed:
> >>>>
> >>>> DEFINE_EVENT
> >>>>   DECLARE_TRACE
> >>>>     __DECLARE_TRACE
> >>>>       __DECLARE_TRACE_RCU
> >>>>         static inline void trace_##name##_rcuidle(proto)
> >>>>           __DO_TRACE
> >>>>             if (rcuidle)
> >>>>               ....
> >>>>
> >>>>> But I also don't see why this is any different from any other tracepoint.
> >>>>
> >>>> Indeed. I took a wrong turn at some point in the macro jungle :)
> >>>>
> >>>> So tracing itself is fine, but then if you have probes or bpf programs
> >>>> attached to a tracepoint these use rcu_read_lock()/unlock() which is
> >>>> obviously wrong in rcuidle context.
> >>>
> >>> Definitely, any such code needs to use tricks similar to that of the
> >>> tracing code. Or instead use something like SRCU, which is OK with
> >>> readers from idle. Or use something like Steve Rostedt's workqueue-based
> >>> approach, though please be very careful with this latter, lest the
> >>> battery-powered embedded guys come after you for waking up idle CPUs
> >>> too often. ;-)
> >>
> >> Are we okay if we somehow ensure that all the entry code before
> >> enter_from_user_mode() only does rcuidle tracing variants and has
> >> kprobes off? Including for BPF use cases?
> >
> > That would work, though if BPF used SRCU instead of RCU, this would
> > be unnecessary. Sadly, SRCU has full memory barriers in each of
> > srcu_read_lock() and srcu_read_unlock(), but we are working on it.
> > (As always, no promises!)
> >
> >> It would be *really* nice if we could statically verify this, as has
> >> been mentioned elsewhere in the thread. It would also probably be
> >> good enough if we could do it at runtime. Maybe with lockdep on, we
> >> verify rcu state in tracepoints even if the tracepoint isn't active?
> >> And we could plausibly have some widget that could inject something
> >> into *every* kprobeable function to check rcu state.
> >
> > Or just have at least one testing step that activates all tracepoints,
> > but with lockdep enabled?
>
> Also kprobe.
>
> I don’t suppose we could make notrace imply nokprobe. Then all
> kprobeable functions would also have entry/exit tracepoints, right?

There used to be code that prevented a kprobe from being placed in a
function that was not in the ftrace mcount table (which would make this
happen). But I think that was changed because it was too restrictive.

Masami?

-- Steve
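
[Editorial sketch] The restriction Steve recalls amounts to an allowlist lookup at probe-registration time. A toy version (addresses, table, and function names all invented):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the ftrace mcount table: patchable call sites. */
static const unsigned long mcount_table[] = { 0x1000, 0x2000, 0x3000 };

static int toy_register_kprobe(unsigned long addr)
{
        size_t n;

        for (n = 0; n < sizeof(mcount_table) / sizeof(mcount_table[0]); n++)
                if (mcount_table[n] == addr)
                        return 0;     /* ftrace knows this site: allow */
        return -22;                   /* -EINVAL: not traceable, reject */
}
```

Under such a rule, every kprobeable address would by construction also have an ftrace entry hook, which is what would make Andy's scheme work.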

2020-03-02 01:16:57

by Joel Fernandes

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> > > Andy Lutomirski <[email protected]> writes:
> > > > On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> > > >> Andy Lutomirski <[email protected]> writes:
> > > >> >> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> > > >> >> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> > > >> >> except trace_hardirq_off/on() as those trace functions do not allow to
> > > >> >> attach anything AFAICT.
> > > >> >
> > > >> > Can you point to whatever makes those particular functions special? I
> > > >> > failed to follow the macro maze.
> > > >>
> > > >> Those are not tracepoints and not going through the macro maze. See
> > > >> kernel/trace/trace_preemptirq.c
> > > >
> > > > That has:
> > > >
> > > > void trace_hardirqs_on(void)
> > > > {
> > > >         if (this_cpu_read(tracing_irq_cpu)) {
> > > >                 if (!in_nmi())
> > > >                         trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> > > >                 tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> > > >                 this_cpu_write(tracing_irq_cpu, 0);
> > > >         }
> > > >
> > > >         lockdep_hardirqs_on(CALLER_ADDR0);
> > > > }
> > > > EXPORT_SYMBOL(trace_hardirqs_on);
> > > > NOKPROBE_SYMBOL(trace_hardirqs_on);
> > > >
> > > > But this calls trace_irq_enable_rcuidle(), and that's the part of the
> > > > macro maze I got lost in. I found:
> > > >
> > > > #ifdef CONFIG_TRACE_IRQFLAGS
> > > > DEFINE_EVENT(preemptirq_template, irq_disable,
> > > > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > > > TP_ARGS(ip, parent_ip));
> > > >
> > > > DEFINE_EVENT(preemptirq_template, irq_enable,
> > > > TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > > > TP_ARGS(ip, parent_ip));
> > > > #else
> > > > #define trace_irq_enable(...)
> > > > #define trace_irq_disable(...)
> > > > #define trace_irq_enable_rcuidle(...)
> > > > #define trace_irq_disable_rcuidle(...)
> > > > #endif
> > > >
> > > > But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> > > > where I got lost in the macro maze. I looked at the gcc asm output,
> > > > and there is, indeed:
> > >
> > > DEFINE_EVENT
> > >   DECLARE_TRACE
> > >     __DECLARE_TRACE
> > >       __DECLARE_TRACE_RCU
> > >         static inline void trace_##name##_rcuidle(proto)
> > >           __DO_TRACE
> > >             if (rcuidle)
> > >               ....
> > >
> > > > But I also don't see why this is any different from any other tracepoint.
> > >
> > > Indeed. I took a wrong turn at some point in the macro jungle :)
> > >
> > > So tracing itself is fine, but then if you have probes or bpf programs
> > > attached to a tracepoint these use rcu_read_lock()/unlock() which is
> > > obviously wrong in rcuidle context.
> >
> > Definitely, any such code needs to use tricks similar to that of the
> > tracing code. Or instead use something like SRCU, which is OK with
> > readers from idle. Or use something like Steve Rostedt's workqueue-based
> > approach, though please be very careful with this latter, lest the
> > battery-powered embedded guys come after you for waking up idle CPUs
> > too often. ;-)
> >
>
> Are we okay if we somehow ensure that all the entry code before
> enter_from_user_mode() only does rcuidle tracing variants and has
> kprobes off? Including for BPF use cases?
>
> It would be *really* nice if we could statically verify this, as has
> been mentioned elsewhere in the thread. It would also probably be
> good enough if we could do it at runtime. Maybe with lockdep on, we
> verify rcu state in tracepoints even if the tracepoint isn't active?
> And we could plausibly have some widget that could inject something
> into *every* kprobeable function to check rcu state.

You are talking about verifying that a non-rcuidle tracepoint is not
called when RCU is not watching, right? I think that's fine, though I feel
lockdep kernels should not be slowed down any more than they already are.
If we keep adding checks to lockdep-enabled kernels, over time they become
too slow even for "debug" kernels. Maybe it is time for a
CONFIG_LOCKDEP_SLOW or some such? Then anyone who wants to go crazy on
runtime checking can do so. I myself want to add a few.

Note that the checking is being added to "non rcu-idle" tracepoints, many
of which are probably always called when RCU is watching, making such
checking useless for those tracepoints (while still slowing them down,
however slightly).

Also note that the whole reason we are getting rid of the "make RCU watch
when rcuidle" logic in DO_TRACE is that it is slow for tracepoints that
are called frequently. Another reason is that tracepoint callbacks are
expected to know what they are doing and turn on RCU watching as
appropriate (as the consensus on the matter suggests).

thanks,

- Joel
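
[Editorial sketch] The per-hit cost being removed here comes from the tracepoint core doing the RCU transition itself on every rcuidle invocation. A toy model of the old behaviour (all names invented):

```c
#include <assert.h>
#include <stdbool.h>

static bool rcu_watching;      /* toy dynticks state */
static int rcu_transitions;    /* how often the entry/exit cost was paid */
static int cb_saw_rcu;

static void toy_rcu_enter(void) { rcu_watching = true;  rcu_transitions++; }
static void toy_rcu_exit(void)  { rcu_watching = false; rcu_transitions++; }

/* Old-style DO_TRACE behaviour: the core pays on every rcuidle hit. */
static void do_trace(bool rcuidle, void (*cb)(void))
{
        if (rcuidle)
                toy_rcu_enter();
        cb();
        if (rcuidle)
                toy_rcu_exit();
}

static void toy_callback(void)
{
        if (rcu_watching)
                cb_saw_rcu++;
}
```

Two rcuidle hits cost four transitions here even if no callback needs RCU, which is the overhead the removal avoids by pushing the responsibility into the callbacks.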

2020-03-02 02:19:49

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code



> On Mar 1, 2020, at 5:10 PM, Joel Fernandes <[email protected]> wrote:
>
> On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
>>> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
>>>
>>> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
>>>> Andy Lutomirski <[email protected]> writes:
>>>>> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
>>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>>>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
>>>>>>>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
>>>>>>>> except trace_hardirq_off/on() as those trace functions do not allow to
>>>>>>>> attach anything AFAICT.
>>>>>>>
>>>>>>> Can you point to whatever makes those particular functions special? I
>>>>>>> failed to follow the macro maze.
>>>>>>
>>>>>> Those are not tracepoints and not going through the macro maze. See
>>>>>> kernel/trace/trace_preemptirq.c
>>>>>
>>>>> That has:
>>>>>
>>>>> void trace_hardirqs_on(void)
>>>>> {
>>>>>         if (this_cpu_read(tracing_irq_cpu)) {
>>>>>                 if (!in_nmi())
>>>>>                         trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
>>>>>                 tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
>>>>>                 this_cpu_write(tracing_irq_cpu, 0);
>>>>>         }
>>>>>
>>>>>         lockdep_hardirqs_on(CALLER_ADDR0);
>>>>> }
>>>>> EXPORT_SYMBOL(trace_hardirqs_on);
>>>>> NOKPROBE_SYMBOL(trace_hardirqs_on);
>>>>>
>>>>> But this calls trace_irq_enable_rcuidle(), and that's the part of the
>>>>> macro maze I got lost in. I found:
>>>>>
>>>>> #ifdef CONFIG_TRACE_IRQFLAGS
>>>>> DEFINE_EVENT(preemptirq_template, irq_disable,
>>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
>>>>> TP_ARGS(ip, parent_ip));
>>>>>
>>>>> DEFINE_EVENT(preemptirq_template, irq_enable,
>>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
>>>>> TP_ARGS(ip, parent_ip));
>>>>> #else
>>>>> #define trace_irq_enable(...)
>>>>> #define trace_irq_disable(...)
>>>>> #define trace_irq_enable_rcuidle(...)
>>>>> #define trace_irq_disable_rcuidle(...)
>>>>> #endif
>>>>>
>>>>> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
>>>>> where I got lost in the macro maze. I looked at the gcc asm output,
>>>>> and there is, indeed:
>>>>
>>>> DEFINE_EVENT
>>>>   DECLARE_TRACE
>>>>     __DECLARE_TRACE
>>>>       __DECLARE_TRACE_RCU
>>>>         static inline void trace_##name##_rcuidle(proto)
>>>>           __DO_TRACE
>>>>             if (rcuidle)
>>>>               ....
>>>>
>>>>> But I also don't see why this is any different from any other tracepoint.
>>>>
>>>> Indeed. I took a wrong turn at some point in the macro jungle :)
>>>>
>>>> So tracing itself is fine, but then if you have probes or bpf programs
>>>> attached to a tracepoint these use rcu_read_lock()/unlock() which is
>>>> obviously wrong in rcuidle context.
>>>
>>> Definitely, any such code needs to use tricks similar to that of the
>>> tracing code. Or instead use something like SRCU, which is OK with
>>> readers from idle. Or use something like Steve Rostedt's workqueue-based
>>> approach, though please be very careful with this latter, lest the
>>> battery-powered embedded guys come after you for waking up idle CPUs
>>> too often. ;-)
>>>
>>
>> Are we okay if we somehow ensure that all the entry code before
>> enter_from_user_mode() only does rcuidle tracing variants and has
>> kprobes off? Including for BPF use cases?
>>
>> It would be *really* nice if we could statically verify this, as has
>> been mentioned elsewhere in the thread. It would also probably be
>> good enough if we could do it at runtime. Maybe with lockdep on, we
>> verify rcu state in tracepoints even if the tracepoint isn't active?
>> And we could plausibly have some widget that could inject something
>> into *every* kprobeable function to check rcu state.
>
> You are talking about verifying that a non-rcuidle tracepoint is not
> called when RCU is not watching, right? I think that's fine, though I feel
> lockdep kernels should not be slowed down any more than they already are.
> If we keep adding checks to lockdep-enabled kernels, over time they become
> too slow even for "debug" kernels. Maybe it is time for a
> CONFIG_LOCKDEP_SLOW or some such? Then anyone who wants to go crazy on
> runtime checking can do so. I myself want to add a few.
>
> Note that the checking is being added to "non rcu-idle" tracepoints, many
> of which are probably always called when RCU is watching, making such
> checking useless for those tracepoints (while still slowing them down,
> however slightly).
>

Indeed. Static analysis would help a lot here.

> Also note that the whole reason we are getting rid of the "make RCU watch
> when rcuidle" logic in DO_TRACE is that it is slow for tracepoints that
> are called frequently. Another reason is that tracepoint callbacks are
> expected to know what they are doing and turn on RCU watching as
> appropriate (as the consensus on the matter suggests).

Whoa there. We arch people need crystal clear rules as to what tracepoints can be called in what contexts and what is the responsibility of the callbacks.

>
> thanks,
>
> - Joel
>

2020-03-02 02:37:32

by Joel Fernandes

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, Mar 01, 2020 at 06:18:51PM -0800, Andy Lutomirski wrote:
>
>
> > On Mar 1, 2020, at 5:10 PM, Joel Fernandes <[email protected]> wrote:
> >
> > On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
> >>> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
> >>>
> >>> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> >>>> Andy Lutomirski <[email protected]> writes:
> >>>>> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> >>>>>> Andy Lutomirski <[email protected]> writes:
> >>>>>>>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> >>>>>>>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> >>>>>>>> except trace_hardirq_off/on() as those trace functions do not allow to
> >>>>>>>> attach anything AFAICT.
> >>>>>>>
> >>>>>>> Can you point to whatever makes those particular functions special? I
> >>>>>>> failed to follow the macro maze.
> >>>>>>
> >>>>>> Those are not tracepoints and not going through the macro maze. See
> >>>>>> kernel/trace/trace_preemptirq.c
> >>>>>
> >>>>> That has:
> >>>>>
> >>>>> void trace_hardirqs_on(void)
> >>>>> {
> >>>>>         if (this_cpu_read(tracing_irq_cpu)) {
> >>>>>                 if (!in_nmi())
> >>>>>                         trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> >>>>>                 tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> >>>>>                 this_cpu_write(tracing_irq_cpu, 0);
> >>>>>         }
> >>>>>
> >>>>>         lockdep_hardirqs_on(CALLER_ADDR0);
> >>>>> }
> >>>>> EXPORT_SYMBOL(trace_hardirqs_on);
> >>>>> NOKPROBE_SYMBOL(trace_hardirqs_on);
> >>>>>
> >>>>> But this calls trace_irq_enable_rcuidle(), and that's the part of the
> >>>>> macro maze I got lost in. I found:
> >>>>>
> >>>>> #ifdef CONFIG_TRACE_IRQFLAGS
> >>>>> DEFINE_EVENT(preemptirq_template, irq_disable,
> >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> >>>>> TP_ARGS(ip, parent_ip));
> >>>>>
> >>>>> DEFINE_EVENT(preemptirq_template, irq_enable,
> >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> >>>>> TP_ARGS(ip, parent_ip));
> >>>>> #else
> >>>>> #define trace_irq_enable(...)
> >>>>> #define trace_irq_disable(...)
> >>>>> #define trace_irq_enable_rcuidle(...)
> >>>>> #define trace_irq_disable_rcuidle(...)
> >>>>> #endif
> >>>>>
> >>>>> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> >>>>> where I got lost in the macro maze. I looked at the gcc asm output,
> >>>>> and there is, indeed:
> >>>>
> >>>> DEFINE_EVENT
> >>>>   DECLARE_TRACE
> >>>>     __DECLARE_TRACE
> >>>>       __DECLARE_TRACE_RCU
> >>>>         static inline void trace_##name##_rcuidle(proto)
> >>>>           __DO_TRACE
> >>>>             if (rcuidle)
> >>>>               ....
> >>>>
> >>>>> But I also don't see why this is any different from any other tracepoint.
> >>>>
> >>>> Indeed. I took a wrong turn at some point in the macro jungle :)
> >>>>
> >>>> So tracing itself is fine, but then if you have probes or bpf programs
> >>>> attached to a tracepoint these use rcu_read_lock()/unlock() which is
> >>>> obviously wrong in rcuidle context.
> >>>
> >>> Definitely, any such code needs to use tricks similar to that of the
> >>> tracing code. Or instead use something like SRCU, which is OK with
> >>> readers from idle. Or use something like Steve Rostedt's workqueue-based
> >>> approach, though please be very careful with this latter, lest the
> >>> battery-powered embedded guys come after you for waking up idle CPUs
> >>> too often. ;-)
> >>>
> >>
> >> Are we okay if we somehow ensure that all the entry code before
> >> enter_from_user_mode() only does rcuidle tracing variants and has
> >> kprobes off? Including for BPF use cases?
> >>
> >> It would be *really* nice if we could statically verify this, as has
> >> been mentioned elsewhere in the thread. It would also probably be
> >> good enough if we could do it at runtime. Maybe with lockdep on, we
> >> verify rcu state in tracepoints even if the tracepoint isn't active?
> >> And we could plausibly have some widget that could inject something
> >> into *every* kprobeable function to check rcu state.
> >
> > You are talking about verifying that a non-rcuidle tracepoint is not
> > called when RCU is not watching, right? I think that's fine, though I feel
> > lockdep kernels should not be slowed down any more than they already are.
> > If we keep adding checks to lockdep-enabled kernels, over time they become
> > too slow even for "debug" kernels. Maybe it is time for a
> > CONFIG_LOCKDEP_SLOW or some such? Then anyone who wants to go crazy on
> > runtime checking can do so. I myself want to add a few.
> >
> > Note that the checking is being added into "non rcu-idle" tracepoints many of
> > which are probably always called when RCU is watching, making such checking
> > useless for those tracepoints (and slowing them down, however slightly).
> >
>
> Indeed. Static analysis would help a lot here.
>
> > Also another note would be that the whole reason we are getting rid of the
> > "make RCU watch when rcuidle" logic in DO_TRACE is because it is slow for
> > tracepoints that are frequently called into. Another reason to do it is
> > because tracepoint callbacks are expected to know what they are doing and
> > turn on RCU watching as appropriate (as consensus on the matter suggests).
>
> Whoa there. We arch people need crystal clear rules as to what tracepoints
> can be called in what contexts and what is the responsibility of the
> callbacks.
>

The direction that Peter, Mathieu and Steve are going is that callbacks
registered on "rcu idle" tracepoints need to turn on "RCU watching"
themselves, as perf does. Turning on "RCU watching" is not free, as I
measured in 2017; we removed it back then, but it was added right back when
perf started splatting. Now it is being removed again, and turning RCU's
eyes on happens in the perf callback itself, since perf uses RCU.

If you are calling trace_.._rcuidle(), can you not ensure that RCU is
watching by calling the appropriate RCU APIs in your callbacks? Or did I miss
the point?

thanks,

- Joel

2020-03-02 05:42:19

by Andy Lutomirski

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code



> On Mar 1, 2020, at 6:36 PM, Joel Fernandes <[email protected]> wrote:
>
> On Sun, Mar 01, 2020 at 06:18:51PM -0800, Andy Lutomirski wrote:
>>
>>
>>>> On Mar 1, 2020, at 5:10 PM, Joel Fernandes <[email protected]> wrote:
>>>
>>> On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
>>>>> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
>>>>>
>>>>> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
>>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>>> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
>>>>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>>>>>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
>>>>>>>>>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
>>>>>>>>>> except trace_hardirq_off/on() as those trace functions do not allow to
>>>>>>>>>> attach anything AFAICT.
>>>>>>>>>
>>>>>>>>> Can you point to whatever makes those particular functions special? I
>>>>>>>>> failed to follow the macro maze.
>>>>>>>>
>>>>>>>> Those are not tracepoints and not going through the macro maze. See
>>>>>>>> kernel/trace/trace_preemptirq.c
>>>>>>>
>>>>>>> That has:
>>>>>>>
>>>>>>> void trace_hardirqs_on(void)
>>>>>>> {
>>>>>>> if (this_cpu_read(tracing_irq_cpu)) {
>>>>>>> if (!in_nmi())
>>>>>>> trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
>>>>>>> tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
>>>>>>> this_cpu_write(tracing_irq_cpu, 0);
>>>>>>> }
>>>>>>>
>>>>>>> lockdep_hardirqs_on(CALLER_ADDR0);
>>>>>>> }
>>>>>>> EXPORT_SYMBOL(trace_hardirqs_on);
>>>>>>> NOKPROBE_SYMBOL(trace_hardirqs_on);
>>>>>>>
>>>>>>> But this calls trace_irq_enable_rcuidle(), and that's the part of the
>>>>>>> macro maze I got lost in. I found:
>>>>>>>
>>>>>>> #ifdef CONFIG_TRACE_IRQFLAGS
>>>>>>> DEFINE_EVENT(preemptirq_template, irq_disable,
>>>>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
>>>>>>> TP_ARGS(ip, parent_ip));
>>>>>>>
>>>>>>> DEFINE_EVENT(preemptirq_template, irq_enable,
>>>>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
>>>>>>> TP_ARGS(ip, parent_ip));
>>>>>>> #else
>>>>>>> #define trace_irq_enable(...)
>>>>>>> #define trace_irq_disable(...)
>>>>>>> #define trace_irq_enable_rcuidle(...)
>>>>>>> #define trace_irq_disable_rcuidle(...)
>>>>>>> #endif
>>>>>>>
>>>>>>> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
>>>>>>> where I got lost in the macro maze. I looked at the gcc asm output,
>>>>>>> and there is, indeed:
>>>>>>
>>>>>> DEFINE_EVENT
>>>>>> DECLARE_TRACE
>>>>>> __DECLARE_TRACE
>>>>>> __DECLARE_TRACE_RCU
>>>>>> static inline void trace_##name##_rcuidle(proto)
>>>>>> __DO_TRACE
>>>>>> if (rcuidle)
>>>>>> ....
>>>>>>
>>>>>>> But I also don't see why this is any different from any other tracepoint.
>>>>>>
>>>>>> Indeed. I took a wrong turn at some point in the macro jungle :)
>>>>>>
>>>>>> So tracing itself is fine, but then if you have probes or bpf programs
>>>>>> attached to a tracepoint these use rcu_read_lock()/unlock() which is
>>>>>> obviously wrong in rcuidle context.
>>>>>
>>>>> Definitely, any such code needs to use tricks similar to that of the
>>>>> tracing code. Or instead use something like SRCU, which is OK with
>>>>> readers from idle. Or use something like Steve Rostedt's workqueue-based
>>>>> approach, though please be very careful with this latter, lest the
>>>>> battery-powered embedded guys come after you for waking up idle CPUs
>>>>> too often. ;-)
>>>>>
>>>>
>>>> Are we okay if we somehow ensure that all the entry code before
>>>> enter_from_user_mode() only does rcuidle tracing variants and has
>>>> kprobes off? Including for BPF use cases?
>>>>
>>>> It would be *really* nice if we could statically verify this, as has
>>>> been mentioned elsewhere in the thread. It would also probably be
>>>> good enough if we could do it at runtime. Maybe with lockdep on, we
>>>> verify rcu state in tracepoints even if the tracepoint isn't active?
>>>> And we could plausibly have some widget that could inject something
>>>> into *every* kprobeable function to check rcu state.
>>>
>>> You are talking about verifying that a non-rcuidle tracepoint is not called
>>> into when RCU is not watching right? I think that's fine, though I feel
>>> lockdep kernels should not be slowed down any more than they already are. I
>>> feel over time if we add too many checks to lockdep enabled kernels, then it
>>> becomes too slow even for "debug" kernels. Maybe it is time for a
>>> CONFIG_LOCKDEP_SLOW or some such? And then anyone who wants to go crazy on
>>> runtime checking can do so. I myself want to add a few.
>>>
>>> Note that the checking is being added into "non rcu-idle" tracepoints many of
>>> which are probably always called when RCU is watching, making such checking
>>> useless for those tracepoints (and slowing them down, however slightly).
>>>
>>
>> Indeed. Static analysis would help a lot here.
>>
>>> Also another note would be that the whole reason we are getting rid of the
>>> "make RCU watch when rcuidle" logic in DO_TRACE is because it is slow for
>>> tracepoints that are frequently called into. Another reason to do it is
>>> because tracepoint callbacks are expected to know what they are doing and
>>> turn on RCU watching as appropriate (as consensus on the matter suggests).
>>
>> Whoa there. We arch people need crystal clear rules as to what tracepoints
>> can be called in what contexts and what is the responsibility of the
>> callbacks.
>>
>
> The direction that Peter, Mathieu and Steve are going is that callbacks
> registered on "rcu idle" tracepoints need to turn on "RCU watching"
> themselves. Such as perf. Turning on "RCU watching" is non-free as I tested
> in 2017 and we removed it back then, but it was added right back when perf
> started splatting. Now it is being removed again, and the turning on of RCU's
> eyes happens in the perf callback itself since perf uses RCU.
>
> If you are calling trace_.._rcuidle(), can you not ensure that RCU is
> watching by calling the appropriate RCU APIs in your callbacks? Or did I miss
> the point?

Unless I’ve misunderstood you, the problem is that the callback and the caller are totally different subsystems. I can count the number of people who know which tracepoints need RCU turned on in their callbacks without using any fingers (tglx and I are still at least a bit confused — I’m not convinced *anyone* knows for sure).

I think we need some credible rule for which events need special handling in their callbacks, and then we need to ensure that all subsystems that install callbacks, including BPF, follow the rule. It should not be up to the author of a BPF program to get this right.

>
> thanks,
>
> - Joel
>

2020-03-02 06:48:03

by Masami Hiramatsu

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

On Sun, 1 Mar 2020 19:35:01 -0500
Steven Rostedt <[email protected]> wrote:

> On Sun, 1 Mar 2020 11:39:42 -0800
> Andy Lutomirski <[email protected]> wrote:
>
> > > On Mar 1, 2020, at 11:30 AM, Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Sun, Mar 01, 2020 at 10:54:23AM -0800, Andy Lutomirski wrote:
> > >>> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
> > >>>
> > >>> On Sun, Mar 01, 2020 at 07:12:25PM +0100, Thomas Gleixner wrote:
> > >>>> Andy Lutomirski <[email protected]> writes:
> > >>>>> On Sun, Mar 1, 2020 at 7:21 AM Thomas Gleixner <[email protected]> wrote:
> > >>>>>> Andy Lutomirski <[email protected]> writes:
> > >>>>>>>> On Mar 1, 2020, at 2:16 AM, Thomas Gleixner <[email protected]> wrote:
> > >>>>>>>> Ok, but for the time being anything before/after CONTEXT_KERNEL is unsafe
> > >>>>>>>> except trace_hardirq_off/on() as those trace functions do not allow to
> > >>>>>>>> attach anything AFAICT.
> > >>>>>>>
> > >>>>>>> Can you point to whatever makes those particular functions special? I
> > >>>>>>> failed to follow the macro maze.
> > >>>>>>
> > >>>>>> Those are not tracepoints and not going through the macro maze. See
> > >>>>>> kernel/trace/trace_preemptirq.c
> > >>>>>
> > >>>>> That has:
> > >>>>>
> > >>>>> void trace_hardirqs_on(void)
> > >>>>> {
> > >>>>> if (this_cpu_read(tracing_irq_cpu)) {
> > >>>>> if (!in_nmi())
> > >>>>> trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> > >>>>> tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> > >>>>> this_cpu_write(tracing_irq_cpu, 0);
> > >>>>> }
> > >>>>>
> > >>>>> lockdep_hardirqs_on(CALLER_ADDR0);
> > >>>>> }
> > >>>>> EXPORT_SYMBOL(trace_hardirqs_on);
> > >>>>> NOKPROBE_SYMBOL(trace_hardirqs_on);
> > >>>>>
> > >>>>> But this calls trace_irq_enable_rcuidle(), and that's the part of the
> > >>>>> macro maze I got lost in. I found:
> > >>>>>
> > >>>>> #ifdef CONFIG_TRACE_IRQFLAGS
> > >>>>> DEFINE_EVENT(preemptirq_template, irq_disable,
> > >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > >>>>> TP_ARGS(ip, parent_ip));
> > >>>>>
> > >>>>> DEFINE_EVENT(preemptirq_template, irq_enable,
> > >>>>> TP_PROTO(unsigned long ip, unsigned long parent_ip),
> > >>>>> TP_ARGS(ip, parent_ip));
> > >>>>> #else
> > >>>>> #define trace_irq_enable(...)
> > >>>>> #define trace_irq_disable(...)
> > >>>>> #define trace_irq_enable_rcuidle(...)
> > >>>>> #define trace_irq_disable_rcuidle(...)
> > >>>>> #endif
> > >>>>>
> > >>>>> But the DEFINE_EVENT doesn't have the "_rcuidle" part. And that's
> > >>>>> where I got lost in the macro maze. I looked at the gcc asm output,
> > >>>>> and there is, indeed:
> > >>>>
> > >>>> DEFINE_EVENT
> > >>>> DECLARE_TRACE
> > >>>> __DECLARE_TRACE
> > >>>> __DECLARE_TRACE_RCU
> > >>>> static inline void trace_##name##_rcuidle(proto)
> > >>>> __DO_TRACE
> > >>>> if (rcuidle)
> > >>>> ....
> > >>>>
> > >>>>> But I also don't see why this is any different from any other tracepoint.
> > >>>>
> > >>>> Indeed. I took a wrong turn at some point in the macro jungle :)
> > >>>>
> > >>>> So tracing itself is fine, but then if you have probes or bpf programs
> > >>>> attached to a tracepoint these use rcu_read_lock()/unlock() which is
> > >>>>>> obviously wrong in rcuidle context.
> > >>>
> > >>> Definitely, any such code needs to use tricks similar to that of the
> > >>> tracing code. Or instead use something like SRCU, which is OK with
> > >>> readers from idle. Or use something like Steve Rostedt's workqueue-based
> > >>> approach, though please be very careful with this latter, lest the
> > >>> battery-powered embedded guys come after you for waking up idle CPUs
> > >>> too often. ;-)
> > >>
> > >> Are we okay if we somehow ensure that all the entry code before
> > >> enter_from_user_mode() only does rcuidle tracing variants and has
> > >> kprobes off? Including for BPF use cases?
> > >
> > > That would work, though if BPF used SRCU instead of RCU, this would
> > > be unnecessary. Sadly, SRCU has full memory barriers in each of
> > > srcu_read_lock() and srcu_read_unlock(), but we are working on it.
> > > (As always, no promises!)
> > >
> > >> It would be *really* nice if we could statically verify this, as has
> > >> been mentioned elsewhere in the thread. It would also probably be
> > >> good enough if we could do it at runtime. Maybe with lockdep on, we
> > >> verify rcu state in tracepoints even if the tracepoint isn't active?
> > >> And we could plausibly have some widget that could inject something
> > >> into *every* kprobeable function to check rcu state.

I'm still not clear on this point: should I check rcuidle in the kprobes
int3 handler or in the jump-optimized handler? (The int3 handler runs in irq
context, so it cannot use SRCU anyway...) Maybe I missed the point.

> > >
> > > Or just have at least one testing step that activates all tracepoints,
> > > but with lockdep enabled?
> >
> > Also kprobe.
> >
> > I don’t suppose we could make notrace imply nokprobe. Then all
> > kprobeable functions would also have entry/exit tracepoints, right?
>
> There was some code before that prevented a kprobe from being allowed
> in something that was not in the ftrace mcount table (which would make
> this happen). But I think that was changed because it was too
> restrictive.

Do you mean CONFIG_KPROBE_EVENTS_ON_NOTRACE? By default, notrace now
implies noprobe too. With CONFIG_KPROBE_EVENTS_ON_NOTRACE=y, we can put
kprobe events on notrace functions. So if you are unsure, we can put a
kprobe on those functions and see what happens.
(Note that this applies only to kprobe events, not to kprobes itself.)

It is actually very restrictive, but it is hard to make a whitelist
manually, especially when CC_FLAGS_FTRACE is removed while building.

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-03-02 08:11:31

by Thomas Gleixner

Subject: Re: [patch 4/8] x86/entry: Move irq tracing on syscall entry to C-code

Andy Lutomirski <[email protected]> writes:
> On Sun, Mar 1, 2020 at 10:26 AM Paul E. McKenney <[email protected]> wrote:
>> > So tracing itself is fine, but then if you have probes or bpf programs
>> > attached to a tracepoint these use rcu_read_lock()/unlock() which is
>> > obviously wrong in rcuidle context.
>>
>> Definitely, any such code needs to use tricks similar to that of the
>> tracing code. Or instead use something like SRCU, which is OK with
>> readers from idle. Or use something like Steve Rostedt's workqueue-based
>> approach, though please be very careful with this latter, lest the
>> battery-powered embedded guys come after you for waking up idle CPUs
>> too often. ;-)
>>
>
> Are we okay if we somehow ensure that all the entry code before
> enter_from_user_mode() only does rcuidle tracing variants and has
> kprobes off? Including for BPF use cases?

I think this is the right thing to do. The only requirement we have
_before_ enter_from_user_mode() is to tell lockdep that interrupts are
off. There is not even the need for a real tracepoint IMO. The fact that
the lockdep call is hidden in that tracepoint is just an implementation
detail.

That would clearly set the rules straight: anything in the low-level entry
code before enter_from_user_mode() returns is neither probeable nor
traceable.

I know that some people will argue that this is too restrictive in terms
of instrumentation, but OTOH the whole low level entry code has to be
excluded from instrumentation anyway, so having a dozen instructions
more excluded does not matter at all. Keep it simple!

> It would be *really* nice if we could statically verify this, as has
> been mentioned elsewhere in the thread. It would also probably be
> good enough if we could do it at runtime. Maybe with lockdep on, we
> verify rcu state in tracepoints even if the tracepoint isn't active?
> And we could plausibly have some widget that could inject something
> into *every* kprobeable function to check rcu state.

That surely would be useful.

Thanks,

tglx