2020-03-08 23:26:56

by Thomas Gleixner

[permalink] [raw]
Subject: [patch part-II V2 09/13] x86/entry/common: Split hardirq tracing into lockdep and ftrace parts

trace_hardirqs_off() is in fact a tracepoint which can be utilized by BPF,
which is unsafe before calling enter_from_user_mode(), which in turn
invokes context tracking. trace_hardirqs_off() also invokes
lockdep_hardirqs_off() under the hood.

OTOH lockdep needs to know about the interrupts disabled state before
enter_from_user_mode(). lockdep_hardirqs_off() is safe to call at this
point.

Split it so lockdep knows about the state and invoke the tracepoint after
the context is set straight.

Even if the functions attached to a tracepoint would all be safe to be
called in rcuidle having it split up is still giving a performance
advantage because rcu_read_lock_sched() is avoiding the whole dance of:

scru_read_lock();
rcu_irq_enter_irqson();
...
rcu_irq_exit_irqson();
scru_read_unlock();

around the tracepoint function invocation just to have the C entry points
of syscalls call enter_from_user_mode() right after the above dance.

Signed-off-by: Thomas Gleixner <[email protected]>
---
V2: New patch
---
arch/x86/entry/common.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -60,10 +60,19 @@ static __always_inline void syscall_entr
{
/*
* Usermode is traced as interrupts enabled, but the syscall entry
- * mechanisms disable interrupts. Tell the tracer.
+ * mechanisms disable interrupts. Tell lockdep before calling
+ * enter_from_user_mode(). This is safe vs. RCU while the
+ * tracepoint is not.
*/
- trace_hardirqs_off();
+ lockdep_hardirqs_on(CALLER_ADDR0);
+
enter_from_user_mode();
+
+ /*
+ * Tell the tracer about the irq state as well before enabling
+ * interrupts.
+ */
+ __trace_hardirqs_off();
local_irq_enable();
}



2020-03-10 11:21:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [patch part-II V2 09/13] x86/entry/common: Split hardirq tracing into lockdep and ftrace parts

On Sun, Mar 08, 2020 at 11:24:08PM +0100, Thomas Gleixner wrote:
> trace_hardirqs_off() is in fact a tracepoint which can be utilized by BPF,
> which is unsafe before calling enter_from_user_mode(), which in turn
> invokes context tracking. trace_hardirqs_off() also invokes
> lockdep_hardirqs_off() under the hood.
>
> OTOH lockdep needs to know about the interrupts disabled state before
> enter_from_user_mode(). lockdep_hardirqs_off() is safe to call at this
> point.
>
> Split it so lockdep knows about the state and invoke the tracepoint after
> the context is set straight.
>
> Even if the functions attached to a tracepoint would all be safe to be
> called in rcuidle having it split up is still giving a performance
> advantage because rcu_read_lock_sched() is avoiding the whole dance of:
>
> scru_read_lock();
> rcu_irq_enter_irqson();
> ...
> rcu_irq_exit_irqson();
> scru_read_unlock();
>
> around the tracepoint function invocation just to have the C entry points
> of syscalls call enter_from_user_mode() right after the above dance.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> V2: New patch
> ---
> arch/x86/entry/common.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -60,10 +60,19 @@ static __always_inline void syscall_entr
> {
> /*
> * Usermode is traced as interrupts enabled, but the syscall entry
> - * mechanisms disable interrupts. Tell the tracer.
> + * mechanisms disable interrupts. Tell lockdep before calling
> + * enter_from_user_mode(). This is safe vs. RCU while the
> + * tracepoint is not.
> */
> - trace_hardirqs_off();
> + lockdep_hardirqs_on(CALLER_ADDR0);
> +
> enter_from_user_mode();
> +
> + /*
> + * Tell the tracer about the irq state as well before enabling
> + * interrupts.
> + */
> + __trace_hardirqs_off();

I wonder if those "__" variants should be named something else to
denote better the difference between __trace_hardirqs_{on,off} and
trace_hardirqs_{on,off}. Latter does the _rcuidle variant and lockdep
annotation but

trace_hardirqs_{on,off}_rcuidle_lockdep()

sounds yuck.

Maybe lockdep_trace_hardirqs_{on,off}()...

Blergh, I can't think of a good name ATM.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-03-10 13:31:59

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [patch part-II V2 09/13] x86/entry/common: Split hardirq tracing into lockdep and ftrace parts


On 3/8/20 11:24 PM, Thomas Gleixner wrote:
> trace_hardirqs_off() is in fact a tracepoint which can be utilized by BPF,
> which is unsafe before calling enter_from_user_mode(), which in turn
> invokes context tracking. trace_hardirqs_off() also invokes
> lockdep_hardirqs_off() under the hood.
>
> OTOH lockdep needs to know about the interrupts disabled state before
> enter_from_user_mode(). lockdep_hardirqs_off() is safe to call at this
> point.
>
> Split it so lockdep knows about the state and invoke the tracepoint after
> the context is set straight.
>
> Even if the functions attached to a tracepoint would all be safe to be
> called in rcuidle having it split up is still giving a performance
> advantage because rcu_read_lock_sched() is avoiding the whole dance of:
>
> scru_read_lock();
> rcu_irq_enter_irqson();
> ...
> rcu_irq_exit_irqson();
> scru_read_unlock();
>
> around the tracepoint function invocation just to have the C entry points
> of syscalls call enter_from_user_mode() right after the above dance.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> V2: New patch
> ---
> arch/x86/entry/common.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)


Reviewed-by: Alexandre Chartre <[email protected]>

alex.


> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -60,10 +60,19 @@ static __always_inline void syscall_entr
> {
> /*
> * Usermode is traced as interrupts enabled, but the syscall entry
> - * mechanisms disable interrupts. Tell the tracer.
> + * mechanisms disable interrupts. Tell lockdep before calling
> + * enter_from_user_mode(). This is safe vs. RCU while the
> + * tracepoint is not.
> */
> - trace_hardirqs_off();
> + lockdep_hardirqs_on(CALLER_ADDR0);
> +
> enter_from_user_mode();
> +
> + /*
> + * Tell the tracer about the irq state as well before enabling
> + * interrupts.
> + */
> + __trace_hardirqs_off();
> local_irq_enable();
> }
>
>

2020-03-10 13:43:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch part-II V2 09/13] x86/entry/common: Split hardirq tracing into lockdep and ftrace parts

On Tue, 10 Mar 2020 12:20:45 +0100
Borislav Petkov <[email protected]> wrote:

> > +
> > + /*
> > + * Tell the tracer about the irq state as well before enabling
> > + * interrupts.
> > + */
> > + __trace_hardirqs_off();
>
> I wonder if those "__" variants should be named something else to
> denote better the difference between __trace_hardirqs_{on,off} and
> trace_hardirqs_{on,off}. Latter does the _rcuidle variant and lockdep
> annotation but
>
> trace_hardirqs_{on,off}_rcuidle_lockdep()
>
> sounds yuck.
>
> Maybe lockdep_trace_hardirqs_{on,off}()...
>
> Blergh, I can't think of a good name ATM.

Kernel developers are not good at naming ;-) This is one of the original
pieces of code that came in with the original addition of tracing, where we
had the "Ingo notation" of something like:

trace() {
[..]
_trace();
[..]
}

_trace() {
[..]
__trace();
[..]
}

__trace() {
[..]
___trace();
[..]
}

___trace() {
[..]
____trace();
[..]
}

____trace() {
[..]
_____trace();
[..]
}

_____trace() {
[..]
}

-- Steve

2020-03-23 09:09:34

by Chen, Rong A

[permalink] [raw]
Subject: [x86/entry/common] bae397f6e7: WARNING:at_kernel/sched/cputime.c:#get_vtime_delta

FYI, we noticed the following commit (built with gcc-7):

commit: bae397f6e7d17448fa1a8b87c1d58e6dbb77f42a ("[patch part-II V2 09/13] x86/entry/common: Split hardirq tracing into lockdep and ftrace parts")
url: https://github.com/0day-ci/linux/commits/Thomas-Gleixner/x86-entry-Consolidation-Part-II-syscalls/20200309-072900


in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+----------------------------------------------------+------------+------------+
| | e03ca08c94 | bae397f6e7 |
+----------------------------------------------------+------------+------------+
| boot_successes | 0 | 0 |
| boot_failures | 12 | 12 |
| BUG:workqueue_lockup-pool | 7 | 6 |
| BUG:kernel_in_stage | 12 | 12 |
| BUG:soft_lockup-CPU##stuck_for#s![dma-fence:#:#] | 3 | 1 |
| RIP:lock_is_held_type | 3 | |
| Kernel_panic-not_syncing:softlockup:hung_tasks | 3 | 1 |
| WARNING:at_kernel/sched/cputime.c:#get_vtime_delta | 0 | 11 |
| RIP:get_vtime_delta | 0 | 11 |
| RIP:kthread_should_stop | 0 | 1 |
+----------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 15.999846] WARNING: CPU: 0 PID: 199 at kernel/sched/cputime.c:259 get_vtime_delta+0x62/0xe0
[ 15.999846] Modules linked in:
[ 15.999846] CPU: 0 PID: 199 Comm: modprobe Not tainted 5.5.0-rc5-00015-gbae397f6e7d17 #1
[ 15.999846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 15.999846] RIP: 0010:get_vtime_delta+0x62/0xe0
[ 15.999846] Code: 85 f6 75 28 8b 88 20 07 00 00 85 c9 74 1e 80 3d 87 bd b0 01 00 75 15 48 c7 c7 60 79 53 b5 c6 05 77 bd b0 01 01 e8 be 41 fd ff <0f> 0b e9 02 00 00 00 eb 31 31 ff e8 ee b8 f1 ff 66 90 48 2b 05 5d
[ 15.999846] RSP: 0018:ffff8dcad803fea0 EFLAGS: 00010086
[ 15.999846] RAX: 000000000000001d RBX: 000000000000080c RCX: 0000000000000000
[ 15.999846] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 15.999846] RBP: ffff8dcb36fea938 R08: 0000000000000001 R09: 0000000000000001
[ 15.999846] R10: 0000000000000001 R11: ffff8dcad803fd50 R12: ffff8dcb36fea938
[ 15.999846] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 15.999846] FS: 0000000000000000(0000) GS:ffffffffb5851000(0000) knlGS:0000000000000000
[ 15.999846] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.999846] CR2: 00007fdf8f4e10e8 CR3: 0000000236ff4000 CR4: 00000000000406f0
[ 15.999846] Call Trace:
[ 15.999846] vtime_user_exit+0x3f/0x90
[ 15.999846] ? __context_tracking_exit+0x42/0xe0
[ 15.999846] __context_tracking_exit+0x42/0xe0
[ 15.999846] do_syscall_64+0x4a/0x5c0
[ 15.999846] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 15.999846] RIP: 0033:0x7fdf8f2d69cc
[ 15.999846] Code: f3 c3 49 8d 50 01 41 c6 00 00 eb f1 66 90 50 58 bf 7f 00 00 00 48 83 ec 08 e8 30 0e 00 00 b9 0c 00 00 00 48 89 fe 89 c8 0f 05 <48> 3d 00 f0 ff ff 48 89 c2 77 29 48 89 05 ca a6 20 00 31 c0 48 39
[ 15.999846] RSP: 002b:00007ffe3de63fd8 EFLAGS: 00000246 ORIG_RAX: 000000000000000c
[ 15.999846] RAX: ffffffffffffffda RBX: 0000000000400040 RCX: 00007fdf8f2d69cc
[ 15.999846] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 15.999846] RBP: 0000000000000009 R08: 000000000000037f R09: 0000000000000064
[ 15.999846] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe3de64259
[ 15.999846] R13: 0000000000000000 R14: 0000000000402625 R15: 0000000000001000
[ 15.999846] irq event stamp: 272
[ 15.999846] hardirqs last enabled at (271): [<ffffffffb3e01928>] trace_hardirqs_on_thunk+0x1a/0x1c
[ 15.999846] hardirqs last disabled at (272): [<ffffffffb3f49b77>] vprintk_emit+0xf7/0x320
[ 15.999846] softirqs last enabled at (114): [<ffffffffb4e00343>] __do_softirq+0x343/0x37d
[ 15.999846] softirqs last disabled at (109): [<ffffffffb3f03c20>] irq_exit+0x50/0x80
[ 15.999846] ---[ end trace 7b8a309e4ee96050 ]---


To reproduce:

# build kernel
cd linux
cp config-5.5.0-rc5-00015-gbae397f6e7d17 .config
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Rong Chen


Attachments:
(No filename) (5.02 kB)
config-5.5.0-rc5-00015-gbae397f6e7d17 (144.17 kB)
job-script (4.41 kB)
dmesg.xz (11.48 kB)
Download all attachments