2014-12-03 18:19:36

by Dave Jones

[permalink] [raw]
Subject: audit: rcu_read_lock() used illegally while idle

I'm not sure why this only just started complaining, because this code
hasn't changed in years, but I don't recall seeing this before.
This gets spewed during bootup since I put 3.18-rc7 on my firewall.
Previously it was running rc4 where I didn't see this.

Did something in RCU change recently ?

===============================
[ INFO: suspicious RCU usage. ]
3.18.0-rc7+ #93 Not tainted
-------------------------------
include/linux/rcupdate.h:883 rcu_read_lock() used illegally while idle!

other info that might help us debug this:


RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
1 lock held by systemd-sysctl/557:
#0: (rcu_read_lock){......}, at: [<ffffffff890f1320>] audit_filter_type+0x0/0x240

stack backtrace:
CPU: 0 PID: 557 Comm: systemd-sysctl Not tainted 3.18.0-rc7+ #93
0000000000000000 0000000063c50efc ffff88021f51fd28 ffffffff895abec3
0000000000000000 ffff880234758000 ffff88021f51fd58 ffffffff890a5280
ffff880221d9c548 00007fd651f320d0 0000000000000514 00000000ffff9012
Call Trace:
[<ffffffff895abec3>] dump_stack+0x4e/0x68
[<ffffffff890a5280>] lockdep_rcu_suspicious+0xf0/0x110
[<ffffffff890f150e>] audit_filter_type+0x1ee/0x240
[<ffffffff890f1320>] ? audit_filter_user+0x340/0x340
[<ffffffff890ecd39>] audit_log_start+0x49/0x4a0
[<ffffffff8908f4f5>] ? local_clock+0x25/0x30
[<ffffffff890a397f>] ? lock_release_holdtime.part.30+0xf/0x190
[<ffffffff890f1973>] audit_log_exit+0x53/0xcf0
[<ffffffff8908f4f5>] ? local_clock+0x25/0x30
[<ffffffff890a397f>] ? lock_release_holdtime.part.30+0xf/0x190
[<ffffffff895b4f0c>] ? sysret_signal+0x5/0x43
[<ffffffff890f4a65>] __audit_syscall_exit+0x245/0x2a0
[<ffffffff895b4f61>] sysret_audit+0x17/0x21


2014-12-03 19:29:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: audit: rcu_read_lock() used illegally while idle

On Wed, Dec 03, 2014 at 01:19:22PM -0500, Dave Jones wrote:
> I'm not sure why this only just started complaining, because this code
> hasn't changed in years, but I don't recall seeing this before.
> This gets spewed during bootup since I put 3.18-rc7 on my firewall.
> Previously it was running rc4 where I didn't see this.

That is quite strange. I wonder if NO_HZ_FULL has told RCU that the
CPU is idle before the sysret_audit hook is called.

Adding Frederic for his thoughts on this.

> Did something in RCU change recently ?

Not since -rc1, as far as I know, anyway.

Thanx, Paul

> ===============================
> [ INFO: suspicious RCU usage. ]
> 3.18.0-rc7+ #93 Not tainted
> -------------------------------
> include/linux/rcupdate.h:883 rcu_read_lock() used illegally while idle!
>
> other info that might help us debug this:
>
>
> RCU used illegally from idle CPU!
> rcu_scheduler_active = 1, debug_locks = 0
> RCU used illegally from extended quiescent state!
> 1 lock held by systemd-sysctl/557:
> #0: (rcu_read_lock){......}, at: [<ffffffff890f1320>] audit_filter_type+0x0/0x240
>
> stack backtrace:
> CPU: 0 PID: 557 Comm: systemd-sysctl Not tainted 3.18.0-rc7+ #93
> 0000000000000000 0000000063c50efc ffff88021f51fd28 ffffffff895abec3
> 0000000000000000 ffff880234758000 ffff88021f51fd58 ffffffff890a5280
> ffff880221d9c548 00007fd651f320d0 0000000000000514 00000000ffff9012
> Call Trace:
> [<ffffffff895abec3>] dump_stack+0x4e/0x68
> [<ffffffff890a5280>] lockdep_rcu_suspicious+0xf0/0x110
> [<ffffffff890f150e>] audit_filter_type+0x1ee/0x240
> [<ffffffff890f1320>] ? audit_filter_user+0x340/0x340
> [<ffffffff890ecd39>] audit_log_start+0x49/0x4a0
> [<ffffffff8908f4f5>] ? local_clock+0x25/0x30
> [<ffffffff890a397f>] ? lock_release_holdtime.part.30+0xf/0x190
> [<ffffffff890f1973>] audit_log_exit+0x53/0xcf0
> [<ffffffff8908f4f5>] ? local_clock+0x25/0x30
> [<ffffffff890a397f>] ? lock_release_holdtime.part.30+0xf/0x190
> [<ffffffff895b4f0c>] ? sysret_signal+0x5/0x43
> [<ffffffff890f4a65>] __audit_syscall_exit+0x245/0x2a0
> [<ffffffff895b4f61>] sysret_audit+0x17/0x21
>

2014-12-03 20:07:01

by Andy Lutomirski

[permalink] [raw]
Subject: Re: audit: rcu_read_lock() used illegally while idle

On 12/03/2014 11:29 AM, Paul E. McKenney wrote:
> On Wed, Dec 03, 2014 at 01:19:22PM -0500, Dave Jones wrote:
>> I'm not sure why this only just started complaining, because this code
>> hasn't changed in years, but I don't recall seeing this before.
>> This gets spewed during bootup since I put 3.18-rc7 on my firewall.
>> Previously it was running rc4 where I didn't see this.
>
> That is quite strange. I wonder if NO_HZ_FULL has told RCU that the
> CPU is idle before the sysret_audit hook is called.
>
> Adding Frederic for his thoughts on this.
>
>> Did something in RCU change recently ?
>
> Not since -rc1, as far as I know, anyway.

There was a cute little bug that probably prevented sysret_audit from
being called if TIF_NOHZ was set:

b5e212a3051b x86, syscall: Fix _TIF_NOHZ handling in
syscall_trace_enter_phase1

IOW, between 3.18-rc1 and 3.18-rc6 (approximately), sysret_audit would
never happen in NO_HZ_FULL mode due to that bug.

Looking at the current code, I'm not sure I understand why we don't just
infinite-loop. Hmm, maybe we do -- could this be the lockup that no one
understands yet? (I doubt it, but maybe.)

The loop would be:

sysret_check:

Load ti.flags into edx (assume that _TIF_SYSCALL_AUDIT and some other
_TIF_ALLWORK_MASK but is set).

Jump to sysret_careful. Eventually get to sysret_signal. Notice that
TIF_SYSCALL_AUDIT is set in edx. Jump to sysret_audit.

sysret_audit then clears the _TIF_SYSCALL_AUDIT bit from edi. edi is
not edx. Hmm.

Jump to sysret_check, where edx is the same as before. Repeat.

This doesn't explain how we end up screwing up context tracking. But I
don't understand why we don't execute the audit exit hook *twice* if
TIF_NOHZ is set.

I have patches to delete this whole fscking sysret fast but not really
fast path. I'll resend them for 3.19. In the mean time, can you test
this patch by itself:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=1072a16a8d4ad1b11b8062f76e3236b9771b0fb6

Applying just that patch will result in a fairly large performance hit
if auditing is on, but given the shear number of bugs that the syscall
audit hooks seem to cause, I'm think it may be a reasonable tradeoff. I
don't really like it as a last-minute fix, though. (To get the
performance back, you need the two patches before it, but those are
*definitely* not last-minute material.)


On another note, all those TIF_FOO_BAR_BAZ masks are incomprehensible
and probably wrong in various interesting ways.

--Andy

>
> Thanx, Paul
>
>> ===============================
>> [ INFO: suspicious RCU usage. ]
>> 3.18.0-rc7+ #93 Not tainted
>> -------------------------------
>> include/linux/rcupdate.h:883 rcu_read_lock() used illegally while idle!
>>
>> other info that might help us debug this:
>>
>>
>> RCU used illegally from idle CPU!
>> rcu_scheduler_active = 1, debug_locks = 0
>> RCU used illegally from extended quiescent state!
>> 1 lock held by systemd-sysctl/557:
>> #0: (rcu_read_lock){......}, at: [<ffffffff890f1320>] audit_filter_type+0x0/0x240
>>
>> stack backtrace:
>> CPU: 0 PID: 557 Comm: systemd-sysctl Not tainted 3.18.0-rc7+ #93
>> 0000000000000000 0000000063c50efc ffff88021f51fd28 ffffffff895abec3
>> 0000000000000000 ffff880234758000 ffff88021f51fd58 ffffffff890a5280
>> ffff880221d9c548 00007fd651f320d0 0000000000000514 00000000ffff9012
>> Call Trace:
>> [<ffffffff895abec3>] dump_stack+0x4e/0x68
>> [<ffffffff890a5280>] lockdep_rcu_suspicious+0xf0/0x110
>> [<ffffffff890f150e>] audit_filter_type+0x1ee/0x240
>> [<ffffffff890f1320>] ? audit_filter_user+0x340/0x340
>> [<ffffffff890ecd39>] audit_log_start+0x49/0x4a0
>> [<ffffffff8908f4f5>] ? local_clock+0x25/0x30
>> [<ffffffff890a397f>] ? lock_release_holdtime.part.30+0xf/0x190
>> [<ffffffff890f1973>] audit_log_exit+0x53/0xcf0
>> [<ffffffff8908f4f5>] ? local_clock+0x25/0x30
>> [<ffffffff890a397f>] ? lock_release_holdtime.part.30+0xf/0x190
>> [<ffffffff895b4f0c>] ? sysret_signal+0x5/0x43
>> [<ffffffff890f4a65>] __audit_syscall_exit+0x245/0x2a0
>> [<ffffffff895b4f61>] sysret_audit+0x17/0x21
>>
>

2014-12-03 20:21:04

by Dave Jones

[permalink] [raw]
Subject: Re: audit: rcu_read_lock() used illegally while idle

On Wed, Dec 03, 2014 at 12:06:56PM -0800, Andy Lutomirski wrote:

> >> Did something in RCU change recently ?
> >
> > Not since -rc1, as far as I know, anyway.
>
> I have patches to delete this whole fscking sysret fast but not really
> fast path. I'll resend them for 3.19. In the mean time, can you test
> this patch by itself:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=1072a16a8d4ad1b11b8062f76e3236b9771b0fb6

With that applied, I no longer see the trace.

thanks,

Dave

2014-12-03 20:39:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: audit: rcu_read_lock() used illegally while idle

On Wed, Dec 3, 2014 at 12:19 PM, Dave Jones <[email protected]> wrote:
> On Wed, Dec 03, 2014 at 12:06:56PM -0800, Andy Lutomirski wrote:
>
> > >> Did something in RCU change recently ?
> > >
> > > Not since -rc1, as far as I know, anyway.
> >
> > I have patches to delete this whole fscking sysret fast but not really
> > fast path. I'll resend them for 3.19. In the mean time, can you test
> > this patch by itself:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=1072a16a8d4ad1b11b8062f76e3236b9771b0fb6
>
> With that applied, I no longer see the trace.
>

Thanks.

The bug is that SCHEDULE_USER in sysret_schedule is wrong. I'd
suggest adding a warning to schedule_user that fires if context
tracking thinks we're already in the kernel.

FWIW, I think that the rest of the SCHEDULE_USER calls may be wrong,
too. In particular, the one in int_careful looks wrong as well, so I
don't see why my patch made a difference if I'm right.

Frédéric, any ideas here? As a stopgap measure, making SCHEDULE_USER
restore the previous state might make sense for 3.18.

--Andy

> thanks,
>
> Dave
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-03 22:08:41

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: audit: rcu_read_lock() used illegally while idle

On Wed, Dec 03, 2014 at 12:38:36PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 3, 2014 at 12:19 PM, Dave Jones <[email protected]> wrote:
> > On Wed, Dec 03, 2014 at 12:06:56PM -0800, Andy Lutomirski wrote:
> >
> > > >> Did something in RCU change recently ?
> > > >
> > > > Not since -rc1, as far as I know, anyway.
> > >
> > > I have patches to delete this whole fscking sysret fast but not really
> > > fast path. I'll resend them for 3.19. In the mean time, can you test
> > > this patch by itself:
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=1072a16a8d4ad1b11b8062f76e3236b9771b0fb6
> >
> > With that applied, I no longer see the trace.
> >
>
> Thanks.
>
> The bug is that SCHEDULE_USER in sysret_schedule is wrong. I'd
> suggest adding a warning to schedule_user that fires if context
> tracking thinks we're already in the kernel.
>
> FWIW, I think that the rest of the SCHEDULE_USER calls may be wrong,
> too. In particular, the one in int_careful looks wrong as well, so I
> don't see why my patch made a difference if I'm right.
>
> Fr?d?ric, any ideas here? As a stopgap measure, making SCHEDULE_USER
> restore the previous state might make sense for 3.18.

I don't know. It's possible that something went wrong with the recent entry_64.S
and ptrace.c rework.

Previously we expected to set context tracking to user state from syscall_trace_exit()
and to kernel state from syscall_trace_enter(). And if anything using RCU
was called between syscall_trace_exit() and the actual return to userspace, the code
had to be wrapped between user_exit() *code* user_enter().

So it looked like this:


syscall {
//enter kernel
syscall_trace_enter() {
user_exit();
}

syscall()

syscall_trace_enter() {
user_enter();
}

while (test_thread_flag(TIF_EXIT_WORK)) {
if (need_resched()) {
schedule_user() {
user_exit();
schedule()
user_enter();
}
}

if ( need signal ) {
do_notify_resume() {
user_exit()
handle signal and stuff
user_enter()
}
}
}
}

This is suboptimal but it doesn't impact the syscall fastpath
and it's correct from cputime accounting and RCU point of views.

Now maybe the recent logic rework broke the above assumptions?

2014-12-03 22:13:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: audit: rcu_read_lock() used illegally while idle

On Wed, Dec 3, 2014 at 2:08 PM, Frederic Weisbecker <[email protected]> wrote:
> On Wed, Dec 03, 2014 at 12:38:36PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 3, 2014 at 12:19 PM, Dave Jones <[email protected]> wrote:
>> > On Wed, Dec 03, 2014 at 12:06:56PM -0800, Andy Lutomirski wrote:
>> >
>> > > >> Did something in RCU change recently ?
>> > > >
>> > > > Not since -rc1, as far as I know, anyway.
>> > >
>> > > I have patches to delete this whole fscking sysret fast but not really
>> > > fast path. I'll resend them for 3.19. In the mean time, can you test
>> > > this patch by itself:
>> > >
>> > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=1072a16a8d4ad1b11b8062f76e3236b9771b0fb6
>> >
>> > With that applied, I no longer see the trace.
>> >
>>
>> Thanks.
>>
>> The bug is that SCHEDULE_USER in sysret_schedule is wrong. I'd
>> suggest adding a warning to schedule_user that fires if context
>> tracking thinks we're already in the kernel.
>>
>> FWIW, I think that the rest of the SCHEDULE_USER calls may be wrong,
>> too. In particular, the one in int_careful looks wrong as well, so I
>> don't see why my patch made a difference if I'm right.
>>
>> Frédéric, any ideas here? As a stopgap measure, making SCHEDULE_USER
>> restore the previous state might make sense for 3.18.
>
> I don't know. It's possible that something went wrong with the recent entry_64.S
> and ptrace.c rework.
>
> Previously we expected to set context tracking to user state from syscall_trace_exit()
> and to kernel state from syscall_trace_enter(). And if anything using RCU
> was called between syscall_trace_exit() and the actual return to userspace, the code
> had to be wrapped between user_exit() *code* user_enter().
>
> So it looked like this:
>
>
> syscall {
> //enter kernel
> syscall_trace_enter() {
> user_exit();
> }
>
> syscall()
>
> syscall_trace_enter() {

Do you mean syscall_trace_leave()? But syscall_trace_leave isn't called here...

> user_enter();
> }
>
> while (test_thread_flag(TIF_EXIT_WORK)) {
> if (need_resched()) {
> schedule_user() {
> user_exit();
> schedule()
> user_enter();
> }
> }
>
> if ( need signal ) {
> do_notify_resume() {
> user_exit()
> handle signal and stuff
> user_enter()
> }
> }

... it's called hereabouts or so.

> }
> }
>
> This is suboptimal but it doesn't impact the syscall fastpath
> and it's correct from cputime accounting and RCU point of views.
>
> Now maybe the recent logic rework broke the above assumptions?

The big rework was entry, not exit, so I don't see the issue.

In any case, might it make sense to add warnings to user_exit and
user_enter to ensure that they're called in the state in which they
should be called?

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-03 23:18:56

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH] context_tracking: Restore previous state in schedule_user

It appears that some SCHEDULE_USER (asm for schedule_user) callers
in arch/x86/kernel/entry_64.S are called from RCU kernel context,
and schedule_user will return in RCU user context. This causes RCU
warnings and possible failures.

This is intended to be a minimal fix suitable for 3.18.

Reported-by: Dave Jones <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Frédéric Weisbecker <[email protected]>
Cc: Paul McKenney <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---

Hi all-

This is intended to be a suitable last-minute fix for the RCU issue that
Dave saw.

Dave, can you confirm that this fixes it?

Frédéric, can you confirm that you think that this will have no effect
on correct callers of schedule_user and that will do the right thing
for incorrect callers of schedule_user?

I don't like the x86 asm that calls this at all, and I don't really
like the fragility of the mechanism is general, but I think that this
improves the situation enough to avoid problems in the short term.

With the obvious warning added, I get:

[ 0.751022] ------------[ cut here ]------------
[ 0.751937] WARNING: CPU: 0 PID: 72 at kernel/sched/core.c:2883 schedule_user+0xcf/0xe0()
[ 0.753477] Modules linked in:
[ 0.754089] CPU: 0 PID: 72 Comm: mount Not tainted 3.18.0-rc7+ #653
[ 0.755258] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 0.757655] 0000000000000009 ffff880005c13f00 ffffffff81741dca ffff8800069f5a50
[ 0.759228] 0000000000000000 ffff880005c13f40 ffffffff8108e781 0000000000000246
[ 0.760758] 0000000000000000 00007fff970441c8 00007fff97043fd0 00007f67794ebcc8
[ 0.762294] Call Trace:
[ 0.762775] [<ffffffff81741dca>] dump_stack+0x46/0x58
[ 0.763739] [<ffffffff8108e781>] warn_slowpath_common+0x81/0xa0
[ 0.764865] [<ffffffff8108e85a>] warn_slowpath_null+0x1a/0x20
[ 0.765958] [<ffffffff8174565f>] schedule_user+0xcf/0xe0
[ 0.766974] [<ffffffff8174ae69>] sysret_careful+0x19/0x1c
[ 0.768011] ---[ end trace 329f34db2b3be966 ]---

So, yes, we have a bug, and this could cause any number of strange
problems.

kernel/sched/core.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24beb9bb4c3e..39d9d95331b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2874,10 +2874,14 @@ asmlinkage __visible void __sched schedule_user(void)
* or we have been woken up remotely but the IPI has not yet arrived,
* we haven't yet exited the RCU idle mode. Do it here manually until
* we find a better solution.
+ *
+ * NB: There are buggy callers of this function. Ideally we
+ * should warn if prev_state != IN_USER, but that will trigger
+ * to frequently to make sense yet.
*/
- user_exit();
+ enum ctx_state prev_state = exception_enter();
schedule();
- user_enter();
+ exception_exit(prev_state);
}
#endif

--
1.9.3

2014-12-03 23:26:25

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] context_tracking: Restore previous state in schedule_user

[adding Dave -- apparently git doesn't pick up Reported-By]

--Andy

On Wed, Dec 3, 2014 at 3:18 PM, Andy Lutomirski <[email protected]> wrote:
> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> and schedule_user will return in RCU user context. This causes RCU
> warnings and possible failures.
>
> This is intended to be a minimal fix suitable for 3.18.
>
> Reported-by: Dave Jones <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Frédéric Weisbecker <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
>
> Hi all-
>
> This is intended to be a suitable last-minute fix for the RCU issue that
> Dave saw.
>
> Dave, can you confirm that this fixes it?
>
> Frédéric, can you confirm that you think that this will have no effect
> on correct callers of schedule_user and that will do the right thing
> for incorrect callers of schedule_user?
>
> I don't like the x86 asm that calls this at all, and I don't really
> like the fragility of the mechanism is general, but I think that this
> improves the situation enough to avoid problems in the short term.
>
> With the obvious warning added, I get:
>
> [ 0.751022] ------------[ cut here ]------------
> [ 0.751937] WARNING: CPU: 0 PID: 72 at kernel/sched/core.c:2883 schedule_user+0xcf/0xe0()
> [ 0.753477] Modules linked in:
> [ 0.754089] CPU: 0 PID: 72 Comm: mount Not tainted 3.18.0-rc7+ #653
> [ 0.755258] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [ 0.757655] 0000000000000009 ffff880005c13f00 ffffffff81741dca ffff8800069f5a50
> [ 0.759228] 0000000000000000 ffff880005c13f40 ffffffff8108e781 0000000000000246
> [ 0.760758] 0000000000000000 00007fff970441c8 00007fff97043fd0 00007f67794ebcc8
> [ 0.762294] Call Trace:
> [ 0.762775] [<ffffffff81741dca>] dump_stack+0x46/0x58
> [ 0.763739] [<ffffffff8108e781>] warn_slowpath_common+0x81/0xa0
> [ 0.764865] [<ffffffff8108e85a>] warn_slowpath_null+0x1a/0x20
> [ 0.765958] [<ffffffff8174565f>] schedule_user+0xcf/0xe0
> [ 0.766974] [<ffffffff8174ae69>] sysret_careful+0x19/0x1c
> [ 0.768011] ---[ end trace 329f34db2b3be966 ]---
>
> So, yes, we have a bug, and this could cause any number of strange
> problems.
>
> kernel/sched/core.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 24beb9bb4c3e..39d9d95331b7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2874,10 +2874,14 @@ asmlinkage __visible void __sched schedule_user(void)
> * or we have been woken up remotely but the IPI has not yet arrived,
> * we haven't yet exited the RCU idle mode. Do it here manually until
> * we find a better solution.
> + *
> + * NB: There are buggy callers of this function. Ideally we
> + * should warn if prev_state != IN_USER, but that will trigger
> + * to frequently to make sense yet.
> */
> - user_exit();
> + enum ctx_state prev_state = exception_enter();
> schedule();
> - user_enter();
> + exception_exit(prev_state);
> }
> #endif
>
> --
> 1.9.3
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-03 23:31:40

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] context_tracking: Restore previous state in schedule_user

On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> and schedule_user will return in RCU user context. This causes RCU
> warnings and possible failures.
>
> This is intended to be a minimal fix suitable for 3.18.
>
> Reported-by: Dave Jones <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Fr?d?ric Weisbecker <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
>
> Hi all-
>
> This is intended to be a suitable last-minute fix for the RCU issue that
> Dave saw.
>
> Dave, can you confirm that this fixes it?

ACK, feel free to add my Tested-by:

Dave

2014-12-03 23:37:19

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2] context_tracking: Restore previous state in schedule_user

It appears that some SCHEDULE_USER (asm for schedule_user) callers
in arch/x86/kernel/entry_64.S are called from RCU kernel context,
and schedule_user will return in RCU user context. This causes RCU
warnings and possible failures.

This is intended to be a minimal fix suitable for 3.18.

Reported-and-tested-by: Dave Jones <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Frédéric Weisbecker <[email protected]>
Cc: Paul McKenney <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---

Hi all-

This is intended to be a suitable last-minute fix for the RCU issue that
Dave saw.

Dave, can you confirm that this fixes it?

Frédéric, can you confirm that you think that this will have no effect
on correct callers of schedule_user and that will do the right thing
for incorrect callers of schedule_user?

I don't like the x86 asm that calls this at all, and I don't really
like the fragility of the mechanism is general, but I think that this
improves the situation enough to avoid problems in the short term.

With the obvious warning added, I get:

[ 0.751022] ------------[ cut here ]------------
[ 0.751937] WARNING: CPU: 0 PID: 72 at kernel/sched/core.c:2883 schedule_user+0xcf/0xe0()
[ 0.753477] Modules linked in:
[ 0.754089] CPU: 0 PID: 72 Comm: mount Not tainted 3.18.0-rc7+ #653
[ 0.755258] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 0.757655] 0000000000000009 ffff880005c13f00 ffffffff81741dca ffff8800069f5a50
[ 0.759228] 0000000000000000 ffff880005c13f40 ffffffff8108e781 0000000000000246
[ 0.760758] 0000000000000000 00007fff970441c8 00007fff97043fd0 00007f67794ebcc8
[ 0.762294] Call Trace:
[ 0.762775] [<ffffffff81741dca>] dump_stack+0x46/0x58
[ 0.763739] [<ffffffff8108e781>] warn_slowpath_common+0x81/0xa0
[ 0.764865] [<ffffffff8108e85a>] warn_slowpath_null+0x1a/0x20
[ 0.765958] [<ffffffff8174565f>] schedule_user+0xcf/0xe0
[ 0.766974] [<ffffffff8174ae69>] sysret_careful+0x19/0x1c
[ 0.768011] ---[ end trace 329f34db2b3be966 ]---

So, yes, we have a bug, and this could cause any number of strange
problems.

Changes from v1:
- Added Dave's Tested-by.
- Fixed a comment typo.

kernel/sched/core.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24beb9bb4c3e..89e7283015a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2874,10 +2874,14 @@ asmlinkage __visible void __sched schedule_user(void)
* or we have been woken up remotely but the IPI has not yet arrived,
* we haven't yet exited the RCU idle mode. Do it here manually until
* we find a better solution.
+ *
+ * NB: There are buggy callers of this function. Ideally we
+ * should warn if prev_state != IN_USER, but that will trigger
+ * too frequently to make sense yet.
*/
- user_exit();
+ enum ctx_state prev_state = exception_enter();
schedule();
- user_enter();
+ exception_exit(prev_state);
}
#endif

--
1.9.3

2014-12-03 23:49:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: audit: rcu_read_lock() used illegally while idle

On Wed, Dec 03, 2014 at 02:12:43PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 3, 2014 at 2:08 PM, Frederic Weisbecker <[email protected]> wrote:
> > I don't know. It's possible that something went wrong with the recent entry_64.S
> > and ptrace.c rework.
> >
> > Previously we expected to set context tracking to user state from syscall_trace_exit()
> > and to kernel state from syscall_trace_enter(). And if anything using RCU
> > was called between syscall_trace_exit() and the actual return to userspace, the code
> > had to be wrapped between user_exit() *code* user_enter().
> >
> > So it looked like this:
> >
> >
> > syscall {
> > //enter kernel
> > syscall_trace_enter() {
> > user_exit();
> > }
> >
> > syscall()
> >
> > syscall_trace_enter() {
>
> Do you mean syscall_trace_leave()? But syscall_trace_leave isn't called here...

Right :-)

>
> > user_enter();
> > }
> >
> > while (test_thread_flag(TIF_EXIT_WORK)) {
> > if (need_resched()) {
> > schedule_user() {
> > user_exit();
> > schedule()
> > user_enter();
> > }
> > }
> >
> > if ( need signal ) {
> > do_notify_resume() {
> > user_exit()
> > handle signal and stuff
> > user_enter()
> > }
> > }
>
> ... it's called hereabouts or so.
>
> > }
> > }
> >
> > This is suboptimal but it doesn't impact the syscall fastpath
> > and it's correct from cputime accounting and RCU point of views.
> >
> > Now maybe the recent logic rework broke the above assumptions?
>
> The big rework was entry, not exit, so I don't see the issue.

And you're right actually :-) I just rewinded to the times when I added
SCHEDULE_USER and actually things happen a bit differently than I thought
and it looks like things haven't changed much since then

syscall {
//enter kernel
syscall_trace_enter() {
user_exit();
}

syscall()

while (test_thread_flag(TIF_ALLWORK_MASK)) {
if (need_resched()) {
schedule_user() {
user_exit();
schedule()
user_enter();
}
} else {
if (test_thread_flag(TIF_WORK_SYSCALL_EXIT)) {
syscall_trace_leave() {
user_enter();
}
} else if (test_thread_flag(TIF_DO_NOTIFY_MASK) {)
do_notify_resume() {
user_exit()
handle signal and stuff
user_enter()
} else {
//ignored but unexpected, should we warn?
}
}

}

So schedule_user() may well be called before syscall_trace_leave() after all.
This mean that schedule_user() can call user_exit() whereas we are already in
the kernel from context tracking POV. Hence we have a context tracking imbalance
or a double call to user_exit() if you prefer.

Things probably happened to work somehow because double user_foo() calls are simply ignored,
and we've been lucky enough that it didn't explode is most scenarios.

The fix would be to change schedule_user() to handle random context tracking states.
exception_enter/exit() act like context_tracking_save()/context_tracking_restore()
so they fit pretty well there:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24beb9b..6fe82fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2869,15 +2869,18 @@ EXPORT_SYMBOL(schedule);
#ifdef CONFIG_CONTEXT_TRACKING
asmlinkage __visible void __sched schedule_user(void)
{
+ enum ctx_state prev_ctx;
+
/*
* If we come here after a random call to set_need_resched(),
* or we have been woken up remotely but the IPI has not yet arrived,
- * we haven't yet exited the RCU idle mode. Do it here manually until
- * we find a better solution.
+ * the context tracking is in a random state depending on which stage
+ * we are on resuming to userspace. Exception_enter/exit() handle that
+ * well by saving and restoring the current context tracking state.
*/
- user_exit();
+ prev_ctx = exception_enter();
schedule();
- user_enter();
+ exception_exit(prev_ctx);
}
#endif


> In any case, might it make sense to add warnings to user_exit and
> user_enter to ensure that they're called in the state in which they
> should be called?

Yeah I think we need to do that. We'll detect more easily issues like this
one.

2014-12-03 23:50:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2] context_tracking: Restore previous state in schedule_user

On Wed, Dec 03, 2014 at 03:37:08PM -0800, Andy Lutomirski wrote:
> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> and schedule_user will return in RCU user context. This causes RCU
> warnings and possible failures.
>
> This is intended to be a minimal fix suitable for 3.18.
>
> Reported-and-tested-by: Dave Jones <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Fr?d?ric Weisbecker <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Looks like the RCU entry/exit stuff is properly accounted for, so:

Acked-by: Paul E. McKenney <[email protected]>

> ---
>
> Hi all-
>
> This is intended to be a suitable last-minute fix for the RCU issue that
> Dave saw.
>
> Dave, can you confirm that this fixes it?
>
> Fr?d?ric, can you confirm that you think that this will have no effect
> on correct callers of schedule_user and that will do the right thing
> for incorrect callers of schedule_user?
>
> I don't like the x86 asm that calls this at all, and I don't really
> like the fragility of the mechanism is general, but I think that this
> improves the situation enough to avoid problems in the short term.
>
> With the obvious warning added, I get:
>
> [ 0.751022] ------------[ cut here ]------------
> [ 0.751937] WARNING: CPU: 0 PID: 72 at kernel/sched/core.c:2883 schedule_user+0xcf/0xe0()
> [ 0.753477] Modules linked in:
> [ 0.754089] CPU: 0 PID: 72 Comm: mount Not tainted 3.18.0-rc7+ #653
> [ 0.755258] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [ 0.757655] 0000000000000009 ffff880005c13f00 ffffffff81741dca ffff8800069f5a50
> [ 0.759228] 0000000000000000 ffff880005c13f40 ffffffff8108e781 0000000000000246
> [ 0.760758] 0000000000000000 00007fff970441c8 00007fff97043fd0 00007f67794ebcc8
> [ 0.762294] Call Trace:
> [ 0.762775] [<ffffffff81741dca>] dump_stack+0x46/0x58
> [ 0.763739] [<ffffffff8108e781>] warn_slowpath_common+0x81/0xa0
> [ 0.764865] [<ffffffff8108e85a>] warn_slowpath_null+0x1a/0x20
> [ 0.765958] [<ffffffff8174565f>] schedule_user+0xcf/0xe0
> [ 0.766974] [<ffffffff8174ae69>] sysret_careful+0x19/0x1c
> [ 0.768011] ---[ end trace 329f34db2b3be966 ]---
>
> So, yes, we have a bug, and this could cause any number of strange
> problems.
>
> Changes from v1:
> - Added Dave's Tested-by.
> - Fixed a comment typo.
>
> kernel/sched/core.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 24beb9bb4c3e..89e7283015a6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2874,10 +2874,14 @@ asmlinkage __visible void __sched schedule_user(void)
> * or we have been woken up remotely but the IPI has not yet arrived,
> * we haven't yet exited the RCU idle mode. Do it here manually until
> * we find a better solution.
> + *
> + * NB: There are buggy callers of this function. Ideally we
> + * should warn if prev_state != IN_USER, but that will trigger
> + * too frequently to make sense yet.
> */
> - user_exit();
> + enum ctx_state prev_state = exception_enter();
> schedule();
> - user_enter();
> + exception_exit(prev_state);
> }
> #endif
>
> --
> 1.9.3
>

2014-12-03 23:58:41

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] context_tracking: Restore previous state in schedule_user

On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> and schedule_user will return in RCU user context. This causes RCU
> warnings and possible failures.
>
> This is intended to be a minimal fix suitable for 3.18.
>
> Reported-by: Dave Jones <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Fr?d?ric Weisbecker <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Ah, we sent it about at the same time :-)

Might be too late for 3.18 though because it's not a regression.

> ---
>
> Hi all-
>
> This is intended to be a suitable last-minute fix for the RCU issue that
> Dave saw.
>
> Dave, can you confirm that this fixes it?
>
> Fr?d?ric, can you confirm that you think that this will have no effect
> on correct callers of schedule_user and that will do the right thing
> for incorrect callers of schedule_user?

Yes it should be fine.

>
> I don't like the x86 asm that calls this at all, and I don't really
> like the fragility of the mechanism is general, but I think that this
> improves the situation enough to avoid problems in the short term.

At best we should have only one call to user_enter() at the end of the
syscall and exception path once we've completed everything (pending reschedule,
tracing, signals, ...) instead of context tracking fixups on functions that
can be called after syscall_trace_leave(), but that would impact the fastpath.

Although it should be possible to tweak the slow path to do that...

>
> With the obvious warning added, I get:
>
> [ 0.751022] ------------[ cut here ]------------
> [ 0.751937] WARNING: CPU: 0 PID: 72 at kernel/sched/core.c:2883 schedule_user+0xcf/0xe0()
> [ 0.753477] Modules linked in:
> [ 0.754089] CPU: 0 PID: 72 Comm: mount Not tainted 3.18.0-rc7+ #653
> [ 0.755258] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [ 0.757655] 0000000000000009 ffff880005c13f00 ffffffff81741dca ffff8800069f5a50
> [ 0.759228] 0000000000000000 ffff880005c13f40 ffffffff8108e781 0000000000000246
> [ 0.760758] 0000000000000000 00007fff970441c8 00007fff97043fd0 00007f67794ebcc8
> [ 0.762294] Call Trace:
> [ 0.762775] [<ffffffff81741dca>] dump_stack+0x46/0x58
> [ 0.763739] [<ffffffff8108e781>] warn_slowpath_common+0x81/0xa0
> [ 0.764865] [<ffffffff8108e85a>] warn_slowpath_null+0x1a/0x20
> [ 0.765958] [<ffffffff8174565f>] schedule_user+0xcf/0xe0
> [ 0.766974] [<ffffffff8174ae69>] sysret_careful+0x19/0x1c
> [ 0.768011] ---[ end trace 329f34db2b3be966 ]---
>
> So, yes, we have a bug, and this could cause any number of strange
> problems.
>
> kernel/sched/core.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 24beb9bb4c3e..39d9d95331b7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2874,10 +2874,14 @@ asmlinkage __visible void __sched schedule_user(void)
> * or we have been woken up remotely but the IPI has not yet arrived,
> * we haven't yet exited the RCU idle mode. Do it here manually until
> * we find a better solution.

Just need to fix the above comment.

> + *
> + * NB: There are buggy callers of this function. Ideally we
> + * should warn if prev_state != IN_USER, but that will trigger
> + * to frequently to make sense yet.

It's not really the callers of this function that are buggy but the
way we handled context tracking.

> */
> - user_exit();
> + enum ctx_state prev_state = exception_enter();
> schedule();
> - user_enter();
> + exception_exit(prev_state);
> }
> #endif
>
> --
> 1.9.3
>

2014-12-04 00:01:43

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v2] context_tracking: Restore previous state in schedule_user

On Wed, Dec 03, 2014 at 03:50:02PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 03, 2014 at 03:37:08PM -0800, Andy Lutomirski wrote:
> > It appears that some SCHEDULE_USER (asm for schedule_user) callers
> > in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> > and schedule_user will return in RCU user context. This causes RCU
> > warnings and possible failures.
> >
> > This is intended to be a minimal fix suitable for 3.18.
> >
> > Reported-and-tested-by: Dave Jones <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Fr?d?ric Weisbecker <[email protected]>
> > Cc: Paul McKenney <[email protected]>
> > Signed-off-by: Andy Lutomirski <[email protected]>
>
> Looks like the RCU entry/exit stuff is properly accounted for, so:
>
> Acked-by: Paul E. McKenney <[email protected]>

Thanks a lot guys!

I'll take the patch. This just won't go for 3.18 I think, the issue
is there since the early days of context tracking.

2014-12-04 00:04:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] context_tracking: Restore previous state in schedule_user

On Wed, Dec 3, 2014 at 3:58 PM, Frederic Weisbecker <[email protected]> wrote:
> On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
>> It appears that some SCHEDULE_USER (asm for schedule_user) callers
>> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
>> and schedule_user will return in RCU user context. This causes RCU
>> warnings and possible failures.
>>
>> This is intended to be a minimal fix suitable for 3.18.
>>
>> Reported-by: Dave Jones <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Frédéric Weisbecker <[email protected]>
>> Cc: Paul McKenney <[email protected]>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>
> Ah, we sent it about at the same time :-)
>
> Might be too late for 3.18 though because it's not a regression.
>
>> ---
>>
>> Hi all-
>>
>> This is intended to be a suitable last-minute fix for the RCU issue that
>> Dave saw.
>>
>> Dave, can you confirm that this fixes it?
>>
>> Frédéric, can you confirm that you think that this will have no effect
>> on correct callers of schedule_user and that will do the right thing
>> for incorrect callers of schedule_user?
>
> Yes it should be fine.
>
>>
>> I don't like the x86 asm that calls this at all, and I don't really
>> like the fragility of the mechanism is general, but I think that this
>> improves the situation enough to avoid problems in the short term.
>
> At best we should have only one call to user_enter() at the end of the
> syscall and exception path once we've completed everything (pending reschedule,
> tracing, signals, ...) instead of context tracking fixups on functions that
> can be called after syscall_trace_leave(), but that would impact the fastpath.
>
> Although it should be possible to tweak the slow path to do that...

My eventual goal for x86 is rewrite the entire slow path in C. Step
1: delete sysret_audit, etc.

>
>>
>> With the obvious warning added, I get:
>>
>> [ 0.751022] ------------[ cut here ]------------
>> [ 0.751937] WARNING: CPU: 0 PID: 72 at kernel/sched/core.c:2883 schedule_user+0xcf/0xe0()
>> [ 0.753477] Modules linked in:
>> [ 0.754089] CPU: 0 PID: 72 Comm: mount Not tainted 3.18.0-rc7+ #653
>> [ 0.755258] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>> [ 0.757655] 0000000000000009 ffff880005c13f00 ffffffff81741dca ffff8800069f5a50
>> [ 0.759228] 0000000000000000 ffff880005c13f40 ffffffff8108e781 0000000000000246
>> [ 0.760758] 0000000000000000 00007fff970441c8 00007fff97043fd0 00007f67794ebcc8
>> [ 0.762294] Call Trace:
>> [ 0.762775] [<ffffffff81741dca>] dump_stack+0x46/0x58
>> [ 0.763739] [<ffffffff8108e781>] warn_slowpath_common+0x81/0xa0
>> [ 0.764865] [<ffffffff8108e85a>] warn_slowpath_null+0x1a/0x20
>> [ 0.765958] [<ffffffff8174565f>] schedule_user+0xcf/0xe0
>> [ 0.766974] [<ffffffff8174ae69>] sysret_careful+0x19/0x1c
>> [ 0.768011] ---[ end trace 329f34db2b3be966 ]---
>>
>> So, yes, we have a bug, and this could cause any number of strange
>> problems.
>>
>> kernel/sched/core.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 24beb9bb4c3e..39d9d95331b7 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2874,10 +2874,14 @@ asmlinkage __visible void __sched schedule_user(void)
>> * or we have been woken up remotely but the IPI has not yet arrived,
>> * we haven't yet exited the RCU idle mode. Do it here manually until
>> * we find a better solution.
>
> Just need to fix the above comment.
>
>> + *
>> + * NB: There are buggy callers of this function. Ideally we
>> + * should warn if prev_state != IN_USER, but that will trigger
>> + * to frequently to make sense yet.
>
> It's not really the callers of this function that are buggy but the
> way we handled context tracking.

Yeah, one could debate exactly where the bug is.

Anyway, if you're doing this for 3.19, adding a WARN_ON_ONCE and
trying to fix the callers might make sense.

--Andy

>
>> */
>> - user_exit();
>> + enum ctx_state prev_state = exception_enter();
>> schedule();
>> - user_enter();
>> + exception_exit(prev_state);
>> }
>> #endif
>>
>> --
>> 1.9.3
>>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-04 00:30:39

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] context_tracking: Restore previous state in schedule_user

On Wed, Dec 03, 2014 at 04:04:31PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 3, 2014 at 3:58 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
> >> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> >> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> >> and schedule_user will return in RCU user context. This causes RCU
> >> warnings and possible failures.
> >>
> >> This is intended to be a minimal fix suitable for 3.18.
> >>
> >> Reported-by: Dave Jones <[email protected]>
> >> Cc: Oleg Nesterov <[email protected]>
> >> Cc: Fr?d?ric Weisbecker <[email protected]>
> >> Cc: Paul McKenney <[email protected]>
> >> Signed-off-by: Andy Lutomirski <[email protected]>
> >
> > Ah, we sent it about at the same time :-)
> >
> > Might be too late for 3.18 though because it's not a regression.

Wait, so how come that trace didn't start showing up until recently ?

Dave

2014-12-04 00:39:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] context_tracking: Restore previous state in schedule_user

On Wed, Dec 3, 2014 at 4:30 PM, Dave Jones <[email protected]> wrote:
> On Wed, Dec 03, 2014 at 04:04:31PM -0800, Andy Lutomirski wrote:
> > On Wed, Dec 3, 2014 at 3:58 PM, Frederic Weisbecker <[email protected]> wrote:
> > > On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
> > >> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> > >> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> > >> and schedule_user will return in RCU user context. This causes RCU
> > >> warnings and possible failures.
> > >>
> > >> This is intended to be a minimal fix suitable for 3.18.
> > >>
> > >> Reported-by: Dave Jones <[email protected]>
> > >> Cc: Oleg Nesterov <[email protected]>
> > >> Cc: Frédéric Weisbecker <[email protected]>
> > >> Cc: Paul McKenney <[email protected]>
> > >> Signed-off-by: Andy Lutomirski <[email protected]>
> > >
> > > Ah, we sent it about at the same time :-)
> > >
> > > Might be too late for 3.18 though because it's not a regression.
>
> Wait, so how come that trace didn't start showing up until recently ?

Looking at the code, it's because int_careful has the same bug, but
syscall_trace_leave does:

/*
* We may come here right after calling schedule_user()
* or do_notify_resume(), in which case we can be in RCU
* user mode.
*/
user_exit();

which means that this issue was anticipated when that comment was written.

Prior to the 3.18 seccomp changes and the _TIF_WORK typo fix, it would
have been difficult to hit sysret_audit when context tracking was on
(you could do it once on the way out from a syscall that enabled
context tracking). So this is 3.18 regression.

The sysret_audit code is still totally screwed up AFAICT. At the very
least, the whole mess rather strongly suggests that, if both context
tracking and audit are on, then __audit_syscall_exit is called *twice*
on each syscall. __audit_syscall_exit seems to be idempotent, so
maybe no one has noticed that little glitch.

I'll ask the x86 people to include my sysret_audit removal for 3.19,
since I think that this schedule_user change is a better last-minute
fix than removing a whole chunk of asm.

--Andy

>
> Dave
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-04 01:13:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] context_tracking: Restore previous state in schedule_user

On Wed, Dec 03, 2014 at 04:38:46PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 3, 2014 at 4:30 PM, Dave Jones <[email protected]> wrote:
> > On Wed, Dec 03, 2014 at 04:04:31PM -0800, Andy Lutomirski wrote:
> > > On Wed, Dec 3, 2014 at 3:58 PM, Frederic Weisbecker <[email protected]> wrote:
> > > > On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
> > > >> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> > > >> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> > > >> and schedule_user will return in RCU user context. This causes RCU
> > > >> warnings and possible failures.
> > > >>
> > > >> This is intended to be a minimal fix suitable for 3.18.
> > > >>
> > > >> Reported-by: Dave Jones <[email protected]>
> > > >> Cc: Oleg Nesterov <[email protected]>
> > > >> Cc: Fr?d?ric Weisbecker <[email protected]>
> > > >> Cc: Paul McKenney <[email protected]>
> > > >> Signed-off-by: Andy Lutomirski <[email protected]>
> > > >
> > > > Ah, we sent it about at the same time :-)
> > > >
> > > > Might be too late for 3.18 though because it's not a regression.
> >
> > Wait, so how come that trace didn't start showing up until recently ?
>
> Looking at the code, it's because int_careful has the same bug, but
> syscall_trace_leave does:
>
> /*
> * We may come here right after calling schedule_user()
> * or do_notify_resume(), in which case we can be in RCU
> * user mode.
> */
> user_exit();
>
> which means that this issue was anticipated when that comment was written.

Indeed, in fact it was expected to work as long as the code that follows the
syscall is limited to schedule_user(), syscall_trace_leave() and do_notify_resume().
But if anything else is called and uses RCU, this doesn't work anymore.

So user_enter() and user_exit() have been designed to be re-entrant on purpose.

>
> Prior to the 3.18 seccomp changes and the _TIF_WORK typo fix, it would
> have been difficult to hit sysret_audit when context tracking was on
> (you could do it once on the way out from a syscall that enabled
> context tracking). So this is 3.18 regression.

I see now.

So the real problem is not on schedule_user(). It's rather that __audit_syscall_exit()
should we wrapped inside user_exit()/user_enter() or exception_foo(). The latter
is safer in a sensitive patch. That would be the real and simple regression fix.
Tweaking schedule_user() is more risky.

Then, if you like, we can rethink the whole later, define syscall_trace_leave()
as the only place that calls user_enter() and all the other syscall exit functions
(schedule_user(), do_notify_resume(), __audit_syscall_exit()) can just call
exception_enter() - exception_exit() if they can be called after syscall_trace_leave().
Then finally we can make user_enter and user_exit non-reentrant after careful audit
of how other archs use it (sounds scary though).

Or better yet: if you rework the syscall exit slow path, lets call user_enter() at the
very end of the syscall.

>
> The sysret_audit code is still totally screwed up AFAICT. At the very
> least, the whole mess rather strongly suggests that, if both context
> tracking and audit are on, then __audit_syscall_exit is called *twice*
> on each syscall. __audit_syscall_exit seems to be idempotent, so
> maybe no one has noticed that little glitch.
>
> I'll ask the x86 people to include my sysret_audit removal for 3.19,
> since I think that this schedule_user change is a better last-minute
> fix than removing a whole chunk of asm.
>
> --Andy
>
> >
> > Dave
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC