2013-05-09 01:21:33

by Steven Rostedt

Subject: [PATCH] x86/sched/dynamic-ticks: Call new schedule_irq_disabled() for scheduling in entry_64.S

I started testing the new NOHZ_FULL in the kernel and had some issues,
so I started function tracing and this bug report came out:


[23446.458073] ------------[ cut here ]------------
[23446.461028] WARNING: at
/home/rostedt/work/git/linux-trace.git/kernel/rcutree.c:388
rcu_eqs_enter+0x4b/0x89()
[23446.466096] Modules linked in: ebtables ipt_MASQUERADE sunrpc bridge
stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
ip6_tables ipv6 uinput snd_hda_codec_idt snd_hda_intel snd_hda_codec
snd_hwdep snd_seq snd_seq_device snd_pcm kvm_intel snd_timer snd kvm
soundcore shpchp snd_page_alloc microcode i2c_i801 pata_acpi
firewire_ohci firewire_core crc_itu_t ata_generic i915 drm_kms_helper
drm i2c_algo_bit i2c_core video
[23446.466096] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-test+ #11
[23446.466096] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[23446.466096] ffffffff814879f0 ffffffff81004fa6 ffffffff810337af
ffff88007d400000
[23446.466096] 0000000000000000 0000000000000000 ffff88007d40d900
ffffffff81a01f00
[23446.466096] ffffffff81acba00 ffff88007d6495c0 ffffffff810a1f36
0000000000000a28
[23446.466096] Call Trace:
[23446.466096] [<ffffffff814879f0>] ? dump_stack+0xd/0x17
[23446.466096] [<ffffffff81004fa6>] ? show_stack+0x5/0x3e
[23446.466096] [<ffffffff810337af>] ? warn_slowpath_common+0x64/0x7c
[23446.466096] [<ffffffff810a1f36>] ? rcu_eqs_enter+0x4b/0x89
[23446.466096] [<ffffffff810a1f8e>] ? rcu_idle_enter+0x1a/0x30
[23446.466096] [<ffffffff81491bc5>] ? ftrace_graph_caller+0x85/0x85
[23446.466096] [<ffffffff810707f3>] cpu_startup_entry+0xb3/0x11c
[23446.466096] [<ffffffff81ae9d2e>] ? start_kernel+0x41a/0x425
[23446.466096] [<ffffffff81ae972d>] ? repair_env_string+0x54/0x54
[23446.466096] ---[ end trace ddbb69ae2a0f6687 ]---
[23446.466096] ------------[ cut here ]------------

This is caused by an imbalance of rcu_eqs_enter() and rcu_eqs_exit().
Adding more tracing, I also discovered that there seemed to be an
imbalance between user_exit() and user_enter(). Specifically, I found
that user_enter() was called, then a migration happened, and user_exit()
was never called before the schedule. Then I noticed this in my trace:

run-isol-11546 1.N.. 37363.637641: function: schedule_user <-- retint_careful
run-isol-11546 1.N.. 37363.637641: function: __schedule <-- preempt_schedule

The funny part here is that schedule_user() does not call
preempt_schedule(). But then I also noticed the flags on the call to
schedule_user(): "1.N..". This shows that it was running on CPU 1 with
NEED_RESCHED set (N), but with both interrupts and preemption enabled.
If an interrupt came in here we could schedule out again, but that would
call preempt_schedule_irq(), which has code to check whether it is in
"user context" (according to dynamic ticks) and would do the user_exit()
too. But something still didn't seem right here.

Then I realized that it was actually the function tracer itself: for
every function it traces, it calls preempt_disable() and
preempt_enable() to record the data in a per-CPU buffer. That
preempt_enable() noticed NEED_RESCHED set and, since preemption was
enabled, kindly called preempt_schedule() for us!

All this happened before schedule_user() was able to call user_exit()
and take us out of dynamic-tick user context.
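
To make that concrete, here is roughly what preempt_enable() boils down
to when CONFIG_PREEMPT is set (a simplified sketch of
include/linux/preempt.h, not the exact source):

#define preempt_enable()						\
do {									\
	preempt_enable_no_resched();	/* drop preempt_count */	\
	barrier();							\
	if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))		\
		preempt_schedule();	/* <-- this is what fired */	\
} while (0)

So the tracer's preempt_enable(), run while recording the entry of
schedule_user() itself, saw NEED_RESCHED and scheduled before
schedule_user() could call user_exit().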

I then looked at the code in entry_64.S and noticed that the calls to
SCHEDULE_USER were all bracketed by ENABLE_INTERRUPTS before and
DISABLE_INTERRUPTS after the call to schedule. The SCHEDULE_USER macro
also differs depending on whether CONTEXT_TRACKING is enabled.

I created a new schedule_irq_disabled(), which is similar to
schedule_preempt_disabled() and works like preempt_schedule_irq(): it
includes the exception_enter() and exception_exit() routines that take
care of the dynamic-tick accounting whether we interrupted user mode or
not. It also does this before enabling interrupts, so we don't need to
worry about any preemption happening at the wrong time.
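
For reference, exception_enter()/exception_exit() are roughly the
following (a simplified sketch of include/linux/context_tracking.h, not
the exact source):

static inline enum ctx_state exception_enter(void)
{
	enum ctx_state prev_ctx;

	/* Remember whether we interrupted user or kernel context ... */
	prev_ctx = this_cpu_read(context_tracking.state);
	/* ... and get RCU and the tick out of user mode. */
	user_exit();

	return prev_ctx;
}

static inline void exception_exit(enum ctx_state prev_ctx)
{
	/* Restore whatever context state was interrupted. */
	if (prev_ctx == IN_USER)
		user_enter();
}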

Now the code always calls schedule_irq_disabled() whether or not
CONTEXT_TRACKING is configured, which cleans up entry_64.S and removes
the need for the context_tracking.h header in asm.

Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-trace.git/arch/x86/include/asm/context_tracking.h
===================================================================
--- linux-trace.git.orig/arch/x86/include/asm/context_tracking.h
+++ /dev/null
@@ -1,10 +0,0 @@
-#ifndef _ASM_X86_CONTEXT_TRACKING_H
-#define _ASM_X86_CONTEXT_TRACKING_H
-
-#ifdef CONFIG_CONTEXT_TRACKING
-# define SCHEDULE_USER call schedule_user
-#else
-# define SCHEDULE_USER call schedule
-#endif
-
-#endif
Index: linux-trace.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-trace.git.orig/arch/x86/kernel/entry_64.S
+++ linux-trace.git/arch/x86/kernel/entry_64.S
@@ -56,7 +56,6 @@
#include <asm/ftrace.h>
#include <asm/percpu.h>
#include <asm/asm.h>
-#include <asm/context_tracking.h>
#include <asm/smap.h>
#include <linux/err.h>

@@ -654,6 +653,7 @@ sysret_check:
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
+sysret_check_irqsoff:
movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
andl %edi,%edx
jnz sysret_careful
@@ -675,12 +675,10 @@ sysret_check:
sysret_careful:
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
- SCHEDULE_USER
+ call schedule_irq_disabled
popq_cfi %rdi
- jmp sysret_check
+ jmp sysret_check_irqsoff

/* Handle a signal */
sysret_signal:
@@ -788,13 +786,9 @@ GLOBAL(int_with_check)
int_careful:
bt $TIF_NEED_RESCHED,%edx
jnc int_very_careful
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
- SCHEDULE_USER
+ call schedule_irq_disabled
popq_cfi %rdi
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
jmp int_with_check

/* handle signals and tracing -- both require a full stack frame */
@@ -1088,14 +1082,10 @@ retint_careful:
CFI_RESTORE_STATE
bt $TIF_NEED_RESCHED,%edx
jnc retint_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
- SCHEDULE_USER
+ call schedule_irq_disabled
popq_cfi %rdi
GET_THREAD_INFO(%rcx)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
jmp retint_check

retint_signal:
@@ -1534,11 +1524,7 @@ paranoid_userspace:
TRACE_IRQS_OFF
jmp paranoid_userspace
paranoid_schedule:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_ANY)
- SCHEDULE_USER
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_OFF
+ call schedule_irq_disabled
jmp paranoid_userspace
CFI_ENDPROC
END(paranoid_exit)
Index: linux-trace.git/kernel/sched/core.c
===================================================================
--- linux-trace.git.orig/kernel/sched/core.c
+++ linux-trace.git/kernel/sched/core.c
@@ -3034,21 +3034,6 @@ asmlinkage void __sched schedule(void)
}
EXPORT_SYMBOL(schedule);

-#ifdef CONFIG_CONTEXT_TRACKING
-asmlinkage void __sched schedule_user(void)
-{
- /*
- * If we come here after a random call to set_need_resched(),
- * or we have been woken up remotely but the IPI has not yet arrived,
- * we haven't yet exited the RCU idle mode. Do it here manually until
- * we find a better solution.
- */
- user_exit();
- schedule();
- user_enter();
-}
-#endif
-
/**
* schedule_preempt_disabled - called with preemption disabled
*
@@ -3127,6 +3112,24 @@ asmlinkage void __sched preempt_schedule

#endif /* CONFIG_PREEMPT */

+/*
+ * This is the entry point to schedule() from kernel context with irqs
+ * disabled. Note that it is called, and returns, with irqs disabled;
+ * this protects us against recursive calls from interrupts.
+ */
+asmlinkage void __sched schedule_irq_disabled(void)
+{
+ enum ctx_state prev_state;
+
+ prev_state = exception_enter();
+
+ local_irq_enable();
+ __schedule();
+ local_irq_disable();
+
+ exception_exit(prev_state);
+}
+
int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
void *key)
{


2013-05-10 00:22:12

by Frederic Weisbecker

Subject: Re: [PATCH] x86/sched/dynamic-ticks: Call new schedule_irq_disabled() for scheduling in entry_64.S

On Wed, May 08, 2013 at 09:21:20PM -0400, Steven Rostedt wrote:
> [...]
>
> @@ -654,6 +653,7 @@ sysret_check:
> LOCKDEP_SYS_EXIT
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> +sysret_check_irqsoff:
> movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
> andl %edi,%edx
> jnz sysret_careful
> @@ -675,12 +675,10 @@ sysret_check:
> sysret_careful:
> bt $TIF_NEED_RESCHED,%edx
> jnc sysret_signal
> - TRACE_IRQS_ON
> - ENABLE_INTERRUPTS(CLBR_NONE)
> pushq_cfi %rdi
> - SCHEDULE_USER
> + call schedule_irq_disabled
> popq_cfi %rdi
> - jmp sysret_check
> + jmp sysret_check_irqsoff

Don't we still want to call LOCKDEP_SYS_EXIT? The previously
scheduled task might have taken some locks.

> [...]
>
> +/*
> + * This is the entry point to schedule() from kernel context with irqs
> + * disabled. Note that it is called, and returns, with irqs disabled;
> + * this protects us against recursive calls from interrupts.
> + */
> +asmlinkage void __sched schedule_irq_disabled(void)
> +{
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();

So why have you turned user_enter() into exception_enter()?
Is it to make it work anywhere, and not just for preemption at
user resume time?

> +
> + local_irq_enable();
> + __schedule();

Are you sure it's fine to call __schedule() instead of schedule()?
I don't know much about the blk work handled there, but it may be
worth considering.

> + local_irq_disable();
> +
> + exception_exit(prev_state);
> +}

The patch looks pretty good otherwise.

Thanks.


2013-05-10 00:47:13

by Steven Rostedt

Subject: Re: [PATCH] x86/sched/dynamic-ticks: Call new schedule_irq_disabled() for scheduling in entry_64.S

On Fri, 2013-05-10 at 02:22 +0200, Frederic Weisbecker wrote:
>
> > @@ -654,6 +653,7 @@ sysret_check:
> > LOCKDEP_SYS_EXIT
> > DISABLE_INTERRUPTS(CLBR_NONE)
> > TRACE_IRQS_OFF
> > +sysret_check_irqsoff:
> > movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
> > andl %edi,%edx
> > jnz sysret_careful
> > @@ -675,12 +675,10 @@ sysret_check:
> > sysret_careful:
> > bt $TIF_NEED_RESCHED,%edx
> > jnc sysret_signal
> > - TRACE_IRQS_ON
> > - ENABLE_INTERRUPTS(CLBR_NONE)
> > pushq_cfi %rdi
> > - SCHEDULE_USER
> > + call schedule_irq_disabled
> > popq_cfi %rdi
> > - jmp sysret_check
> > + jmp sysret_check_irqsoff
>
> Don't we still want to call LOCKDEP_SYS_EXIT? The previously
> scheduled task might have taken some locks.

Sure, but would that still catch this? I thought LOCKDEP_SYS_EXIT only
cared about the current task holding locks while going into user land,
which it shouldn't be doing here. And you are concerned about the
previous task. Well, the scheduler should have caught that, as you
shouldn't hold any spin locks across a schedule. And it doesn't matter
if the previous task held a mutex; the previous task is still in kernel
space.
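
For reference, lockdep_sys_exit() only looks at current's held-lock
count; roughly (a simplified sketch of the kernel/lockdep.c code, not
verbatim):

void lockdep_sys_exit(void)
{
	struct task_struct *curr = current;

	/* Only warn if *current* still holds locks on the way out. */
	if (unlikely(curr->lockdep_depth)) {
		if (!debug_locks_off())
			return;
		printk("[ BUG: lock held when returning to user space! ]\n");
		printk("%s/%d is leaving the kernel with locks still held!\n",
		       curr->comm, curr->pid);
		lockdep_print_held_locks(curr);
	}
}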

>
> > +/*
> > + * This is the entry point to schedule() from kernel context with irqs
> > + * disabled. Note that it is called, and returns, with irqs disabled;
> > + * this protects us against recursive calls from interrupts.
> > + */
> > +asmlinkage void __sched schedule_irq_disabled(void)
> > +{
> > + enum ctx_state prev_state;
> > +
> > + prev_state = exception_enter();
>
> So why have you turned user_enter() into exception_enter()?
> Is it to make it work anywhere, and not just for preemption at
> user resume time?

Pretty much. I just copied the way preempt_schedule_irq() did this.
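
For context, preempt_schedule_irq() does roughly this (a sketch of
kernel/sched/core.c, not verbatim):

asmlinkage void __sched preempt_schedule_irq(void)
{
	struct thread_info *ti = current_thread_info();
	enum ctx_state prev_state;

	/* Catch callers of preempt_schedule_irq() with irqs enabled */
	BUG_ON(ti->preempt_count || !irqs_disabled());

	/* Same save/restore of the context-tracking state. */
	prev_state = exception_enter();

	do {
		add_preempt_count(PREEMPT_ACTIVE);
		local_irq_enable();
		__schedule();
		local_irq_disable();
		sub_preempt_count(PREEMPT_ACTIVE);

		/* Re-check in case we missed a wakeup in between. */
		barrier();
	} while (need_resched());

	exception_exit(prev_state);
}

schedule_irq_disabled() follows the same pattern without the
PREEMPT_ACTIVE dance; the entry code itself loops back to re-check
NEED_RESCHED.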

>
> > +
> > + local_irq_enable();
> > + __schedule();
>
> Are you sure it's fine to call __schedule() instead of schedule()?
> > I don't know much about the blk work handled there, but it may be
> > worth considering.

Note, this isn't a call to schedule that was made by normal kernel
space. This is very much like a preemption. In fact, it is a preemption,
and my first patch used preempt_schedule_irq() instead, but as that's
only built when CONFIG_PREEMPT is enabled, it seemed a bit hacky to pull
it out of that context.

This is only called when the task is being preempted. But because it's
going to user land, it's OK. The blk code there is more worried about a
task that did some work, set itself to TASK_UNINTERRUPTIBLE, and then
called schedule() (think completions). That is, the blk work that needs
to be submitted may be what wakes up this task.

We should never be in anything but TASK_RUNNING when going into user
space.
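
To illustrate, schedule() is just (roughly, a sketch of
kernel/sched/core.c):

static inline void sched_submit_work(struct task_struct *tsk)
{
	/* A TASK_RUNNING task isn't about to block: nothing to flush. */
	if (!tsk->state || tsk_is_pi_blocked(tsk))
		return;
	/*
	 * If we are going to sleep and we have plugged IO queued,
	 * make sure to submit it to avoid deadlocks.
	 */
	if (blk_needs_flush_plug(tsk))
		blk_schedule_flush_plug(tsk);
}

asmlinkage void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);
	__schedule();
}

So with tsk->state == TASK_RUNNING, sched_submit_work() is a no-op, and
calling __schedule() directly loses nothing.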

>
> > + local_irq_disable();
> > +
> > + exception_exit(prev_state);
> > +}
>
> The patch looks pretty good otherwise.

Thanks!

-- Steve

2013-05-10 00:51:34

by Steven Rostedt

Subject: Re: [PATCH] x86/sched/dynamic-ticks: Call new schedule_irq_disabled() for scheduling in entry_64.S

On Thu, 2013-05-09 at 20:47 -0400, Steven Rostedt wrote:

> >
> > > +
> > > + local_irq_enable();
> > > + __schedule();
> >
> > Are you sure it's fine to call __schedule() instead of schedule()?
> > I don't know much about the blk work handled there, but it may be
> > worth considering.
>
> [...]
>
> We should never be in anything but TASK_RUNNING when going into user
> space.

Maybe I should rename this function to schedule_user() to make explicit
that it is for scheduling when we came from, or are going to, user
space.

-- Steve

2013-05-11 01:26:25

by Frederic Weisbecker

Subject: Re: [PATCH] x86/sched/dynamic-ticks: Call new schedule_irq_disabled() for scheduling in entry_64.S

On Thu, May 09, 2013 at 08:47:09PM -0400, Steven Rostedt wrote:
> On Fri, 2013-05-10 at 02:22 +0200, Frederic Weisbecker wrote:
> >
> > > [...]
> >
> > Don't we still want to call LOCKDEP_SYS_EXIT? The previously
> > scheduled task might have taken some locks.
>
> Sure, but would that still catch this? I thought LOCKDEP_SYS_EXIT only
> cared about the current task holding locks while going into user land,
> which it shouldn't be doing here. And you are concerned about the
> previous task. Well, the scheduler should have caught that, as you
> shouldn't hold any spin locks across a schedule. And it doesn't matter
> if the previous task held a mutex; the previous task is still in kernel
> space.

Ah, you're right. But still, the current task has had several
opportunities to take locks since the context switch, so it's still
relevant to check.

> >
> > > +
> > > + local_irq_enable();
> > > + __schedule();
> >
> > Are you sure it's fine to call __schedule() instead of schedule()?
> > > I don't know much about the blk work handled there, but it may be
> > > worth considering.
>
> [...]
>
> This is only called when the task is being preempted. But because it's
> going to user land, it's OK. The blk code there is more worried about a
> task that did some work, set itself to TASK_UNINTERRUPTIBLE, and then
> called schedule() (think completions). That is, the blk work that needs
> to be submitted may be what wakes up this task.
>
> We should never be in anything but TASK_RUNNING when going into user
> space.

Ok, I haven't checked the details of what that blk code is doing, so if
it really doesn't care about any kind of preemption, that's fine.