2008-11-05 12:57:37

by Zdenek Kabelac

[permalink] [raw]
Subject: sys_sched_yield keeps locked irq before call of schedule()

Hi

With the recent 2.6.28-rc3 kernel I've noticed that schedule() is
sometimes being called with IRQs disabled:

Call Trace:
[<ffffffff81334592>] _spin_lock_irq+0x92/0xa0
[<ffffffff8133126b>] schedule+0x13b/0x4cb
[<ffffffff81013c10>] ? native_sched_clock+0x70/0xa0
[<ffffffff81040930>] ? sys_sched_yield+0x0/0x80
[<ffffffff810409a5>] sys_sched_yield+0x75/0x80
[<ffffffff8100c6bb>] system_call_fastpath+0x16/0x1b


This is a result of sys_sched_yield() calling schedule() while it
keeps IRQs disabled.

Is it correct to call schedule(), which 'usually' expects IRQs to be
enabled, and to run some 'lengthy' operations (i.e. debug statistics)
that do not need to keep IRQs disabled for such a long time?

I've no idea whether this is a bug or a feature, so for now I'm using
this simple patch to fix the issue.

diff --git a/kernel/sched.c b/kernel/sched.c
index e8819bc..80eb633 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5546,6 +5546,7 @@ asmlinkage long sys_sched_yield(void)
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
_raw_spin_unlock(&rq->lock);
preempt_enable_no_resched();
+ local_irq_enable();

schedule();


Zdenek


2008-11-05 13:01:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()


* Zdenek Kabelac <[email protected]> wrote:

> Hi
>
> With the recent 2.6.28-rc3 kernel I've noticed that schedule() is
> sometimes being called with IRQs disabled:
>
> Call Trace:
> [<ffffffff81334592>] _spin_lock_irq+0x92/0xa0
> [<ffffffff8133126b>] schedule+0x13b/0x4cb
> [<ffffffff81013c10>] ? native_sched_clock+0x70/0xa0
> [<ffffffff81040930>] ? sys_sched_yield+0x0/0x80
> [<ffffffff810409a5>] sys_sched_yield+0x75/0x80
> [<ffffffff8100c6bb>] system_call_fastpath+0x16/0x1b

is this some special warning you added? The stock kernel certainly
does not emit this warning.

> This is a result of sys_sched_yield() calling schedule() while it
> keeps IRQs disabled.
>
> Is it correct to call schedule(), which 'usually' expects IRQs to be
> enabled, and to run some 'lengthy' operations (i.e. debug statistics)
> that do not need to keep IRQs disabled for such a long time?

the debug statistics are all fast and we disable interrupts straight
away when entering schedule().
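
(roughly, the relevant 2.6.28 entry path - a trimmed, from-memory
sketch, not a verbatim copy:)

asmlinkage void __sched schedule(void)
{
	struct task_struct *prev;
	struct rq *rq;
	int cpu;

need_resched:
	preempt_disable();
	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;
	...
	schedule_debug(prev);		/* debug/statistics checks */
	...
	spin_lock_irq(&rq->lock);	/* irqs off + rq lock taken here */
	...
}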

Ingo

2008-11-05 13:09:47

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()

2008/11/5 Ingo Molnar <[email protected]>:
>
> * Zdenek Kabelac <[email protected]> wrote:
>
>> Hi
>>
>> With the recent 2.6.28-rc3 kernel I've noticed that schedule() is
>> sometimes being called with IRQs disabled:
>>
>> Call Trace:
>> [<ffffffff81334592>] _spin_lock_irq+0x92/0xa0
>> [<ffffffff8133126b>] schedule+0x13b/0x4cb
>> [<ffffffff81013c10>] ? native_sched_clock+0x70/0xa0
>> [<ffffffff81040930>] ? sys_sched_yield+0x0/0x80
>> [<ffffffff810409a5>] sys_sched_yield+0x75/0x80
>> [<ffffffff8100c6bb>] system_call_fastpath+0x16/0x1b
>
> is this some special warning you added? The stock kernel certainly
> does not emit this warning.

Yes - it's my personal debug checker that tries to watch whether irq &
irqsafe are used in pairs. It shows a warning if there is a call of
spin_lock_irq() while IRQs are already disabled.

>
>> This is a result of sys_sched_yield() calling schedule() while it
>> keeps IRQs disabled.
>>
>> Is it correct to call schedule(), which 'usually' expects IRQs to be
>> enabled, and to run some 'lengthy' operations (i.e. debug statistics)
>> that do not need to keep IRQs disabled for such a long time?
>
> the debug statistics are all fast and we disable interrupts straight
> away when entering schedule().

IRQs are disabled with spin_lock_irq() inside schedule(), and it looks
like there are quite a few lines above it which get the 'extra' bonus of
being run with IRQs disabled - but as I said, I'm just checking
whether this is intentional or not.

Zdenek

2008-11-05 14:05:32

by Arnd Bergmann

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()

On Wednesday 05 November 2008, Zdenek Kabelac wrote:
> > is this some special warning you added? The stock kernel certainly
> > does not emit this warning.
>
> Yes - it's my personal debug checker that tries to watch whether irq &
> irqsafe are used in pairs. It shows a warning if there is a call of
> spin_lock_irq() while IRQs are already disabled.

I've done such a checker before as well, but back then it showed
far too many false positives. If you have a working version, can you
post your patch here? I think that would be very useful in the kernel.

Arnd <><

2008-11-05 14:37:34

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()

2008/11/5 Arnd Bergmann <[email protected]>:
> On Wednesday 05 November 2008, Zdenek Kabelac wrote:
>> > is this some special warning you added? The stock kernel certainly
>> > does not emit this warning.
>>
>> Yes - it's my personal debug checker that tries to watch whether irq &
>> irqsafe are used in pairs. It shows a warning if there is a call of
>> spin_lock_irq() while IRQs are already disabled.
>
> I've done such a checker before as well, but back then it showed
> far too many false positives. If you have a working version, can you
> post your patch here? I think that would be very useful in the kernel.

It's really nothing 'perfect' and has no aspiration to become part of
the kernel at all - I'm just using it to check the correctness of my own
development. Also, to avoid lots of false positives I've added a couple
of 'fixes' where I use irqsave() instead of the plain 'irq()' lock - but
those are mostly codepaths where the kernel prints some oopses....

So here is my simple patch - which is most probably horribly broken :)
but it helps in some cases to detect a potential problem.

diff --git a/drivers/char/tty_audit.c b/drivers/char/tty_audit.c
index 5787249..2fe0846 100644
--- a/drivers/char/tty_audit.c
+++ b/drivers/char/tty_audit.c
@@ -121,11 +121,12 @@ static void tty_audit_buf_push_current(struct tty_audit_buf *buf)
void tty_audit_exit(void)
{
struct tty_audit_buf *buf;
+ unsigned long flags;

- spin_lock_irq(&current->sighand->siglock);
+ spin_lock_irqsave(&current->sighand->siglock, flags);
buf = current->signal->tty_audit_buf;
current->signal->tty_audit_buf = NULL;
- spin_unlock_irq(&current->sighand->siglock);
+ spin_unlock_irqrestore(&current->sighand->siglock, flags);
if (!buf)
return;

diff --git a/kernel/acct.c b/kernel/acct.c
index f6006a6..1a03be3 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -598,6 +598,7 @@ void acct_collect(long exitcode, int group_dead)
{
struct pacct_struct *pacct = &current->signal->pacct;
unsigned long vsize = 0;
+ unsigned long flags;

if (group_dead && current->mm) {
struct vm_area_struct *vma;
@@ -610,7 +611,7 @@ void acct_collect(long exitcode, int group_dead)
up_read(&current->mm->mmap_sem);
}

- spin_lock_irq(&current->sighand->siglock);
+ spin_lock_irqsave(&current->sighand->siglock, flags);
if (group_dead)
pacct->ac_mem = vsize / 1024;
if (thread_group_leader(current)) {
@@ -628,7 +629,7 @@ void acct_collect(long exitcode, int group_dead)
pacct->ac_stime = cputime_add(pacct->ac_stime, current->stime);
pacct->ac_minflt += current->min_flt;
pacct->ac_majflt += current->maj_flt;
- spin_unlock_irq(&current->sighand->siglock);
+ spin_unlock_irqrestore(&current->sighand->siglock, flags);
}

static void acct_process_in_ns(struct pid_namespace *ns)
diff --git a/kernel/exit.c b/kernel/exit.c
index 80137a5..104cbfc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -158,11 +158,12 @@ void release_task(struct task_struct * p)
{
struct task_struct *leader;
int zap_leader;
+ unsigned long flags;
repeat:
tracehook_prepare_release_task(p);
atomic_dec(&p->user->processes);
proc_flush_task(p);
- write_lock_irq(&tasklist_lock);
+ write_lock_irqsave(&tasklist_lock, flags);
tracehook_finish_release_task(p);
__exit_signal(p);

@@ -194,7 +195,7 @@ repeat:
leader->exit_state = EXIT_DEAD;
}

- write_unlock_irq(&tasklist_lock);
+ write_unlock_irqrestore(&tasklist_lock, flags);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);

@@ -873,10 +874,11 @@ static struct task_struct *find_new_reaper(struct task_struct *father)

static void forget_original_parent(struct task_struct *father)
{
+ unsigned long flags;
struct task_struct *p, *n, *reaper;
LIST_HEAD(ptrace_dead);

- write_lock_irq(&tasklist_lock);
+ write_lock_irqsave(&tasklist_lock, flags);
reaper = find_new_reaper(father);
/*
* First clean up ptrace if we were using it.
@@ -892,7 +894,7 @@ static void forget_original_parent(struct task_struct *father)
reparent_thread(p, father);
}

- write_unlock_irq(&tasklist_lock);
+ write_unlock_irqrestore(&tasklist_lock, flags);
BUG_ON(!list_empty(&father->children));

ptrace_exit_finish(father, &ptrace_dead);
@@ -906,6 +908,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
{
int signal;
void *cookie;
+ unsigned long flags;

/*
* This does two things:
@@ -918,7 +921,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
forget_original_parent(tsk);
exit_task_namespaces(tsk);

- write_lock_irq(&tasklist_lock);
+ write_lock_irqsave(&tasklist_lock, flags);
if (group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);

@@ -954,7 +957,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
tsk->signal->notify_count < 0)
wake_up_process(tsk->signal->group_exit_task);

- write_unlock_irq(&tasklist_lock);
+ write_unlock_irqrestore(&tasklist_lock, flags);

tracehook_report_death(tsk, signal, cookie, group_dead);

diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index 29ab207..342bec6 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -98,6 +98,10 @@ EXPORT_SYMBOL(_spin_lock_irqsave);

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
+ if (raw_irqs_disabled()) {
+ printk("SPIN IRQ ALREADY DISABLED\n");
+ dump_stack();
+ }
local_irq_disable();
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
@@ -128,6 +132,10 @@ EXPORT_SYMBOL(_read_lock_irqsave);

void __lockfunc _read_lock_irq(rwlock_t *lock)
{
+ if (raw_irqs_disabled()) {
+ printk("READ IRQ ALREADY DISABLED\n");
+ dump_stack();
+ }
local_irq_disable();
preempt_disable();
rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
@@ -158,6 +166,10 @@ EXPORT_SYMBOL(_write_lock_irqsave);

void __lockfunc _write_lock_irq(rwlock_t *lock)
{
+ if (raw_irqs_disabled()) {
+ printk("WRITE IRQ ALREADY DISABLED\n");
+ dump_stack();
+ }
local_irq_disable();
preempt_disable();
rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
@@ -350,6 +362,10 @@ EXPORT_SYMBOL(_read_unlock);

void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
{
+ if (!raw_irqs_disabled()) {
+ printk("SPIN IRQ ALREADY ENABLED\n");
+ dump_stack();
+ }
spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
local_irq_restore(flags);
@@ -359,6 +375,10 @@ EXPORT_SYMBOL(_spin_unlock_irqrestore);

void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
+ if (!raw_irqs_disabled()) {
+ printk("SPIN IRQ ALREADY ENABLED\n");
+ dump_stack();
+ }
spin_release(&lock->dep_map, 1, _RET_IP_);
_raw_spin_unlock(lock);
local_irq_enable();
@@ -377,6 +397,10 @@ EXPORT_SYMBOL(_spin_unlock_bh);

void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
{
+ if (!raw_irqs_disabled()) {
+ printk("READ IRQ ALREADY ENABLED\n");
+ dump_stack();
+ }
rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_read_unlock(lock);
local_irq_restore(flags);
@@ -386,6 +410,10 @@ EXPORT_SYMBOL(_read_unlock_irqrestore);

void __lockfunc _read_unlock_irq(rwlock_t *lock)
{
+ if (!raw_irqs_disabled()) {
+ printk("READ IRQ ALREADY ENABLED\n");
+ dump_stack();
+ }
rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_read_unlock(lock);
local_irq_enable();
@@ -404,6 +432,10 @@ EXPORT_SYMBOL(_read_unlock_bh);

void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
{
+ if (!raw_irqs_disabled()) {
+ printk("WRITE IRQ ALREADY ENABLED\n");
+ dump_stack();
+ }
rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_write_unlock(lock);
local_irq_restore(flags);
@@ -413,6 +445,10 @@ EXPORT_SYMBOL(_write_unlock_irqrestore);

void __lockfunc _write_unlock_irq(rwlock_t *lock)
{
+ if (!raw_irqs_disabled()) {
+ printk("WRITE IRQ ALREADY ENABLED\n");
+ dump_stack();
+ }
rwlock_release(&lock->dep_map, 1, _RET_IP_);
_raw_write_unlock(lock);
local_irq_enable();

2008-11-06 07:58:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()


* Zdenek Kabelac <[email protected]> wrote:

> > is this some special warning you added? The stock kernel certainly
> > does not emit this warning.
>
> Yes - it's my personal debug checker that tries to watch whether irq
> & irqsafe are used in pairs. It shows a warning if there is a call
> of spin_lock_irq() while IRQs are already disabled.

btw., i added lockdep instrumentation for this two years ago,
it's in the upstream kernel:

/proc/lockdep_stats: redundant hardirq ons: 7744
/proc/lockdep_stats: redundant hardirq offs: 1873736
/proc/lockdep_stats: redundant softirq ons: 0
/proc/lockdep_stats: redundant softirq offs: 0

you could extend the code to trace it. See
kernel/lockdep.c:trace_hardirqs_on_caller(). This bit does it:

	if (unlikely(curr->hardirqs_enabled)) {
		debug_atomic_inc(&redundant_hardirqs_on);
		return;
	}

you could add a ftrace_printk("redundant hardirq") call right there,
enable ftrace, and enable stack tracing of each trace entry:

echo stacktrace > /debug/tracing/iter_ctrl

then you should see all the call sites in /debug/tracing/trace.
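
i.e. something like this (an untested sketch - just dropping the
printk into that branch of trace_hardirqs_on_caller()):

	if (unlikely(curr->hardirqs_enabled)) {
		debug_atomic_inc(&redundant_hardirqs_on);
		/* emit a trace entry so the stacktrace option shows the caller */
		ftrace_printk("redundant hardirq on\n");
		return;
	}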

Ingo

2008-11-06 13:46:52

by Steven Rostedt

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()


On Thu, 6 Nov 2008, Ingo Molnar wrote:
>
> * Zdenek Kabelac <[email protected]> wrote:
>
> > > is this some special warning you added? The stock kernel certainly
> > > does not emit this warning.
> >
> > Yes - it's my personal debug checker that tries to watch whether irq
> > & irqsafe are used in pairs. It shows a warning if there is a call
> > of spin_lock_irq() while IRQs are already disabled.
>
> btw., i added lockdep instrumentation for this two years ago,
> it's in the upstream kernel:
>
> /proc/lockdep_stats: redundant hardirq ons: 7744
> /proc/lockdep_stats: redundant hardirq offs: 1873736
> /proc/lockdep_stats: redundant softirq ons: 0
> /proc/lockdep_stats: redundant softirq offs: 0
>
> you could extend the code to trace it. See
> kernel/lockdep.c:trace_hardirqs_on_caller(). This bit does it:
>
> 	if (unlikely(curr->hardirqs_enabled)) {
> 		debug_atomic_inc(&redundant_hardirqs_on);
> 		return;
> 	}
>
> you could add a ftrace_printk("redundant hardirq") call right there,
> enable ftrace, and enable stack tracing of each trace entry:

Quick note. Currently ftrace_printk only shows up if it is set in the
iter_ctrl.

echo trace_printk > /debug/tracing/iter_ctrl

I'll send out a patch to have it on by default.

-- Steve

>
> echo stacktrace > /debug/tracing/iter_ctrl
>
> then you should see all the call sites in /debug/tracing/trace.
>
> Ingo
>
>

2008-11-06 14:34:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()


* Steven Rostedt <[email protected]> wrote:

>
> On Thu, 6 Nov 2008, Ingo Molnar wrote:
> >
> > * Zdenek Kabelac <[email protected]> wrote:
> >
> > > > is this some special warning you added? The stock kernel certainly
> > > > does not emit this warning.
> > >
> > > Yes - it's my personal debug checker that tries to watch whether irq
> > > & irqsafe are used in pairs. It shows a warning if there is a call
> > > of spin_lock_irq() while IRQs are already disabled.
> >
> > btw., i added lockdep instrumentation for this two years ago,
> > it's in the upstream kernel:
> >
> > /proc/lockdep_stats: redundant hardirq ons: 7744
> > /proc/lockdep_stats: redundant hardirq offs: 1873736
> > /proc/lockdep_stats: redundant softirq ons: 0
> > /proc/lockdep_stats: redundant softirq offs: 0
> >
> > you could extend the code to trace it. See
> > kernel/lockdep.c:trace_hardirqs_on_caller(). This bit does it:
> >
> > 	if (unlikely(curr->hardirqs_enabled)) {
> > 		debug_atomic_inc(&redundant_hardirqs_on);
> > 		return;
> > 	}
> >
> > you could add a ftrace_printk("redundant hardirq") call right there,
> > enable ftrace, and enable stack tracing of each trace entry:
>
> Quick note. Currently ftrace_printk only shows up if it is set in the
> iter_ctrl.
>
> echo trace_printk > /debug/tracing/iter_ctrl
>
> I'll send out a patch to have it on by default.

ok, agreed - but in that case please make the NOP tracer
tracing_enabled=0 by default. (we'll get a lot of ftrace_printk
tracepoints by default otherwise)

Ingo

2008-11-06 14:47:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: sys_sched_yield keeps locked irq before call of schedule()


On Thu, 6 Nov 2008, Ingo Molnar wrote:
> >
> > Quick note. Currently ftrace_printk only shows up if it is set in the
> > iter_ctrl.
> >
> > echo trace_printk > /debug/tracing/iter_ctrl
> >
> > I'll send out a patch to have it on by default.
>
> ok, agreed - but in that case please make the NOP tracer
> tracing_enabled=0 by default. (we'll get a lot of ftrace_printk
> tracepoints by default otherwise)

Hmm, I think it may be good to leave it on even for the nop tracer.

ftrace_printk calls should not be left in open code unless someone is
actually debugging it. You can use ftrace_printk inside a tracer to
write stuff, but that printk should be protected somehow by the
encompassing tracer.

i.e.

some_tracer(void) {
	if (!tracer_enabled)
		return;
	ftrace_printk("my special stuff\n");
}

I would think that if someone is debugging their code using
ftrace_printk, they probably want it to show up right away, without
having to remember to do a

echo trace_printk > /debug/tracing/iter_ctrl

and still have it show up in the nop tracer.

-- Steve