Hi scheduler hackers,
I'm very occasionally seeing the lockdep warning below on our boxes
running 2.6.39 (PREEMPT=n, so "unfair" _double_lock_balance()). I
think I see the explanation, and it's probably not even worth fixing:
On the unlock side, we have:
static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
        __releases(busiest->lock)
{
        raw_spin_unlock(&busiest->lock);
        lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
}
while on the lock side we have:
static int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
        __releases(this_rq->lock)
        __acquires(busiest->lock)
        __acquires(this_rq->lock)
{
        int ret = 0;

        if (unlikely(!raw_spin_trylock(&busiest->lock))) {
                if (busiest < this_rq) {
                        raw_spin_unlock(&this_rq->lock);
                        raw_spin_lock(&busiest->lock);
                        raw_spin_lock_nested(&this_rq->lock,
                                             SINGLE_DEPTH_NESTING);
                        ret = 1;
                } else
                        raw_spin_lock_nested(&busiest->lock,
                                             SINGLE_DEPTH_NESTING);
        }
        return ret;
}
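(For anyone not steeped in this code: the busiest < this_rq comparison is
the usual address-ordering trick for taking two locks of the same class
without deadlocking, and the _nested annotation tells lockdep the second
same-class lock is intentional. A minimal sketch of the general pattern,
illustrative only and not taken from the tree:)

#include <linux/spinlock.h>

/*
 * Illustrative sketch: two CPUs that each hold one lock and want the
 * other's will both try the lower-addressed lock first, so exactly one
 * of them makes progress and no deadlock is possible. Assumes a != b.
 */
static void double_lock_ordered(raw_spinlock_t *a, raw_spinlock_t *b)
{
        if (a < b) {
                raw_spin_lock(a);
                raw_spin_lock_nested(b, SINGLE_DEPTH_NESTING);
        } else {
                raw_spin_lock(b);
                raw_spin_lock_nested(a, SINGLE_DEPTH_NESTING);
        }
}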
So it seems we have the following (purely lockdep-related) race:
lock side:      if (unlikely(!raw_spin_trylock(&busiest->lock))) { //fail to lock

unlock side:    raw_spin_unlock(&busiest->lock);

lock side:      if (busiest < this_rq) { //not true
                } else
                        raw_spin_lock_nested(&busiest->lock,
                                             SINGLE_DEPTH_NESTING);

unlock side:    lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_); //too late
where we end up trying to take a second lock with SINGLE_DEPTH_NESTING
before we've promoted our first lock to subclass 0.
I'm not sure this is easily fixable (if we flipped the
lock_set_subclass() to before the raw_spin_unlock(), would we
introduce other lockdep problems?). Mostly I want to make sure
my diagnosis is correct and there's not some actual deadlock
that could happen.
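Concretely, the flip I'm wondering about would look like this
(hypothetical and untested, just to show what I mean):

static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
        __releases(busiest->lock)
{
        /*
         * Hypothetical reordering: restore this_rq's subclass-0
         * annotation while busiest->lock is still held, so nobody can
         * observe the stale subclass-1 state after busiest is freed
         * up... assuming that's even how the race works.
         */
        lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
        raw_spin_unlock(&busiest->lock);
}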
So does this make sense?
Thanks,
Roland
Here's the actual lockdep warning:
[89945.638847] =============================================
[89945.638974] [ INFO: possible recursive locking detected ]
[89945.639033] 2.6.39.3-dbg+ #13245
[89945.639079] ---------------------------------------------
[89945.639131] foed/7820 is trying to acquire lock:
[89945.639180] (&rq->lock/1){..-.-.}, at: [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
[89945.639291]
[89945.639292] but task is already holding lock:
[89945.639374] (&rq->lock/1){..-.-.}, at: [<ffffffff8103fa3b>] double_lock_balance+0x7b/0x90
[89945.639480]
[89945.639481] other info that might help us debug this:
[89945.639566] 2 locks held by foed/7820:
[89945.639611] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8156a445>] do_page_fault+0x1f5/0x560
[89945.639717] #1: (&rq->lock/1){..-.-.}, at: [<ffffffff8103fa3b>] double_lock_balance+0x7b/0x90
[89945.639829]
[89945.639830] stack backtrace:
[89945.639908] Pid: 7820, comm: foed Tainted: G W 2.6.39.3-dbg+ #13245
[89945.639965] Call Trace:
[89945.640011] [<ffffffff8109293f>] __lock_acquire+0x153f/0x1da0
[89945.640069] [<ffffffff81009235>] ? native_sched_clock+0x15/0x70
[89945.640126] [<ffffffff810812ef>] ? local_clock+0x6f/0x80
[89945.640179] [<ffffffff8103fa3b>] ? double_lock_balance+0x7b/0x90
[89945.640235] [<ffffffff8109384d>] lock_acquire+0x9d/0x130
[89945.640288] [<ffffffff8103fa1a>] ? double_lock_balance+0x5a/0x90
[89945.640345] [<ffffffff810fe47f>] ? cpupri_find+0xcf/0x140
[89945.640401] [<ffffffff81565ce4>] _raw_spin_lock_nested+0x34/0x70
[89945.640456] [<ffffffff8103fa1a>] ? double_lock_balance+0x5a/0x90
[89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
[89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290
[89945.640621] [<ffffffff8104c7c0>] push_rt_tasks+0x20/0x30
[89945.640674] [<ffffffff8104c83e>] post_schedule_rt+0xe/0x10
[89945.640729] [<ffffffff81562d8c>] schedule+0x53c/0xa70
[89945.640781] [<ffffffff810941c8>] ? mark_held_locks+0x78/0xb0
[89945.640836] [<ffffffff81565b15>] rwsem_down_failed_common+0xb5/0x150
[89945.640893] [<ffffffff8108122d>] ? sched_clock_cpu+0xbd/0x110
[89945.640949] [<ffffffff8108e33d>] ? trace_hardirqs_off+0xd/0x10
[89945.641004] [<ffffffff81565be5>] rwsem_down_read_failed+0x15/0x17
[89945.641061] [<ffffffff812c0a54>] call_rwsem_down_read_failed+0x14/0x30
[89945.641120] [<ffffffff81139565>] ? sys_mprotect+0xe5/0x230
[89945.641174] [<ffffffff81564f1e>] ? down_read+0x7e/0xa0
[89945.641227] [<ffffffff8156a445>] ? do_page_fault+0x1f5/0x560
[89945.641281] [<ffffffff8156a445>] do_page_fault+0x1f5/0x560
[89945.641336] [<ffffffff812bb5fc>] ? rwsem_wake+0x4c/0x60
[89945.641388] [<ffffffff812c0aa7>] ? call_rwsem_wake+0x17/0x30
[89945.641443] [<ffffffff812c0b5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[89945.641501] [<ffffffff81566d25>] page_fault+0x25/0x30
On Wed, 2012-05-16 at 13:00 -0700, Roland Dreier wrote:
> Hi scheduler hackers,
>
> I'm very occasionally seeing the lockdep warning below on our boxes
> running 2.6.39 (PREEMPT=n, so "unfair" _double_lock_balance()). I
> think I see the explanation, and it's probably not even worth fixing:
>
> On the unlock side, we have:
<snip>
> while on the lock side we have:
<snip>
> So it seems we have the following (purely lockdep-related) race:
>
> lock side:      if (unlikely(!raw_spin_trylock(&busiest->lock))) { //fail to lock
>
> unlock side:    raw_spin_unlock(&busiest->lock);
>
> lock side:      if (busiest < this_rq) { //not true
>                 } else
>                         raw_spin_lock_nested(&busiest->lock,
>                                              SINGLE_DEPTH_NESTING);
>
> unlock side:    lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_); //too late
>
> where we end up trying to take a second lock with SINGLE_DEPTH_NESTING
> before we've promoted our first lock to subclass 0.
*phew* you actually made me think there ;-)
Anyway, it all sounds very plausible, which is what threw me, but it's
wrong :-)
The race you describe exists, except that's not how lockdep works. The
two CPUs would be running different tasks (one would hope), and the
held-lock stack is per task. So even if busiest_rq on cpu1 (lock case)
is the same lock as this_rq on cpu0 (unlock case), they're in different
stacks with different states.
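(The per-task state in question, roughly as it appears in that era's
task_struct in include/linux/sched.h:)

#ifdef CONFIG_LOCKDEP
# define MAX_LOCK_DEPTH                 48UL
        u64                             curr_chain_key;
        int                             lockdep_depth;
        unsigned int                    lockdep_recursion;
        struct held_lock                held_locks[MAX_LOCK_DEPTH];
#endif

Each acquire pushes a held_lock entry (class plus subclass) onto the
acquiring task's own stack; none of this is shared between CPUs.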
> So does this make sense?
Almost :-)
> Here's the actual lockdep warning:
> [89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
> [89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290
this is the clue.. if you look at that code you'll find the
double_lock_balance() in question is the one in find_lock_lowest_rq()
[yay for inlining].
Now find_lock_lowest_rq() has a bug.. it fails to use
double_unlock_balance() in one exit path; if this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said issue.
Presumably this is all rather rare..
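Spelled out as a sketch (simplified, not the verbatim source):

        /* inside find_lock_lowest_rq()'s retry loop: */
        if (double_lock_balance(rq, lowest_rq)) {
                /*
                 * rq->lock may have been dropped and re-taken with
                 * raw_spin_lock_nested(), so this task now holds it
                 * annotated as subclass 1.
                 */
                if (unlikely(task_rq(task) != rq ||
                             task_running(rq, task) ||
                             !task->on_rq)) {    /* condition trimmed */

                        raw_spin_unlock(&lowest_rq->lock);
                        /*
                         * Bug: only double_unlock_balance() would have
                         * reset rq->lock back to subclass 0 here.
                         */
                        lowest_rq = NULL;
                        break;
                }
        }

        /*
         * push_rt_task() then retries -> double_lock_balance() again ->
         * raw_spin_lock_nested(..., SINGLE_DEPTH_NESTING) acquires a
         * second &rq->lock at subclass 1 while the stale subclass-1
         * entry is still on this task's held-lock stack, and lockdep
         * reports "possible recursive locking detected".
         */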
Something like this should fix it I think..
---
kernel/sched/rt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c5565c3..b649108 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1556,7 +1556,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 				     task_running(rq, task) ||
 				     !task->on_rq)) {
 
-			raw_spin_unlock(&lowest_rq->lock);
+			double_unlock_balance(rq, lowest_rq);
 			lowest_rq = NULL;
 			break;
 		}
On Thu, 2012-05-17 at 21:19 +0200, Peter Zijlstra wrote:
> Something like this should fix it I think..
>
>
> ---
> kernel/sched/rt.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index c5565c3..b649108 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1556,7 +1556,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>  				     task_running(rq, task) ||
>  				     !task->on_rq)) {
>  
> -			raw_spin_unlock(&lowest_rq->lock);
> +			double_unlock_balance(rq, lowest_rq);
Acked-by: Steven Rostedt <[email protected]>
-- Steve
>  			lowest_rq = NULL;
>  			break;
>  		}
> *phew* you actually made me think there ;-)
Sorry for that.
> The race you describe exists, except that's not how lockdep works. Both
> cpu's would have a different task (one would hope to presume) and the
> held lock stack is per task. So even if busiest_rq on cpu1 (lock case)
> is the same lock as this_rq on cpu0 (unlock case), they're in different
> stacks with different states.
Got it. I confused myself between user processes (which might have
multiple threads on multiple CPUs) and kernel tasks.
> Something like this should fix it I think..
Thanks, I'll throw this in and let you know if we see this again.
But it's very rare so hard to know one way or another for quite
a while.
- R.
Commit-ID: 7f1b43936f0ecad14770634c021cf4a929aec74d
Gitweb: http://git.kernel.org/tip/7f1b43936f0ecad14770634c021cf4a929aec74d
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 17 May 2012 21:19:46 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 6 Jun 2012 16:52:26 +0200
sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
Roland Dreier reported spurious, hard to trigger lockdep warnings
within the scheduler - without any real lockup.
This bit gives us the right clue:
> [89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
> [89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290
if you look at that code you'll find the double_lock_balance() in
question is the one in find_lock_lowest_rq() [yay for inlining].
Now find_lock_lowest_rq() has a bug.. it fails to use
double_unlock_balance() in one exit path; if this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said lockdep confusion.
Reported-by: Roland Dreier <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/rt.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2a4e8df..573e1ca 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1562,7 +1562,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 				     task_running(rq, task) ||
 				     !task->on_rq)) {
 
-			raw_spin_unlock(&lowest_rq->lock);
+			double_unlock_balance(rq, lowest_rq);
 			lowest_rq = NULL;
 			break;
 		}