2009-01-04 17:42:14

by Vaidyanathan Srinivasan

Subject: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

Hi Ingo,

Kernbench runs on the latest Linux git tree stalled with the following
lockdep warning.

Lockdep warning and lockup on Jan 3 Linus git tree
commit 7d3b56ba37a95f1f370f50258ed3954c304c524b

A kernbench run with two threads stalled; sched_mc was zero. This is an
x86_64 system with 8 logical CPUs in a dual-socket, quad-core
configuration.

I will post more information as I debug this warning/bug.

--Vaidy


kernel: =============================================
kernel: [ INFO: possible recursive locking detected ]
kernel: 2.6.28-autotest-sv #1
kernel: ---------------------------------------------
kernel: autotest/6230 is trying to acquire lock:
kernel: (&rq->lock){++..}, at: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel:
kernel: but task is already holding lock:
kernel: (&rq->lock){++..}, at: [<ffffffff805d0fd4>] schedule+0x144/0x9d6
kernel:
kernel: other info that might help us debug this:
kernel: 1 lock held by autotest/6230:
kernel: #0: (&rq->lock){++..}, at: [<ffffffff805d0fd4>] schedule+0x144/0x9d6
kernel:
kernel: stack backtrace:
kernel: Pid: 6230, comm: autotest Not tainted 2.6.28-autotest-sv #1
kernel: Call Trace:
kernel: [<ffffffff802560d8>] __lock_acquire+0xe2a/0x15ed
kernel: [<ffffffff80256920>] lock_acquire+0x85/0xa9
kernel: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff805d3f4c>] _spin_lock+0x2d/0x5f
kernel: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff8022de2e>] try_to_wake_up+0x83/0x26f
kernel: [<ffffffff805d0fd4>] schedule+0x144/0x9d6
kernel: [<ffffffff805d13e8>] schedule+0x558/0x9d6
kernel: [<ffffffff802549c8>] trace_hardirqs_on_caller+0x120/0x14a
kernel: [<ffffffff8024a241>] hrtimer_start_range_ns+0x143/0x156
kernel: [<ffffffff805d2b82>] schedule_hrtimeout_range+0xcc/0xff
kernel: [<ffffffff80249ba0>] hrtimer_wakeup+0x0/0x22
kernel: [<ffffffff805d2b64>] schedule_hrtimeout_range+0xae/0xff
kernel: [<ffffffff802aa6e8>] do_select+0x502/0x55d
kernel: [<ffffffff802aa1e6>] do_select+0x0/0x55d
kernel: [<ffffffff802aac5f>] __pollwait+0x0/0xe5
kernel: [<ffffffff8022e01a>] default_wake_function+0x0/0x9
kernel: [<ffffffff8022e01a>] default_wake_function+0x0/0x9
kernel: [<ffffffff80256882>] __lock_acquire+0x15d4/0x15ed
last message repeated 2 times
kernel: [<ffffffff80252913>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff80252913>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff802aa774>] core_sys_select+0x31/0x2b0
kernel: [<ffffffff802aa92f>] core_sys_select+0x1ec/0x2b0
kernel: [<ffffffff802aa774>] core_sys_select+0x31/0x2b0
kernel: [<ffffffff8024785b>] remove_wait_queue+0x12/0x41
kernel: [<ffffffff80252913>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff805d3dc9>] _spin_unlock_irqrestore+0x3e/0x44
kernel: [<ffffffff802549c8>] trace_hardirqs_on_caller+0x120/0x14a
kernel: [<ffffffff80237c6e>] do_wait+0x2a4/0x395
kernel: [<ffffffff8024cc09>] getnstimeofday+0x38/0x92
kernel: [<ffffffff802aac36>] sys_select+0x92/0xbb
kernel: [<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b


2009-01-04 18:09:19

by Peter Zijlstra

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

On Sun, 2009-01-04 at 23:14 +0530, Vaidyanathan Srinivasan wrote:
> Hi Ingo,
>
> Kernbench runs on latest Linux git tree stalled with the following
> lockdep warning.
>
> Lockdep warning and lockup on Jan 3 Linus git tree
> commit 7d3b56ba37a95f1f370f50258ed3954c304c524b
>
> kernbench run with two threads stalled. sched_mc was zero.
> x86_64 system with 8 logical CPUs in dual socket quad core
> configuration.
>
> I will post more information as I debug this warning/bug.

It's ca109491f612aab5c8152207631c0444f63da97f; I've some ideas on how to
fix this, just haven't gotten around to actually doing anything yet --
seeing how it was the holidays and such.

2009-01-05 04:03:56

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Peter Zijlstra <[email protected]> [2009-01-04 19:08:43]:

> On Sun, 2009-01-04 at 23:14 +0530, Vaidyanathan Srinivasan wrote:
> > Hi Ingo,
> >
> > Kernbench runs on latest Linux git tree stalled with the following
> > lockdep warning.
> >
> > Lockdep warning and lockup on Jan 3 Linus git tree
> > commit 7d3b56ba37a95f1f370f50258ed3954c304c524b
> >
> > kernbench run with two threads stalled. sched_mc was zero.
> > x86_64 system with 8 logical CPUs in dual socket quad core
> > configuration.
> >
> > I will post more information as I debug this warning/bug.
>
> Its ca109491f612aab5c8152207631c0444f63da97f, I've some ideas on how to
> fix this, just haven't gotten around to actually doing anything --
> seeing how it was holidays and such..

Hi Peter,

I can definitely test your fix when you have it. I have an
autotest job that hits this bug.

Here is another trace, if that helps you.

kernel: =============================================
kernel: [ INFO: possible recursive locking detected ]
kernel: 2.6.28-autotest-sv #1
kernel: ---------------------------------------------
kernel: dd/5069 is trying to acquire lock:
kernel: (&rq->lock){++..}, at: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel:
kernel: but task is already holding lock:
kernel: (&rq->lock){++..}, at: [<ffffffff805d0fd4>] schedule+0x144/0x9d6
kernel:
kernel: other info that might help us debug this:
kernel: 2 locks held by dd/5069:
kernel: #0: (&sb->s_type->i_mutex_key#3){--..}, at: [<ffffffff802a3b71>] pipe_write+0x65/0x4f6
kernel: #1: (&rq->lock){++..}, at: [<ffffffff805d0fd4>] schedule+0x144/0x9d6
kernel:
kernel: stack backtrace:
kernel: Pid: 5069, comm: dd Not tainted 2.6.28-autotest-sv #1
kernel: Call Trace:
kernel: [<ffffffff802560d8>] __lock_acquire+0xe2a/0x15ed
kernel: [<ffffffff80279618>] get_page_from_freelist+0x45d/0x531
kernel: [<ffffffff80256920>] lock_acquire+0x85/0xa9
kernel: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff805d3f4c>] _spin_lock+0x2d/0x5f
kernel: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff80229240>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff8022de2e>] try_to_wake_up+0x83/0x26f
kernel: [<ffffffff805d0fd4>] schedule+0x144/0x9d6
kernel: [<ffffffff805d13e8>] schedule+0x558/0x9d6
kernel: [<ffffffff80252913>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff805d2646>] mutex_lock_nested+0x1ae/0x2d1
kernel: [<ffffffff802a3b71>] pipe_write+0x65/0x4f6
kernel: [<ffffffff802a3b71>] pipe_write+0x65/0x4f6
kernel: [<ffffffff80256882>] __lock_acquire+0x15d4/0x15ed
kernel: [<ffffffff802a3b71>] pipe_write+0x65/0x4f6
kernel: [<ffffffff80256882>] __lock_acquire+0x15d4/0x15ed
kernel: [<ffffffff8029d83f>] do_sync_write+0xce/0x113
kernel: [<ffffffff802476e9>] autoremove_wake_function+0x0/0x2e
kernel: [<ffffffff802c2652>] dnotify_parent+0x1f/0x6d
kernel: [<ffffffff80252913>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff8029e085>] vfs_write+0xad/0x136
kernel: [<ffffffff8029e1ca>] sys_write+0x45/0x6e
kernel: [<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b

--Vaidy

2009-01-05 13:06:56

by Ingo Molnar

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock


* Vaidyanathan Srinivasan <[email protected]> wrote:

> * Peter Zijlstra <[email protected]> [2009-01-04 19:08:43]:
>
> > On Sun, 2009-01-04 at 23:14 +0530, Vaidyanathan Srinivasan wrote:
> > > Hi Ingo,
> > >
> > > Kernbench runs on latest Linux git tree stalled with the following
> > > lockdep warning.
> > >
> > > Lockdep warning and lockup on Jan 3 Linus git tree
> > > commit 7d3b56ba37a95f1f370f50258ed3954c304c524b
> > >
> > > kernbench run with two threads stalled. sched_mc was zero.
> > > x86_64 system with 8 logical CPUs in dual socket quad core
> > > configuration.
> > >
> > > I will post more information as I debug this warning/bug.
> >
> > Its ca109491f612aab5c8152207631c0444f63da97f, I've some ideas on how to
> > fix this, just haven't gotten around to actually doing anything --
> > seeing how it was holidays and such..
>
> Hi Peter,
>
> I can definitely test your fix when you have them. I have an
> autotest job that hits this bug.

Could you check the latest tip/master? It has Peter's and Thomas's hrtimer
fixes.

Ingo

2009-01-06 07:12:37

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Ingo Molnar <[email protected]> [2009-01-05 14:06:38]:

>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > * Peter Zijlstra <[email protected]> [2009-01-04 19:08:43]:
> >
> > > On Sun, 2009-01-04 at 23:14 +0530, Vaidyanathan Srinivasan wrote:
> > > > Hi Ingo,
> > > >
> > > > Kernbench runs on latest Linux git tree stalled with the following
> > > > lockdep warning.
> > > >
> > > > Lockdep warning and lockup on Jan 3 Linus git tree
> > > > commit 7d3b56ba37a95f1f370f50258ed3954c304c524b
> > > >
> > > > kernbench run with two threads stalled. sched_mc was zero.
> > > > x86_64 system with 8 logical CPUs in dual socket quad core
> > > > configuration.
> > > >
> > > > I will post more information as I debug this warning/bug.
> > >
> > > Its ca109491f612aab5c8152207631c0444f63da97f, I've some ideas on how to
> > > fix this, just haven't gotten around to actually doing anything --
> > > seeing how it was holidays and such..
> >
> > Hi Peter,
> >
> > I can definitely test your fix when you have them. I have an
> > autotest job that hits this bug.
>
> could you check latest tip/master, it has Peter's and Thomas's hrtimer
> fixes.

Hi Ingo,

I have started a test run with today's sched-tip. I will report the
results.

--Vaidy

2009-01-06 13:59:29

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Ingo Molnar <[email protected]> [2009-01-05 14:06:38]:

>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > * Peter Zijlstra <[email protected]> [2009-01-04 19:08:43]:
> >
> > > On Sun, 2009-01-04 at 23:14 +0530, Vaidyanathan Srinivasan wrote:
> > > > Hi Ingo,
> > > >
> > > > Kernbench runs on latest Linux git tree stalled with the following
> > > > lockdep warning.
> > > >
> > > > Lockdep warning and lockup on Jan 3 Linus git tree
> > > > commit 7d3b56ba37a95f1f370f50258ed3954c304c524b
> > > >
> > > > kernbench run with two threads stalled. sched_mc was zero.
> > > > x86_64 system with 8 logical CPUs in dual socket quad core
> > > > configuration.
> > > >
> > > > I will post more information as I debug this warning/bug.
> > >
> > > Its ca109491f612aab5c8152207631c0444f63da97f, I've some ideas on how to
> > > fix this, just haven't gotten around to actually doing anything --
> > > seeing how it was holidays and such..
> >
> > Hi Peter,
> >
> > I can definitely test your fix when you have them. I have an
> > autotest job that hits this bug.
>
> could you check latest tip/master, it has Peter's and Thomas's hrtimer
> fixes.

Hi Peter,

I queued the same kernbench runs with the latest tip. The run
completed without any lockups, but I still got one lockdep warning.

sched-tip dated Jan 5, commit 4c1ae1dfea7a5fcab3444220a38054dd50c08441

kernel: =============================================
kernel: [ INFO: possible recursive locking detected ]
kernel: 2.6.28-autotest-tip-sv #1
kernel: ---------------------------------------------
kernel: autotest/6012 is trying to acquire lock:
kernel: (&rq->lock){++..}, at: [<ffffffff8022a4dd>] task_rq_lock+0x44/0x7e
kernel:
kernel: but task is already holding lock:
kernel: (&rq->lock){++..}, at: [<ffffffff805d6000>] schedule+0x144/0x9e5
kernel:
kernel: other info that might help us debug this:
kernel: 1 lock held by autotest/6012:
kernel: #0: (&rq->lock){++..}, at: [<ffffffff805d6000>] schedule+0x144/0x9e5
kernel:
kernel: stack backtrace:
kernel: Pid: 6012, comm: autotest Not tainted 2.6.28-autotest-tip-sv #1
kernel: Call Trace:
kernel: [<ffffffff80257a0c>] __lock_acquire+0xe2a/0x15ed
kernel: [<ffffffff80258254>] lock_acquire+0x85/0xa9
kernel: [<ffffffff8022a4dd>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff805d8fac>] _spin_lock+0x2d/0x5f
kernel: [<ffffffff8022a4dd>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff8022a4dd>] task_rq_lock+0x44/0x7e
kernel: [<ffffffff80232947>] try_to_wake_up+0x83/0x26f
kernel: [<ffffffff805d6000>] schedule+0x144/0x9e5
kernel: [<ffffffff805d6415>] schedule+0x559/0x9e5
kernel: [<ffffffff802562fc>] trace_hardirqs_on_caller+0x120/0x14a
kernel: [<ffffffff8024b721>] hrtimer_start_range_ns+0x1e4/0x1f7
kernel: [<ffffffff805d7bda>] schedule_hrtimeout_range+0xcc/0xff
kernel: [<ffffffff8024af3d>] hrtimer_wakeup+0x0/0x22
kernel: [<ffffffff805d7bbc>] schedule_hrtimeout_range+0xae/0xff
kernel: [<ffffffff802ae194>] do_select+0x502/0x55d
kernel: [<ffffffff802adc92>] do_select+0x0/0x55d
kernel: [<ffffffff802ae70b>] __pollwait+0x0/0xe5
kernel: [<ffffffff80232b33>] default_wake_function+0x0/0x9
kernel: [<ffffffff80232b33>] default_wake_function+0x0/0x9
kernel: [<ffffffff80254211>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff802581b6>] __lock_acquire+0x15d4/0x15ed
last message repeated 2 times
kernel: [<ffffffff80254211>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff80254211>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff802ae220>] core_sys_select+0x31/0x2b0
kernel: [<ffffffff802ae3db>] core_sys_select+0x1ec/0x2b0
kernel: [<ffffffff802ae220>] core_sys_select+0x31/0x2b0
kernel: [<ffffffff80248d47>] remove_wait_queue+0x12/0x4d
kernel: [<ffffffff80254211>] lock_release_holdtime+0x1b/0x16c
kernel: [<ffffffff805d8e29>] _spin_unlock_irqrestore+0x3e/0x44
kernel: [<ffffffff802562fc>] trace_hardirqs_on_caller+0x120/0x14a
kernel: [<ffffffff8023909b>] do_wait+0x2a4/0x395
kernel: [<ffffffff8024ec91>] getnstimeofday+0x56/0xb0
kernel: [<ffffffff802ae6e2>] sys_select+0x92/0xbb
kernel: [<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b

I will rerun and check again.

--Vaidy

2009-01-07 11:47:22

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Ingo Molnar <[email protected]> [2009-01-05 14:06:38]:

>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > * Peter Zijlstra <[email protected]> [2009-01-04 19:08:43]:
> >
> > > On Sun, 2009-01-04 at 23:14 +0530, Vaidyanathan Srinivasan wrote:
> > > > Hi Ingo,
> > > >
> > > > Kernbench runs on latest Linux git tree stalled with the following
> > > > lockdep warning.
> > > >
> > > > Lockdep warning and lockup on Jan 3 Linus git tree
> > > > commit 7d3b56ba37a95f1f370f50258ed3954c304c524b
> > > >
> > > > kernbench run with two threads stalled. sched_mc was zero.
> > > > x86_64 system with 8 logical CPUs in dual socket quad core
> > > > configuration.
> > > >
> > > > I will post more information as I debug this warning/bug.
> > >
> > > Its ca109491f612aab5c8152207631c0444f63da97f, I've some ideas on how to
> > > fix this, just haven't gotten around to actually doing anything --
> > > seeing how it was holidays and such..
> >
> > Hi Peter,
> >
> > I can definitely test your fix when you have them. I have an
> > autotest job that hits this bug.
>
> could you check latest tip/master, it has Peter's and Thomas's hrtimer
> fixes.

Hi Peter,

I still get the following warning during the first kernel build (kernbench
run), but the entire set of tests completes without any lockups.

Please let me know if these traces make sense. I have CONFIG_FRAME_POINTER=y
in these runs.

Test run on sched-tip on Jan 5 at commit 4c1ae1dfea7a5fcab3444220a38054dd50c08441

kernel: =============================================
kernel: [ INFO: possible recursive locking detected ]
kernel: 2.6.28-autotest-tip-sv #1
kernel: ---------------------------------------------
kernel: autotest/6019 is trying to acquire lock:
kernel: (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
kernel:
kernel: but task is already holding lock:
kernel: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
kernel:
kernel: other info that might help us debug this:
kernel: 1 lock held by autotest/6019:
kernel: #0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
kernel:
kernel: stack backtrace:
kernel: Pid: 6019, comm: autotest Not tainted 2.6.28-autotest-tip-sv #1
kernel: Call Trace:
kernel: [<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
kernel: [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
kernel: [<ffffffff8025a761>] lock_acquire+0x85/0xa9
kernel: [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
kernel: [<ffffffff805fa4d4>] _spin_lock+0x31/0x66
kernel: [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
kernel: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
kernel: [<ffffffff80233363>] try_to_wake_up+0x88/0x27a
kernel: [<ffffffff80233581>] wake_up_process+0x10/0x12
kernel: [<ffffffff805f775c>] schedule+0x560/0xa31
kernel: [<ffffffff805fa330>] ? _spin_unlock_irqrestore+0x3f/0x47
kernel: [<ffffffff80258701>] ? trace_hardirqs_on_caller+0x11f/0x14a
kernel: [<ffffffff8024d3dc>] ? hrtimer_start_range_ns+0x1ef/0x201
kernel: [<ffffffff805f8fd6>] schedule_hrtimeout_range+0xd1/0x106
kernel: [<ffffffff8024cb9a>] ? hrtimer_wakeup+0x0/0x21
kernel: [<ffffffff805f8fbb>] ? schedule_hrtimeout_range+0xb6/0x106
kernel: [<ffffffff802b3d34>] do_select+0x55a/0x5b6
kernel: [<ffffffff802b37da>] ? do_select+0x0/0x5b6
kernel: [<ffffffff802b42aa>] ? __pollwait+0x0/0xe2
kernel: [<ffffffff80233555>] ? default_wake_function+0x0/0xf
kernel: [<ffffffff80233555>] ? default_wake_function+0x0/0xf
kernel: [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
last message repeated 2 times
kernel: [<ffffffff805fa35e>] ? _spin_unlock+0x26/0x2a
kernel: [<ffffffff802b3dcd>] ? core_sys_select+0x3d/0x2b4
kernel: [<ffffffff802b3f7a>] core_sys_select+0x1ea/0x2b4
kernel: [<ffffffff802b3dcd>] ? core_sys_select+0x3d/0x2b4
kernel: [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
kernel: [<ffffffff805fa330>] ? _spin_unlock_irqrestore+0x3f/0x47
kernel: [<ffffffff80258701>] ? trace_hardirqs_on_caller+0x11f/0x14a
kernel: [<ffffffff80258739>] ? trace_hardirqs_on+0xd/0xf
kernel: [<ffffffff8024a865>] ? remove_wait_queue+0x48/0x4d
kernel: [<ffffffff80250c7a>] ? getnstimeofday+0x58/0xb4
kernel: [<ffffffff8024d5d4>] ? ktime_get_ts+0x49/0x4e
kernel: [<ffffffff802b4282>] sys_select+0x94/0xbc
kernel: [<ffffffff8020c05b>] system_call_fastpath+0x16/0x1b
kernel: hardirqs last enabled at (19261): [<ffffffff805fa330>] _spin_unlock_irqrestore+0x3f/0x47
kernel: hardirqs last disabled at (19260): [<ffffffff805fa5f8>] _spin_lock_irqsave+0x19/0x74
kernel: softirqs last enabled at (18668): [<ffffffff8023c30a>] __do_softirq+0x154/0x162
kernel: softirqs last disabled at (18657): [<ffffffff8020d1dc>] call_softirq+0x1c/0x34

--Vaidy

2009-01-07 12:26:35

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Vaidyanathan Srinivasan <[email protected]> [2009-01-07 17:19:47]:

> * Ingo Molnar <[email protected]> [2009-01-05 14:06:38]:
>
> >
> > * Vaidyanathan Srinivasan <[email protected]> wrote:
> >
> > > * Peter Zijlstra <[email protected]> [2009-01-04 19:08:43]:
> > >
> > > > On Sun, 2009-01-04 at 23:14 +0530, Vaidyanathan Srinivasan wrote:
> > > > > Hi Ingo,
> > > > >
> > > > > Kernbench runs on latest Linux git tree stalled with the following
> > > > > lockdep warning.
> > > > >
> > > > > Lockdep warning and lockup on Jan 3 Linus git tree
> > > > > commit 7d3b56ba37a95f1f370f50258ed3954c304c524b
> > > > >
> > > > > kernbench run with two threads stalled. sched_mc was zero.
> > > > > x86_64 system with 8 logical CPUs in dual socket quad core
> > > > > configuration.
> > > > >
> > > > > I will post more information as I debug this warning/bug.
> > > >
> > > > Its ca109491f612aab5c8152207631c0444f63da97f, I've some ideas on how to
> > > > fix this, just haven't gotten around to actually doing anything --
> > > > seeing how it was holidays and such..
> > >
> > > Hi Peter,
> > >
> > > I can definitely test your fix when you have them. I have an
> > > autotest job that hits this bug.
> >
> > could you check latest tip/master, it has Peter's and Thomas's hrtimer
> > fixes.
>
> Hi Peter,
>
> I still get the following warning at the first kernel build (kernbench
> run) but the entire tests complete without any lockups.
>
> Please let me know if these make sense. I have CONFIG_FRAME_POINTER=y
> in these runs.
>
> Test run on sched-tip on Jan 5 at commit 4c1ae1dfea7a5fcab3444220a38054dd50c08441

Another one...

=============================================
[ INFO: possible recursive locking detected ]
2.6.28-autotest-tip-sv #1
---------------------------------------------
klogd/5062 is trying to acquire lock:
(&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e

but task is already holding lock:
(&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31

other info that might help us debug this:
1 lock held by klogd/5062:
#0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31

stack backtrace:
Pid: 5062, comm: klogd Not tainted 2.6.28-autotest-tip-sv #1
Call Trace:
[<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
[<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
[<ffffffff8025a761>] lock_acquire+0x85/0xa9
[<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
[<ffffffff805fa4d4>] _spin_lock+0x31/0x66
[<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
[<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
[<ffffffff80233363>] try_to_wake_up+0x88/0x27a
[<ffffffff80233581>] wake_up_process+0x10/0x12
[<ffffffff805f775c>] schedule+0x560/0xa31
[<ffffffff805f805f>] schedule_timeout+0x22/0xb4
[<ffffffff805fa35e>] ? _spin_unlock+0x26/0x2a
[<ffffffff805961e0>] unix_wait_for_peer+0x9c/0xbb
[<ffffffff8024a686>] ? autoremove_wake_function+0x0/0x38
[<ffffffff805fa500>] ? _spin_lock+0x5d/0x66
[<ffffffff8059679e>] unix_dgram_sendmsg+0x3a8/0x4a3
[<ffffffff8053661f>] sock_aio_write+0x107/0x117
[<ffffffff802a6377>] do_sync_write+0xe7/0x12d
[<ffffffff802a98c6>] ? cp_new_stat+0xe2/0xef
[<ffffffff8024a686>] ? autoremove_wake_function+0x0/0x38
[<ffffffff802a6c23>] vfs_write+0xc1/0x137
[<ffffffff802a6d5d>] sys_write+0x47/0x70
[<ffffffff8020c05b>] system_call_fastpath+0x16/0x1b


I have debug info enabled, so I can look up any address you would like.

--Vaidy

2009-01-07 13:20:47

by Peter Zijlstra

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

On Wed, 2009-01-07 at 17:59 +0530, Vaidyanathan Srinivasan wrote:

> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.28-autotest-tip-sv #1
> ---------------------------------------------
> klogd/5062 is trying to acquire lock:
> (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
>
> but task is already holding lock:
> (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
>
> other info that might help us debug this:
> 1 lock held by klogd/5062:
> #0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
>
> stack backtrace:
> Pid: 5062, comm: klogd Not tainted 2.6.28-autotest-tip-sv #1
> Call Trace:
> [<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
> [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
> [<ffffffff8025a761>] lock_acquire+0x85/0xa9
> [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> [<ffffffff805fa4d4>] _spin_lock+0x31/0x66
> [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> [<ffffffff80233363>] try_to_wake_up+0x88/0x27a
> [<ffffffff80233581>] wake_up_process+0x10/0x12
> [<ffffffff805f775c>] schedule+0x560/0xa31

I'd be most curious to know where in schedule we are.

Thanks

2009-01-07 14:20:04

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Peter Zijlstra <[email protected]> [2009-01-07 14:12:43]:

> On Wed, 2009-01-07 at 17:59 +0530, Vaidyanathan Srinivasan wrote:
>
> > =============================================
> > [ INFO: possible recursive locking detected ]
> > 2.6.28-autotest-tip-sv #1
> > ---------------------------------------------
> > klogd/5062 is trying to acquire lock:
> > (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> >
> > but task is already holding lock:
> > (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> >
> > other info that might help us debug this:
> > 1 lock held by klogd/5062:
> > #0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> >
> > stack backtrace:
> > Pid: 5062, comm: klogd Not tainted 2.6.28-autotest-tip-sv #1
> > Call Trace:
> > [<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
> > [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
> > [<ffffffff8025a761>] lock_acquire+0x85/0xa9
> > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > [<ffffffff805fa4d4>] _spin_lock+0x31/0x66
> > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > [<ffffffff80233363>] try_to_wake_up+0x88/0x27a
> > [<ffffffff80233581>] wake_up_process+0x10/0x12
> > [<ffffffff805f775c>] schedule+0x560/0xa31
>
> I'd be most curious to know where in schedule we are.

OK, we are at sched.c:3777:

			double_unlock_balance(this_rq, busiest);
			if (active_balance)
>>>>>>>>>>>			wake_up_process(busiest->migration_thread);

		} else

This is the active-balance path in load_balance_newidle(), which implies
sched_mc was 2 at that time. Let me trace this and debug further.

--Vaidy
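To make the nesting concrete: lockdep tracks lock classes rather than
individual lock instances, so when try_to_wake_up() takes the migration
thread's rq->lock via task_rq_lock() while schedule() still holds
this_rq->lock, it looks like recursion on &rq->lock even though two
different runqueues are involved. Below is a minimal userspace sketch of
that call chain (plain C with pthread spinlocks, not 2.6.28 kernel code;
the two-runqueue layout and the *_model names are purely illustrative):

/*
 * Userspace model of the nesting flagged above -- not kernel code.
 * schedule() holds this_rq->lock; the newidle active-balance path then
 * calls wake_up_process(), whose task_rq_lock() takes the migration
 * thread's rq->lock: a different runqueue instance, but the same lock
 * class, which is what lockdep keys on.
 */
#include <pthread.h>
#include <stdio.h>

struct rq {
	int cpu;
	pthread_spinlock_t lock;	/* stands in for the per-CPU rq->lock */
};

static struct rq runqueues[2];

/* Models try_to_wake_up()/task_rq_lock(): lock the target task's rq. */
static void wake_up_process_model(struct rq *task_rq)
{
	pthread_spin_lock(&task_rq->lock);
	printf("  ttwu: locked rq[%d]->lock (same class, different instance)\n",
	       task_rq->cpu);
	pthread_spin_unlock(&task_rq->lock);
}

/* Models the tail of load_balance_newidle() with this_rq->lock held. */
static void newidle_balance_model(struct rq *this_rq, struct rq *busiest)
{
	pthread_spin_lock(&this_rq->lock);	/* held since schedule() */
	printf("schedule: holding rq[%d]->lock\n", this_rq->cpu);

	/* active_balance path: wake busiest's migration thread. */
	wake_up_process_model(busiest);

	pthread_spin_unlock(&this_rq->lock);
}

int main(void)
{
	for (int i = 0; i < 2; i++) {
		runqueues[i].cpu = i;
		pthread_spin_init(&runqueues[i].lock, PTHREAD_PROCESS_PRIVATE);
	}
	newidle_balance_model(&runqueues[0], &runqueues[1]);
	return 0;
}

With real lock instances the sequence completes without deadlocking, which
matches the observation later in the thread that this is a class-level
lockdep warning rather than an actual deadlock.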

2009-01-07 14:29:31

by Peter Zijlstra

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

On Wed, 2009-01-07 at 19:50 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2009-01-07 14:12:43]:
>
> > On Wed, 2009-01-07 at 17:59 +0530, Vaidyanathan Srinivasan wrote:
> >
> > > =============================================
> > > [ INFO: possible recursive locking detected ]
> > > 2.6.28-autotest-tip-sv #1
> > > ---------------------------------------------
> > > klogd/5062 is trying to acquire lock:
> > > (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > >
> > > but task is already holding lock:
> > > (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > >
> > > other info that might help us debug this:
> > > 1 lock held by klogd/5062:
> > > #0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > >
> > > stack backtrace:
> > > Pid: 5062, comm: klogd Not tainted 2.6.28-autotest-tip-sv #1
> > > Call Trace:
> > > [<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
> > > [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
> > > [<ffffffff8025a761>] lock_acquire+0x85/0xa9
> > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > [<ffffffff805fa4d4>] _spin_lock+0x31/0x66
> > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > > [<ffffffff80233363>] try_to_wake_up+0x88/0x27a
> > > [<ffffffff80233581>] wake_up_process+0x10/0x12
> > > [<ffffffff805f775c>] schedule+0x560/0xa31
> >
> > I'd be most curious to know where in schedule we are.
>
> ok, we are in sched.c:3777
>
> double_unlock_balance(this_rq, busiest);
> if (active_balance)
> >>>>>>>>>>> wake_up_process(busiest->migration_thread);
>
> } else
>
> In active balance in newidle. This implies sched_mc was 2 at that time.
> let me trace this and debug further.

How about something like this? Strictly speaking we'll not deadlock,
because ttwu will not be able to place the migration task on our rq, but
since the code can deal with both rqs getting unlocked, this seems the
easiest way out.

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3777,8 +3777,13 @@ redo:
 			}
 
 			double_unlock_balance(this_rq, busiest);
+			/*
+			 * Should not call ttwu while holding a rq->lock
+			 */
+			spin_unlock(&this_rq->lock);
 			if (active_balance)
 				wake_up_process(busiest->migration_thread);
+			spin_lock(&this_rq->lock);
 
 		} else
 			sd->nr_balance_failed = 0;

2009-01-07 16:33:27

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Peter Zijlstra <[email protected]> [2009-01-07 15:28:57]:

> On Wed, 2009-01-07 at 19:50 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <[email protected]> [2009-01-07 14:12:43]:
> >
> > > On Wed, 2009-01-07 at 17:59 +0530, Vaidyanathan Srinivasan wrote:
> > >
> > > > =============================================
> > > > [ INFO: possible recursive locking detected ]
> > > > 2.6.28-autotest-tip-sv #1
> > > > ---------------------------------------------
> > > > klogd/5062 is trying to acquire lock:
> > > > (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > > >
> > > > but task is already holding lock:
> > > > (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > > >
> > > > other info that might help us debug this:
> > > > 1 lock held by klogd/5062:
> > > > #0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > > >
> > > > stack backtrace:
> > > > Pid: 5062, comm: klogd Not tainted 2.6.28-autotest-tip-sv #1
> > > > Call Trace:
> > > > [<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
> > > > [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
> > > > [<ffffffff8025a761>] lock_acquire+0x85/0xa9
> > > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > > [<ffffffff805fa4d4>] _spin_lock+0x31/0x66
> > > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > > [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > > > [<ffffffff80233363>] try_to_wake_up+0x88/0x27a
> > > > [<ffffffff80233581>] wake_up_process+0x10/0x12
> > > > [<ffffffff805f775c>] schedule+0x560/0xa31
> > >
> > > I'd be most curious to know where in schedule we are.
> >
> > ok, we are in sched.c:3777
> >
> > double_unlock_balance(this_rq, busiest);
> > if (active_balance)
> > >>>>>>>>>>> wake_up_process(busiest->migration_thread);
> >
> > } else
> >
> > In active balance in newidle. This implies sched_mc was 2 at that time.
> > let me trace this and debug further.
>
> How about something like this? Strictly speaking we'll not deadlock,
> because ttwu will not be able to place the migration task on our rq, but
> since the code can deal with both rqs getting unlocked, this seems the
> easiest way out.

Hi Peter,

I agree. Unlocking this_rq is an easy way out. Thanks for the
suggestion. I have moved the unlock and lock within the if
condition.

--Vaidy

sched: bug fix -- do not call ttwu while holding rq->lock

When sched_mc=2, wake_up_process() is called on busiest_rq
while holding the this_rq lock in load_balance_newidle().
Though this will not deadlock, it triggers a lockdep warning,
and the situation is easily solved by releasing the this_rq
lock at this point in the code.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 71a054f..703a669 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3773,8 +3773,12 @@ redo:
 			}
 
 			double_unlock_balance(this_rq, busiest);
-			if (active_balance)
+			if (active_balance) {
+				/* Should not call ttwu while holding a rq->lock */
+				spin_unlock(&this_rq->lock);
 				wake_up_process(busiest->migration_thread);
+				spin_lock(&this_rq->lock);
+			}
 
 		} else
 			sd->nr_balance_failed = 0;

2009-01-07 18:06:59

by Vaidyanathan Srinivasan

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock

* Vaidyanathan Srinivasan <[email protected]> [2009-01-07 22:01:00]:

> * Peter Zijlstra <[email protected]> [2009-01-07 15:28:57]:
>
> > On Wed, 2009-01-07 at 19:50 +0530, Vaidyanathan Srinivasan wrote:
> > > * Peter Zijlstra <[email protected]> [2009-01-07 14:12:43]:
> > >
> > > > On Wed, 2009-01-07 at 17:59 +0530, Vaidyanathan Srinivasan wrote:
> > > >
> > > > > =============================================
> > > > > [ INFO: possible recursive locking detected ]
> > > > > 2.6.28-autotest-tip-sv #1
> > > > > ---------------------------------------------
> > > > > klogd/5062 is trying to acquire lock:
> > > > > (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > > > >
> > > > > but task is already holding lock:
> > > > > (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > > > >
> > > > > other info that might help us debug this:
> > > > > 1 lock held by klogd/5062:
> > > > > #0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > > > >
> > > > > stack backtrace:
> > > > > Pid: 5062, comm: klogd Not tainted 2.6.28-autotest-tip-sv #1
> > > > > Call Trace:
> > > > > [<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
> > > > > [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
> > > > > [<ffffffff8025a761>] lock_acquire+0x85/0xa9
> > > > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > > > [<ffffffff805fa4d4>] _spin_lock+0x31/0x66
> > > > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > > > [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > > > > [<ffffffff80233363>] try_to_wake_up+0x88/0x27a
> > > > > [<ffffffff80233581>] wake_up_process+0x10/0x12
> > > > > [<ffffffff805f775c>] schedule+0x560/0xa31
> > > >
> > > > I'd be most curious to know where in schedule we are.
> > >
> > > ok, we are in sched.c:3777
> > >
> > > double_unlock_balance(this_rq, busiest);
> > > if (active_balance)
> > > >>>>>>>>>>> wake_up_process(busiest->migration_thread);
> > >
> > > } else
> > >
> > > In active balance in newidle. This implies sched_mc was 2 at that time.
> > > let me trace this and debug further.
> >
> > How about something like this? Strictly speaking we'll not deadlock,
> > because ttwu will not be able to place the migration task on our rq, but
> > since the code can deal with both rqs getting unlocked, this seems the
> > easiest way out.
>
> Hi Peter,
>
> I agree. Unlocking this_rq is an easy way out. Thanks for the
> suggestion. I have moved the unlock and lock withing the if
> condition.
>
> --Vaidy
>
> sched: bug fix -- do not call ttwu while holding rq->lock
>
> When sched_mc=2 wake_up_process() is called on busiest_rq
> while holding this_rq lock in load_balance_newidle()
> Though this will not deadlock, this is a lockdep warning
> and the situation is easily solved by releasing the this_rq
> lock at this point in code
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 71a054f..703a669 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3773,8 +3773,12 @@ redo:
> }
>
> double_unlock_balance(this_rq, busiest);
> - if (active_balance)
> + if (active_balance) {
> + /* Should not call ttwu while holding a rq->lock */
> + spin_unlock(&this_rq->lock);
> wake_up_process(busiest->migration_thread);
> + spin_lock(&this_rq->lock);
> + }
>
> } else
> sd->nr_balance_failed = 0;


Hi Peter and Ingo,

The above fix seems to have resolved the lockdep warning. Please include
it in sched-tip for further testing and a later push to mainline.

Thanks,
Vaidy

2009-01-07 18:12:57

by Ingo Molnar

Subject: Re: [BUG] 2.6.28-git LOCKDEP: Possible recursive rq->lock


* Vaidyanathan Srinivasan <[email protected]> wrote:

> * Vaidyanathan Srinivasan <[email protected]> [2009-01-07 22:01:00]:
>
> > * Peter Zijlstra <[email protected]> [2009-01-07 15:28:57]:
> >
> > > On Wed, 2009-01-07 at 19:50 +0530, Vaidyanathan Srinivasan wrote:
> > > > * Peter Zijlstra <[email protected]> [2009-01-07 14:12:43]:
> > > >
> > > > > On Wed, 2009-01-07 at 17:59 +0530, Vaidyanathan Srinivasan wrote:
> > > > >
> > > > > > =============================================
> > > > > > [ INFO: possible recursive locking detected ]
> > > > > > 2.6.28-autotest-tip-sv #1
> > > > > > ---------------------------------------------
> > > > > > klogd/5062 is trying to acquire lock:
> > > > > > (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > > > > >
> > > > > > but task is already holding lock:
> > > > > > (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > > > > >
> > > > > > other info that might help us debug this:
> > > > > > 1 lock held by klogd/5062:
> > > > > > #0: (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
> > > > > >
> > > > > > stack backtrace:
> > > > > > Pid: 5062, comm: klogd Not tainted 2.6.28-autotest-tip-sv #1
> > > > > > Call Trace:
> > > > > > [<ffffffff80259ef1>] __lock_acquire+0xeb9/0x16a4
> > > > > > [<ffffffff8025a6c0>] ? __lock_acquire+0x1688/0x16a4
> > > > > > [<ffffffff8025a761>] lock_acquire+0x85/0xa9
> > > > > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > > > > [<ffffffff805fa4d4>] _spin_lock+0x31/0x66
> > > > > > [<ffffffff8022aca2>] ? task_rq_lock+0x45/0x7e
> > > > > > [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
> > > > > > [<ffffffff80233363>] try_to_wake_up+0x88/0x27a
> > > > > > [<ffffffff80233581>] wake_up_process+0x10/0x12
> > > > > > [<ffffffff805f775c>] schedule+0x560/0xa31
> > > > >
> > > > > I'd be most curious to know where in schedule we are.
> > > >
> > > > ok, we are in sched.c:3777
> > > >
> > > > double_unlock_balance(this_rq, busiest);
> > > > if (active_balance)
> > > > >>>>>>>>>>> wake_up_process(busiest->migration_thread);
> > > >
> > > > } else
> > > >
> > > > In active balance in newidle. This implies sched_mc was 2 at that time.
> > > > let me trace this and debug further.
> > >
> > > How about something like this? Strictly speaking we'll not deadlock,
> > > because ttwu will not be able to place the migration task on our rq, but
> > > since the code can deal with both rqs getting unlocked, this seems the
> > > easiest way out.
> >
> > Hi Peter,
> >
> > I agree. Unlocking this_rq is an easy way out. Thanks for the
> > suggestion. I have moved the unlock and lock withing the if
> > condition.
> >
> > --Vaidy
> >
> > sched: bug fix -- do not call ttwu while holding rq->lock
> >
> > When sched_mc=2 wake_up_process() is called on busiest_rq
> > while holding this_rq lock in load_balance_newidle()
> > Though this will not deadlock, this is a lockdep warning
> > and the situation is easily solved by releasing the this_rq
> > lock at this point in code
> >
> > Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 71a054f..703a669 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3773,8 +3773,12 @@ redo:
> > }
> >
> > double_unlock_balance(this_rq, busiest);
> > - if (active_balance)
> > + if (active_balance) {
> > + /* Should not call ttwu while holding a rq->lock */
> > + spin_unlock(&this_rq->lock);
> > wake_up_process(busiest->migration_thread);
> > + spin_lock(&this_rq->lock);
> > + }
> >
> > } else
> > sd->nr_balance_failed = 0;
>
>
> Hi Peter and Ingo,
>
> The above fix seem to have fixed the lockdep warning. Please include
> in sched-tip for further testing and later push to mainline.

already in tip/sched/urgent, thanks guys!

Ingo