2013-07-11 09:44:16

by Rolf Eike Beer

[permalink] [raw]
Subject: Hard lockups using 3.10.0

Hi,

I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).

I'm not aware that I had done anything special, just "normal" desktop and
development usage, but no heavy compile work at the moment the lockups
happened.

Any ideas?

Eike


Attachments:
lockup.jpg (260.10 kB)

2013-07-11 10:07:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> Hi,
>
> I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
>
> I'm not aware that I had done anything special, just "normal" desktop and
> development usage, but no heavy compile work at the moment the lockups
> happened.

Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
calling into the scheduler which screams about a cpu runqueue of the
task we're about to reschedule not being locked. Let's add some more
people who should know better.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-07-11 10:17:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > Hi,
> >
> > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> >
> > I'm not aware that I had done anything special, just "normal" desktop and
> > development usage, but no heavy compile work at the moment the lockups
> > happened.
>
> Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> calling into the scheduler which screams about a cpu runqueue of the
> task we're about to reschedule not being locked. Let's add some more
> people who should know better.

-ENOIMAGE

2013-07-11 10:52:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > Hi,
> >
> > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> >
> > I'm not aware that I had done anything special, just "normal" desktop and
> > development usage, but no heavy compile work at the moment the lockups
> > happened.
>
> Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> calling into the scheduler which screams about a cpu runqueue of the
> task we're about to reschedule not being locked. Let's add some more
> people who should know better.

Ok, for the other people too lazy to bother finding the picture:

http://marc.info/?l=linux-kernel&m=137353587012001&q=p3

So we bug at:

kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);

and get there through:

resched_task()
check_preempt_wakeup()
check_preempt_curr()
try_to_wake_up()
autoremove_wake_function()
__call_rcu_nocb_enqueue()
__call_rcu()
commit_creds()
____call_usermodehelper()
ret_from_fork()

That don't make much sense though. Since:

try_to_wake_up()
ttwu_queue()
raw_spin_lock(&rq->lock)
ttwu_do_activate()
ttwu_do_wakeup()
check_preempt_curr()
check_preempt_wakeup()
resched_task(rq->curr)
assert_raw_spin_locked(task_rq(p)->lock)

It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
bonkers, we do after all have rq->lock locked.

I must also say that I've _never_ seen this bug before.

2013-07-11 17:50:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

On Thu, Jul 11, 2013 at 12:52:07PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > Hi,
> > >
> > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> > > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> > > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> > > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> > >
> > > I'm not aware that I had done anything special, just "normal" desktop and
> > > development usage, but no heavy compile work at the moment the lockups
> > > happened.
> >
> > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > calling into the scheduler which screams about a cpu runqueue of the
> > task we're about to reschedule not being locked. Let's add some more
> > people who should know better.
>
> Ok, for the other people too lazy to bother finding the picture:
>
> http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
>
> So we bug at:
>
> kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
>
> and get there through:
>
> resched_task()
> check_preempt_wakeup()
> check_preempt_curr()
> try_to_wake_up()
> autoremove_wake_function()
> __call_rcu_nocb_enqueue()
> __call_rcu()
> commit_creds()
> ____call_usermodehelper()
> ret_from_fork()
>
> That don't make much sense though. Since:
>
> try_to_wake_up()
> ttwu_queue()
> raw_spin_lock(&rq->lock)
> ttwu_do_activate()
> ttwu_do_wakeup()
> check_preempt_curr()
> check_preempt_wakeup()
> resched_task(rq->curr)
> assert_raw_spin_locked(task_rq(p)->lock)
>
> It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> bonkers, we do after all have rq->lock locked.
>
> I must also say that I've _never_ seen this bug before.

New one on me as well. Is this reproducible? If so, does it happen
when CONFIG_RCU_NOCB_CPU=n? (Given the call to call_rcu_nocb_enqueue(),
I expect that you built with CONFIG_RCU_NOCB_CPU=y.) Can't say that I
see how call_rcu_nocb_enqueue() would have caused this, but...

Well, I supposed that if RCU's callback lists got corrupted, this
(and much else besides) could in fact happen. Does your build have
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y? If not, could you please try it?

Thanx, Paul

2013-07-11 19:03:03

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

Paul E. McKenney wrote:
> On Thu, Jul 11, 2013 at 12:52:07PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > > Hi,
> > > >
> > > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM)
> > > > i7-2600 CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice,
> > > > once with backtrace (see attached image). Graphics is the builtin
> > > > Intel, used with X 7.6 and KDE 4.10beta2 (basically current openSUSE
> > > > 12.3+KDE).
> > > >
> > > > I'm not aware that I had done anything special, just "normal" desktop
> > > > and
> > > > development usage, but no heavy compile work at the moment the lockups
> > > > happened.
> > >
> > > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > > calling into the scheduler which screams about a cpu runqueue of the
> > > task we're about to reschedule not being locked. Let's add some more
> > > people who should know better.
> >
> > Ok, for the other people too lazy to bother finding the picture:
> > http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
> >
> > So we bug at:
> >
> > kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
> >
> > and get there through:
> > resched_task()
> > check_preempt_wakeup()
> > check_preempt_curr()
> > try_to_wake_up()
> > autoremove_wake_function()
> > __call_rcu_nocb_enqueue()
> > __call_rcu()
> > commit_creds()
> > ____call_usermodehelper()
> > ret_from_fork()
> >
> > That don't make much sense though. Since:
> > try_to_wake_up()
> >
> > ttwu_queue()
> >
> > raw_spin_lock(&rq->lock)
> > ttwu_do_activate()
> >
> > ttwu_do_wakeup()
> >
> > check_preempt_curr()
> >
> > check_preempt_wakeup()
> >
> > resched_task(rq->curr)
> >
> > assert_raw_spin_locked(task_rq(p)->lock)
> >
> > It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> > bonkers, we do after all have rq->lock locked.
> >
> > I must also say that I've _never_ seen this bug before.
>
> New one on me as well. Is this reproducible? If so, does it happen
> when CONFIG_RCU_NOCB_CPU=n? (Given the call to call_rcu_nocb_enqueue(),
> I expect that you built with CONFIG_RCU_NOCB_CPU=y.) Can't say that I
> see how call_rcu_nocb_enqueue() would have caused this, but...
>
> Well, I supposed that if RCU's callback lists got corrupted, this
> (and much else besides) could in fact happen. Does your build have
> CONFIG_DEBUG_OBJECTS_RCU_HEAD=y? If not, could you please try it?

I will look tomorrow. This is a "standard" openSUSE kernel RPM, dunno right
now which repository. It is not really reproducible, it suddenly happened
again today but this time without backtrace.

Eike


Attachments:
signature.asc (198.00 B)
This is a digitally signed message part.

2013-08-11 06:09:38

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > Hi,
> > >
> > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM)
> > > i7-2600 CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice,
> > > once with backtrace (see attached image). Graphics is the builtin
> > > Intel, used with X 7.6 and KDE 4.10beta2 (basically current openSUSE
> > > 12.3+KDE).
> > >
> > > I'm not aware that I had done anything special, just "normal" desktop
> > > and
> > > development usage, but no heavy compile work at the moment the lockups
> > > happened.
> >
> > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > calling into the scheduler which screams about a cpu runqueue of the
> > task we're about to reschedule not being locked. Let's add some more
> > people who should know better.
>
> Ok, for the other people too lazy to bother finding the picture:
>
> http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
>
> So we bug at:
>
> kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
>
> and get there through:
>
> resched_task()
> check_preempt_wakeup()
> check_preempt_curr()
> try_to_wake_up()
> autoremove_wake_function()
> __call_rcu_nocb_enqueue()
> __call_rcu()
> commit_creds()
> ____call_usermodehelper()
> ret_from_fork()
>
> That don't make much sense though. Since:
>
> try_to_wake_up()
> ttwu_queue()
> raw_spin_lock(&rq->lock)
> ttwu_do_activate()
> ttwu_do_wakeup()
> check_preempt_curr()
> check_preempt_wakeup()
> resched_task(rq->curr)
> assert_raw_spin_locked(task_rq(p)->lock)
>
> It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> bonkers, we do after all have rq->lock locked.
>
> I must also say that I've _never_ seen this bug before.

Meanwhile I found that there was a hardware defect on this machine. So if it
does not happen again I will assume that this was caused by this.

Thanks for looking into this.

Eike


Attachments:
signature.asc (198.00 B)
This is a digitally signed message part.

2013-08-11 08:37:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

On Sun, Aug 11, 2013 at 08:09:19AM +0200, Rolf Eike Beer wrote:
> Meanwhile I found that there was a hardware defect on this machine.
> So if it does not happen again I will assume that this was caused by
> this.

What hardware defect exactly? DIMMs failing...? Probably, since it looks
like the spinlock gets corrupted and the assertion fires... In any case,
it would be interesting to know for future reference.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-08-11 11:10:22

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

Borislav Petkov wrote:
> On Sun, Aug 11, 2013 at 08:09:19AM +0200, Rolf Eike Beer wrote:
> > Meanwhile I found that there was a hardware defect on this machine.
> > So if it does not happen again I will assume that this was caused by
> > this.
>
> What hardware defect exactly? DIMMs failing...? Probably, since it looks
> like the spinlock gets corrupted and the assertion fires... In any case,
> it would be interesting to know for future reference.

The RAM seems fine. It looks like it is the mainboard or a harddisk. The issues
have magically disappeared since 3 weeks, but I have not done any suspend2disk
since then anymore. Before that I had suspended the machine on the evening and
resumed when I came to work. So it's possible that there was some corrupted
stuff in the image.

This is the smart output I got of one disk yesterday:

Vendor: /0:0:0:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T
permissive' options.

Eike


Attachments:
signature.asc (198.00 B)
This is a digitally signed message part.

2013-08-13 10:38:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

On Sun, Aug 11, 2013 at 01:10:11PM +0200, Rolf Eike Beer wrote:
> The RAM seems fine. It looks like it is the mainboard or a harddisk.
> The issues have magically disappeared since 3 weeks, but I have not
> done any suspend2disk since then anymore. Before that I had suspended
> the machine on the evening and resumed when I came to work. So it's
> possible that there was some corrupted stuff in the image.

Hmm, probably...

> This is the smart output I got of one disk yesterday:
>
> Vendor: /0:0:0:0
> Product:
> User Capacity: 600,332,565,813,390,450 bytes [600 PB]

Is this for real? 600 PB??

I wanna hdd like that :-)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-08-13 11:57:41

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: Hard lockups using 3.10.0

Borislav Petkov wrote:
> On Sun, Aug 11, 2013 at 01:10:11PM +0200, Rolf Eike Beer wrote:
>> The RAM seems fine. It looks like it is the mainboard or a harddisk.
>> The issues have magically disappeared since 3 weeks, but I have not
>> done any suspend2disk since then anymore. Before that I had suspended
>> the machine on the evening and resumed when I came to work. So it's
>> possible that there was some corrupted stuff in the image.
>
> Hmm, probably...
>
>> This is the smart output I got of one disk yesterday:
>>
>> Vendor: /0:0:0:0
>> Product:
>> User Capacity: 600,332,565,813,390,450 bytes [600 PB]
>
> Is this for real? 600 PB??
>
> I wanna hdd like that :-)

We have problems getting such a disk again. Seems all available one
have disappeared somewhere near Bluffdale.

I'm not sure how good ext4 can handle sector sizes of several hundred
megabytes, so it may be not that fun ;)

Eike