Dear All,
We are using a 3.8.x kernel on ARM and are facing a soft lockup issue.
The logs are as follows:
BUG: spinlock lockup suspected on CPU#0, process1/525
lock: 0xd8ac9a64, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
1. The lock looks available since the owner is -1, so why does
arch_spin_trylock fail?
2. There is a patch: ARM: spinlock: retry trylock operation if strex
fails on free lock
http://permalink.gmane.org/gmane.linux.ports.arm.kernel/240913
In this patch, a loop has been added around "strexeq %2, %0, [%3]"
(comment: "retry the trylock operation if the lock appears
to be free but the strex reported failure").
However, arch_spin_trylock is called by __spin_lock_debug, which already
calls it in a loop. So what purpose does the patch serve?
static void __spin_lock_debug(raw_spinlock_t *lock)
{
        u64 i;
        u64 loops = loops_per_jiffy * HZ;

        for (i = 0; i < loops; i++) {
                if (arch_spin_trylock(&lock->raw_lock))
                        return;
                __delay(1);
        }
        /* lockup suspected: */
        spin_dump(lock, "lockup suspected");
}
3. Is this patch useful to us, and how can we reproduce this scenario?
Scenario: the lock is available, but arch_spin_trylock returns failure.
Thanks
On Sat, Jan 18, 2014 at 07:25:51AM +0000, naveen yadav wrote:
> We are using a 3.8.x kernel on ARM and are facing a soft lockup issue.
> The logs are as follows:
Which CPU/SoC are you using?
> BUG: spinlock lockup suspected on CPU#0, process1/525
> lock: 0xd8ac9a64, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
>
>
> 1. The lock looks available since the owner is -1, so why does
> arch_spin_trylock fail?
Is this with or without the ticket lock patches? Can you inspect the actual
value of the arch_spinlock_t?
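For reference, with the ticket-lock patches applied the ARM arch_spinlock_t is
laid out roughly as in the sketch below (check arch/arm/include/asm/spinlock_types.h
in your tree for the exact definition). The lock is free when
tickets.next == tickets.owner:

typedef struct {
        union {
                u32 slock;                      /* whole lock word */
                struct __raw_tickets {
#ifdef __ARMEB__
                        u16 next;               /* big-endian layout */
                        u16 owner;
#else
                        u16 owner;              /* ticket currently being served */
                        u16 next;               /* next ticket to hand out */
#endif
                } tickets;
        };
} arch_spinlock_t;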
> 2. There is a patch: ARM: spinlock: retry trylock operation if strex
> fails on free lock
> http://permalink.gmane.org/gmane.linux.ports.arm.kernel/240913
> In this patch, a loop has been added around "strexeq %2, %0, [%3]"
> (comment: "retry the trylock operation if the lock appears
> to be free but the strex reported failure").
>
> However, arch_spin_trylock is called by __spin_lock_debug, which already
> calls it in a loop. So what purpose does the patch serve?
Does this patch help your issue? The purpose of it is to distinguish between
two types of contention:
(1) The lock is actually taken
(2) The lock is free, but two people are doing a trylock at the same time
In the case of (2), we do actually want to spin again otherwise you could
potentially end up in a pathological case where the two CPUs repeatedly
shoot down each other's monitor and forward progress isn't made until the
sequence is broken by something like an interrupt.
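To illustrate, the patched trylock ends up looking roughly like the sketch
below (see the patch at the link above for the exact code; TICKET_SHIFT comes
from the ARM spinlock headers):

static inline int arch_spin_trylock(arch_spinlock_t *lock)
{
        unsigned long contended, res;
        u32 slock;

        do {
                __asm__ __volatile__(
                "       ldrex   %0, [%3]\n"             /* load the ticket word */
                "       mov     %2, #0\n"               /* assume the strex will succeed */
                "       subs    %1, %0, %0, ror #16\n"  /* lock is free iff next == owner */
                "       addeq   %0, %0, %4\n"           /* take the next ticket */
                "       strexeq %2, %0, [%3]"           /* %2 != 0 if the store failed */
                : "=&r" (slock), "=&r" (contended), "=&r" (res)
                : "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
                : "cc");
        } while (res);          /* lock looked free but strex failed: retry */

        if (!contended) {
                smp_mb();       /* acquire barrier on success */
                return 1;
        }

        return 0;
}

Without that inner retry, a strex failure on a free lock is reported to the
caller as contention, which is why __spin_lock_debug can still end up
reporting a lockup even though it retries in its own loop.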
> static void __spin_lock_debug(raw_spinlock_t *lock)
> {
>         u64 i;
>         u64 loops = loops_per_jiffy * HZ;
>
>         for (i = 0; i < loops; i++) {
>                 if (arch_spin_trylock(&lock->raw_lock))
>                         return;
>                 __delay(1);
>         }
>         /* lockup suspected: */
>         spin_dump(lock, "lockup suspected");
> }
>
> 3. Is this patch useful to us, and how can we reproduce this scenario?
> Scenario: the lock is available, but arch_spin_trylock returns failure.
Potentially. Why can't you simply apply the patch and see if it resolves your
issue?
Will
Dear Will,
Thanks for your reply,
We are using a Cortex-A15.
Yes, this is with the ticket lock patches.
We will check the value of the arch_spinlock_t and share it. It is a bit
difficult to reproduce this scenario.
If you have any ideas, please suggest how we can reproduce it.
Thanks
On Mon, Jan 20, 2014 at 3:50 PM, Will Deacon <[email protected]> wrote:
> On Sat, Jan 18, 2014 at 07:25:51AM +0000, naveen yadav wrote:
>> We are using a 3.8.x kernel on ARM and are facing a soft lockup issue.
>> The logs are as follows:
>
> Which CPU/SoC are you using?
>
>> BUG: spinlock lockup suspected on CPU#0, process1/525
>> lock: 0xd8ac9a64, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
>>
>>
>> 1. The lock looks available since the owner is -1, so why does
>> arch_spin_trylock fail?
>
> Is this with or without the ticket lock patches? Can you inspect the actual
> value of the arch_spinlock_t?
>
>> 2. There is a patch: ARM: spinlock: retry trylock operation if strex
>> fails on free lock
>> http://permalink.gmane.org/gmane.linux.ports.arm.kernel/240913
>> In this patch, a loop has been added around "strexeq %2, %0, [%3]"
>> (comment: "retry the trylock operation if the lock appears
>> to be free but the strex reported failure").
>>
>> However, arch_spin_trylock is called by __spin_lock_debug, which already
>> calls it in a loop. So what purpose does the patch serve?
>
> Does this patch help your issue? The purpose of it is to distinguish between
> two types of contention:
>
> (1) The lock is actually taken
> (2) The lock is free, but two people are doing a trylock at the same time
>
> In the case of (2), we do actually want to spin again otherwise you could
> potentially end up in a pathological case where the two CPUs repeatedly
> shoot down each other's monitor and forward progress isn't made until the
> sequence is broken by something like an interrupt.
>
>> static void __spin_lock_debug(raw_spinlock_t *lock)
>> {
>>         u64 i;
>>         u64 loops = loops_per_jiffy * HZ;
>>
>>         for (i = 0; i < loops; i++) {
>>                 if (arch_spin_trylock(&lock->raw_lock))
>>                         return;
>>                 __delay(1);
>>         }
>>         /* lockup suspected: */
>>         spin_dump(lock, "lockup suspected");
>> }
>>
>> 3. Is this patch useful to us, and how can we reproduce this scenario?
>> Scenario: the lock is available, but arch_spin_trylock returns failure.
>
> Potentially. Why can't you simply apply the patch and see if it resolves your
> issue?
>
> Will
On Tue, Jan 21, 2014 at 06:37:31AM +0000, naveen yadav wrote:
> Thanks for your reply,
>
> We are using a Cortex-A15.
> Yes, this is with the ticket lock patches.
>
> We will check the value of the arch_spinlock_t and share it. It is a bit
> difficult to reproduce this scenario.
>
> If you have any ideas, please suggest how we can reproduce it.
You could try enabling lockdep and see if it catches anything earlier
on.
Will
Dear Will,
Thanks for your input. We debugged this by adding the print shown below and
found a very large difference between next and owner (more than 1000), so it
looks like memory corruption.
In linux/lib/spinlock_debug.c, spin_dump():

                 msg, raw_smp_processor_id(),
                 current->comm, task_pid_nr(current));
         printk(KERN_EMERG " lock: %pS, .magic: %08x, .owner: %s/%d, "
-                ".owner_cpu: %d\n",
+                ".owner_cpu: %d raw_lock.tickets.next %u raw_lock.tickets.owner %u\n",
                 lock, lock->magic,
                 owner ? owner->comm : "<none>",
                 owner ? task_pid_nr(owner) : -1,
-                lock->owner_cpu);
+                lock->owner_cpu,
+                lock->raw_lock.tickets.next,
+                lock->raw_lock.tickets.owner);
         dump_stack();
 }
I have one request: would it be possible to reorder the structure as below?
If there is any corruption this would make it easier to debug, since a
corrupted magic value can be spotted quickly.
typedef struct raw_spinlock {
#ifdef CONFIG_DEBUG_SPINLOCK
        unsigned int magic, owner_cpu;
        void *owner;
#endif
        arch_spinlock_t raw_lock;
#ifdef CONFIG_GENERIC_LOCKBREAK
        unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map dep_map;
#endif
} raw_spinlock_t;
So if this structure got corrupted, the magic field sitting next to raw_lock
would make the corruption easier to detect.
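For reference, the debug path already validates the magic on every lock
attempt, roughly as in the sketch below (from lib/spinlock_debug.c; the exact
code in 3.8 may differ slightly). With magic placed immediately before
raw_lock, an overrun running into the lock word would hit the magic first and
trip the "bad magic" check:

static void debug_spin_lock_before(raw_spinlock_t *lock)
{
        /* reports "bad magic" if the debug fields have been overwritten */
        SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
        SPIN_BUG_ON(lock->owner == current, lock, "recursion");
        SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
                                                lock, "cpu recursion");
}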
On Tue, Jan 21, 2014 at 3:44 PM, Will Deacon <[email protected]> wrote:
> On Tue, Jan 21, 2014 at 06:37:31AM +0000, naveen yadav wrote:
>> Thanks for your reply,
>>
>> We are using a Cortex-A15.
>> Yes, this is with the ticket lock patches.
>>
>> We will check the value of the arch_spinlock_t and share it. It is a bit
>> difficult to reproduce this scenario.
>>
>> If you have any ideas, please suggest how we can reproduce it.
>
> You could try enabling lockdep and see if it catches anything earlier
> on.
>
> Will