2005-03-09 10:58:15

by Suparna Bhattacharya

Subject: Re: aio stress panic on 2.6.11-mm1


Any sense of how costly it is to use spin_lock_irq's vs spin_lock
(across different architectures) ? Isn't rwsem used very widely ?

Regards
Suparna

On Wed, Mar 09, 2005 at 10:33:58AM +0000, David Howells wrote:
> Andrew Morton <[email protected]> wrote:
>
> > If we want to take the spinlock from interrupt context, the non-interrupt
> > context code needs to do spin_lock_irq(), not spin_lock().
>
> Yeah. I think I had a patch for that somewhere, but I think Linus turned it
> down. I can't find any emails on that subject though. I'll knock together a
> new patch for it.
>
> David
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [email protected]

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India


2005-03-09 11:05:59

by Andrew Morton

Subject: Re: aio stress panic on 2.6.11-mm1

Suparna Bhattacharya <[email protected]> wrote:
>
> Any sense of how costly it is to use spin_lock_irq's vs spin_lock
> (across different architectures) ? Isn't rwsem used very widely ?

It's only on the slow path, and we've already done a bunch of atomic ops
and a schedule()/wakeup() anyway.
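Andrew's point -- that the irq-disabling spinlock is only reached on the
contended path -- can be sketched in userspace. The following is a
hypothetical, simplified model (the sketch_* names are invented; this is
C11 atomics plus a pthread mutex, not the kernel implementation): the
uncontended down_read/up_read is a single atomic op, and only the
contended case touches the internal wait_lock, which is the spinlock
whose irq semantics this thread is discussing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Invented, simplified model of an rwsem's fast/slow path split.
 * A positive count means active readers; WRITER_BIAS means a writer
 * holds the sem.  Only the slow path touches wait_lock, which stands
 * in for the spinlock the real code would take with irqs disabled. */
#define WRITER_BIAS (-1000000L)

struct sketch_rwsem {
	atomic_long count;
	pthread_mutex_t wait_lock;	/* stands in for sem->wait_lock */
	long slowpath_hits;		/* times we fell off the fast path */
};

static void sketch_init(struct sketch_rwsem *s)
{
	atomic_init(&s->count, 0);
	pthread_mutex_init(&s->wait_lock, NULL);
	s->slowpath_hits = 0;
}

static void sketch_down_read(struct sketch_rwsem *s)
{
	/* fast path: one atomic op, no spinlock at all */
	if (atomic_fetch_add(&s->count, 1) >= 0)
		return;
	/* slow path: a writer is active; the real code would now take
	 * wait_lock, queue itself and sleep.  We only count the event. */
	s->slowpath_hits++;
	pthread_mutex_lock(&s->wait_lock);
	pthread_mutex_unlock(&s->wait_lock);
}

static void sketch_up_read(struct sketch_rwsem *s)
{
	atomic_fetch_sub(&s->count, 1);
}
```

Compile with cc -std=c11 -pthread; the only point is that whatever the
cli/sti cost turns out to be, the uncontended case never pays it.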

2005-03-09 11:08:21

by Arjan van de Ven

Subject: Re: aio stress panic on 2.6.11-mm1

On Wed, 2005-03-09 at 16:34 +0530, Suparna Bhattacharya wrote:
> Any sense of how costly it is to use spin_lock_irq's vs spin_lock
> (across different architectures)

on x86 it makes a difference of maybe a few cycles. At most.
However please consider using spin_lock_irqsave(); the _irq() variant,
while it can be used correctly, is a major source of bugs since it
unconditionally enables interrupts on unlock.
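The failure mode Arjan describes can be illustrated with a toy userspace
model: a single global flag stands in for the CPU's interrupt-enable
state, and the sim_* functions below (invented names, not kernel API)
mimic only the irq-flag handling of the two lock variants. A helper
written with the _irq variant unconditionally re-enables interrupts on
unlock, even when its caller had already disabled them; the
_irqsave/_irqrestore pair preserves the caller's state.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model: one flag stands in for the CPU's interrupt-enable state.
 * No real locking is done; only the irq-flag handling is simulated. */
static int irqs_enabled = 1;

static void sim_spin_lock_irq(void)   { irqs_enabled = 0; }
static void sim_spin_unlock_irq(void) { irqs_enabled = 1; } /* unconditional "sti" */

static unsigned long sim_spin_lock_irqsave(void)
{
	unsigned long flags = irqs_enabled;	/* remember caller's state */
	irqs_enabled = 0;
	return flags;
}

static void sim_spin_unlock_irqrestore(unsigned long flags)
{
	irqs_enabled = flags;			/* put it back exactly */
}

/* A helper written with the _irq variant: fine from plain process
 * context, but wrong if the caller already has interrupts off. */
static void helper_using_irq_variant(void)
{
	sim_spin_lock_irq();
	/* ... critical section ... */
	sim_spin_unlock_irq();
}

static void helper_using_irqsave_variant(void)
{
	unsigned long flags = sim_spin_lock_irqsave();
	/* ... critical section ... */
	sim_spin_unlock_irqrestore(flags);
}
```

The _irq variant is cheaper and correct when you can prove interrupts
are enabled on entry; _irqsave is the safe default when you can't.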



2005-03-09 11:10:47

by Arjan van de Ven

Subject: Re: aio stress panic on 2.6.11-mm1

On Wed, 2005-03-09 at 16:34 +0530, Suparna Bhattacharya wrote:
> Any sense of how costly it is to use spin_lock_irq's vs spin_lock
> (across different architectures) ? Isn't rwsem used very widely ?

oh also rwsems aren't used all that much simply because they are quite a
bit more expensive than regular semaphores, so you need a HUGE bias in
the reader/writer ratio to make it even worth it...

2005-03-09 11:14:18

by Andi Kleen

Subject: Re: aio stress panic on 2.6.11-mm1

Suparna Bhattacharya <[email protected]> writes:

> Any sense of how costly it is to use spin_lock_irq's vs spin_lock
> (across different architectures) ? Isn't rwsem used very widely ?

On P4s cli/sti is quite costly, let's say 100+ cycles. That is mostly
because it synchronizes the CPU partly. The Intel tables say 26/36 cycles
latency, but in practice it seems to take longer because of the
synchronization.

I would assume this is the worst case, everywhere else it should
be cheaper (except perhaps in some virtualized environments).
On P-M and AMD K7/K8 it is only a few cycles difference.

-Andi

2005-03-09 11:18:28

by Andi Kleen

Subject: Re: aio stress panic on 2.6.11-mm1

Arjan van de Ven <[email protected]> writes:

> On Wed, 2005-03-09 at 16:34 +0530, Suparna Bhattacharya wrote:
>> Any sense of how costly it is to use spin_lock_irq's vs spin_lock
>> (across different architectures)
>
> on x86 it makes a difference of maybe a few cycles. At most.
> However please consider using spin_lock_irqsave(); the _irq() variant,
> while it can be used correctly, is a major source of bugs since it
> unconditionally enables interrupts on unlock.

irqsave is much worse than _irq on P4.

However the spinlock already synchronizes the CPU, so some of the
cost should be mitigated.

-Andi

2005-03-09 11:23:12

by Andi Kleen

Subject: Re: aio stress panic on 2.6.11-mm1

Arjan van de Ven <[email protected]> writes:

> On Wed, 2005-03-09 at 16:34 +0530, Suparna Bhattacharya wrote:
>> Any sense of how costly it is to use spin_lock_irq's vs spin_lock
>> (across different architectures) ? Isn't rwsem used very widely ?
>
> > oh also rwsems aren't used all that much simply because they are quite a
> > bit more expensive than regular semaphores, so you need a HUGE bias in
> > the reader/writer ratio to make it even worth it...

I agree. In fact, I think that once Christopher L's lockless page fault fast
path goes in it would be a good idea to reevaluate whether mmap_sem should
really be a rwsem, and possibly change it back to a normal semaphore
that perhaps gets dropped only on a page cache miss.

-Andi

2005-03-09 11:29:37

by Andrew Morton

Subject: Re: aio stress panic on 2.6.11-mm1

Arjan van de Ven <[email protected]> wrote:
>
> On Wed, 2005-03-09 at 16:34 +0530, Suparna Bhattacharya wrote:
> > Any sense of how costly it is to use spin_lock_irq's vs spin_lock
> > (across different architectures)
>
> on x86 it makes a difference of maybe a few cycles. At most.
> However please consider using spin_lock_irqsave(); the _irq() variant,
> while it can be used correctly, is a major source of bugs since it
> unconditionally enables interrupts on unlock.
>

spin_lock_irq() is OK for down_*(), since down() can call schedule() anyway.

spin_lock_irqsave() should be used in up_*() and I guess down_*_trylock(),
although the latter shouldn't need to go into the slowpath anyway.

2005-03-09 12:00:07

by David Howells

Subject: Re: aio stress panic on 2.6.11-mm1

Andrew Morton <[email protected]> wrote:

> spin_lock_irq() is OK for down_*(), since down() can call schedule() anyway.
>
> spin_lock_irqsave() should be used in up_*() and I guess down_*_trylock(),
> although the latter shouldn't need to go into the slowpath anyway.

That's what I've done. I'm just testing my changes.

David

2005-03-09 12:13:38

by David Howells

Subject: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks


The attached patch makes read/write semaphores use interrupt disabling
spinlocks, thus rendering the up functions and trylock functions available for
use in interrupt context.

I've assumed that the normal down functions must be called with interrupts
enabled (since they might schedule), and used the irq-disabling spinlock
variants that don't save the flags.

Signed-Off-By: David Howells <[email protected]>
---
warthog>diffstat -p1 rwsem-irqspin-2611mm2.diff
lib/rwsem-spinlock.c | 42 ++++++++++++++++++++++++++----------------
lib/rwsem.c | 16 ++++++++++------
2 files changed, 36 insertions(+), 22 deletions(-)

diff -uNrp linux-2.6.11-mm2/lib/rwsem.c linux-2.6.11-mm2-rwsem/lib/rwsem.c
--- linux-2.6.11-mm2/lib/rwsem.c 2004-10-19 10:42:19.000000000 +0100
+++ linux-2.6.11-mm2-rwsem/lib/rwsem.c 2005-03-09 10:45:16.000000000 +0000
@@ -150,7 +150,7 @@ rwsem_down_failed_common(struct rw_semap
set_task_state(tsk, TASK_UNINTERRUPTIBLE);

/* set up my own style of waitqueue */
- spin_lock(&sem->wait_lock);
+ spin_lock_irq(&sem->wait_lock);
waiter->task = tsk;
get_task_struct(tsk);

@@ -163,7 +163,7 @@ rwsem_down_failed_common(struct rw_semap
if (!(count & RWSEM_ACTIVE_MASK))
sem = __rwsem_do_wake(sem, 0);

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irq(&sem->wait_lock);

/* wait to be given the lock */
for (;;) {
@@ -219,15 +219,17 @@ rwsem_down_write_failed(struct rw_semaph
*/
struct rw_semaphore fastcall *rwsem_wake(struct rw_semaphore *sem)
{
+ unsigned long flags;
+
rwsemtrace(sem, "Entering rwsem_wake");

- spin_lock(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
sem = __rwsem_do_wake(sem, 0);

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving rwsem_wake");

@@ -241,15 +243,17 @@ struct rw_semaphore fastcall *rwsem_wake
*/
struct rw_semaphore fastcall *rwsem_downgrade_wake(struct rw_semaphore *sem)
{
+ unsigned long flags;
+
rwsemtrace(sem, "Entering rwsem_downgrade_wake");

- spin_lock(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
sem = __rwsem_do_wake(sem, 1);

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving rwsem_downgrade_wake");
return sem;
diff -uNrp linux-2.6.11-mm2/lib/rwsem-spinlock.c linux-2.6.11-mm2-rwsem/lib/rwsem-spinlock.c
--- linux-2.6.11-mm2/lib/rwsem-spinlock.c 2004-09-16 12:06:23.000000000 +0100
+++ linux-2.6.11-mm2-rwsem/lib/rwsem-spinlock.c 2005-03-09 10:43:47.000000000 +0000
@@ -140,12 +140,12 @@ void fastcall __sched __down_read(struct

rwsemtrace(sem, "Entering __down_read");

- spin_lock(&sem->wait_lock);
+ spin_lock_irq(&sem->wait_lock);

if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
/* granted */
sem->activity++;
- spin_unlock(&sem->wait_lock);
+ spin_unlock_irq(&sem->wait_lock);
goto out;
}

@@ -160,7 +160,7 @@ void fastcall __sched __down_read(struct
list_add_tail(&waiter.list, &sem->wait_list);

/* we don't need to touch the semaphore struct anymore */
- spin_unlock(&sem->wait_lock);
+ spin_unlock_irq(&sem->wait_lock);

/* wait to be given the lock */
for (;;) {
@@ -181,10 +181,12 @@ void fastcall __sched __down_read(struct
*/
int fastcall __down_read_trylock(struct rw_semaphore *sem)
{
+ unsigned long flags;
int ret = 0;
+
rwsemtrace(sem, "Entering __down_read_trylock");

- spin_lock(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);

if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
/* granted */
@@ -192,7 +194,7 @@ int fastcall __down_read_trylock(struct
ret = 1;
}

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __down_read_trylock");
return ret;
@@ -209,12 +211,12 @@ void fastcall __sched __down_write(struc

rwsemtrace(sem, "Entering __down_write");

- spin_lock(&sem->wait_lock);
+ spin_lock_irq(&sem->wait_lock);

if (sem->activity == 0 && list_empty(&sem->wait_list)) {
/* granted */
sem->activity = -1;
- spin_unlock(&sem->wait_lock);
+ spin_unlock_irq(&sem->wait_lock);
goto out;
}

@@ -229,7 +231,7 @@ void fastcall __sched __down_write(struc
list_add_tail(&waiter.list, &sem->wait_list);

/* we don't need to touch the semaphore struct anymore */
- spin_unlock(&sem->wait_lock);
+ spin_unlock_irq(&sem->wait_lock);

/* wait to be given the lock */
for (;;) {
@@ -250,10 +252,12 @@ void fastcall __sched __down_write(struc
*/
int fastcall __down_write_trylock(struct rw_semaphore *sem)
{
+ unsigned long flags;
int ret = 0;
+
rwsemtrace(sem, "Entering __down_write_trylock");

- spin_lock(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);

if (sem->activity == 0 && list_empty(&sem->wait_list)) {
/* granted */
@@ -261,7 +265,7 @@ int fastcall __down_write_trylock(struct
ret = 1;
}

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __down_write_trylock");
return ret;
@@ -272,14 +276,16 @@ int fastcall __down_write_trylock(struct
*/
void fastcall __up_read(struct rw_semaphore *sem)
{
+ unsigned long flags;
+
rwsemtrace(sem, "Entering __up_read");

- spin_lock(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);

if (--sem->activity == 0 && !list_empty(&sem->wait_list))
sem = __rwsem_wake_one_writer(sem);

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __up_read");
}
@@ -289,15 +295,17 @@ void fastcall __up_read(struct rw_semaph
*/
void fastcall __up_write(struct rw_semaphore *sem)
{
+ unsigned long flags;
+
rwsemtrace(sem, "Entering __up_write");

- spin_lock(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);

sem->activity = 0;
if (!list_empty(&sem->wait_list))
sem = __rwsem_do_wake(sem, 1);

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __up_write");
}
@@ -308,15 +316,17 @@ void fastcall __up_write(struct rw_semap
*/
void fastcall __downgrade_write(struct rw_semaphore *sem)
{
+ unsigned long flags;
+
rwsemtrace(sem, "Entering __downgrade_write");

- spin_lock(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);

sem->activity = 1;
if (!list_empty(&sem->wait_list))
sem = __rwsem_do_wake(sem, 0);

- spin_unlock(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __downgrade_write");
}

2005-03-09 14:39:49

by Suparna Bhattacharya

Subject: Re: aio stress panic on 2.6.11-mm1

On Wed, Mar 09, 2005 at 12:21:01PM +0100, Andi Kleen wrote:
> Arjan van de Ven <[email protected]> writes:
>
> > On Wed, 2005-03-09 at 16:34 +0530, Suparna Bhattacharya wrote:
> >> Any sense of how costly it is to use spin_lock_irq's vs spin_lock
> >> (across different architectures) ? Isn't rwsem used very widely ?
>
> On P4s cli/sti is quite costly, let's say 100+ cycles. That is mostly
> because it synchronizes the CPU partly. The Intel tables say 26/36 cycles
> latency, but in practice it seems to take longer because of the
> synchronization.
>
> I would assume this is the worst case, everywhere else it should
> be cheaper (except perhaps in some virtualized environments).
> On P-M and AMD K7/K8 it is only a few cycles difference.
> >
> > oh also rwsems aren't used all that much simply because they are quite a
> > bit more expensive than regular semaphores, so you need a HUGE bias in
> > the reader/writer ratio to make it even worth it...
>
> I agree. In fact, I think that once Christopher L's lockless page fault fast
> path goes in it would be a good idea to reevaluate whether mmap_sem should
> really be a rwsem, and possibly change it back to a normal semaphore
> that perhaps gets dropped only on a page cache miss.


OK - makes sense. Thanks !

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-09 17:40:23

by Badari Pulavarty

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks

Your patch seems to have helped. I don't see the Oops anymore - my
tests are still running (past 1 hour - it used to panic in 10 min).

Thanks,
Badari

On Wed, 2005-03-09 at 04:12, David Howells wrote:
> The attached patch makes read/write semaphores use interrupt disabling
> spinlocks, thus rendering the up functions and trylock functions available for
> use in interrupt context.
>
> I've assumed that the normal down functions must be called with interrupts
> enabled (since they might schedule), and used the irq-disabling spinlock
> variants that don't save the flags.
>
> Signed-Off-By: David Howells <[email protected]>
> ---
> warthog>diffstat -p1 rwsem-irqspin-2611mm2.diff


2005-03-09 19:21:57

by Badari Pulavarty

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks

Hi,

I am not sure if this is related to your patch. But I ran into
BUG() with sysrq-t with your patch.

Thanks,
Badari

BUG: soft lockup detected on CPU#1!

Modules linked in: joydev sg st floppy usbserial parport_pc lp parport
ipv6 ohci_hcd i2c_amd756 i2c_core evdev usbcore raid0 dm_mod nls_utf8
Pid: 15433, comm: bash Not tainted 2.6.11-mm1n
RIP: 0010:[<ffffffff8013d094>] <ffffffff8013d094>{__do_softirq+84}
RSP: 0018:ffff8101dff83f68 EFLAGS: 00000206
RAX: ffffffff80651880 RBX: 0000000000000002 RCX: 0000000000000004
RDX: 0000000000000002 RSI: 0000000000000103 RDI: ffff8101d7c77680
RBP: ffff810177ffbe48 R08: 0000000000000002 R09: 0000000000000100
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
R13: 00002aaaaaafb000 R14: 000000000000000a R15: 0000000000000001
FS: 00002aaaab2890a0(0000) GS:ffffffff80651880(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaaafb000 CR3: 00000001bb2a0000 CR4: 00000000000006e0

Call Trace:<IRQ> <ffffffff8013d165>{do_softirq+53}
<ffffffff8010ef59>{apic_timer_interrupt+133}
<EOI> <ffffffff80403055>{_spin_unlock_irqrestore+5}
<ffffffff801bd357>{write_sysrq_trigger+55}
<ffffffff80183579>{vfs_write+233}
<ffffffff80183713>{sys_write+83}
<ffffffff8010e5ce>{system_call+126}


On Wed, 2005-03-09 at 04:12, David Howells wrote:
> The attached patch makes read/write semaphores use interrupt disabling
> spinlocks, thus rendering the up functions and trylock functions available for
> use in interrupt context.
>
> I've assumed that the normal down functions must be called with interrupts
> enabled (since they might schedule), and used the irq-disabling spinlock
> variants that don't save the flags.
>
> Signed-Off-By: David Howells <[email protected]>


2005-03-09 19:31:09

by Badari Pulavarty

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks

Well, aio-stress seems to run better with your patch (no Oops) but
I think we still have a problem in AIO. It looks like aio-stress
is stuck (unable to kill it).

Here is the sysrq-t output:

aio-stress D ffff8101be224970 0 15430 1 15429
(NOTLB)
ffff8101be21bd58 0000000000000082 ffff8101be21be18 0000000000010000
ffff8101be2248b0 ffff810100000074 000000000000007a
ffff810198868e90
ffff8101d6daaf50 ffff8101d6dab160
Call Trace:<ffffffff804013c8>{__down+152}
<ffffffff80132750>{default_wake_function+0}
<ffffffff80402d39>{__down_failed+53}
<ffffffff80161f13>{.text.lock.filemap+65}
<ffffffff801a23c0>{aio_pwrite+0}
<ffffffff801a23e1>{aio_pwrite+33}
<ffffffff801a35d0>{__aio_run_iocbs+384}
<ffffffff801a401e>{io_submit_one+494}
<ffffffff801a4149>{sys_io_submit+217}
<ffffffff8010e5ce>{system_call+126}


Top shows:

top - 12:22:33 up 2:57, 2 users, load average: 5.08, 5.08, 5.01
Tasks: 79 total, 1 running, 77 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.0% us, 25.0% sy, 0.0% ni, 75.0% id, 0.0% wa, 0.0% hi,
0.0% si
Mem: 7148100k total, 176708k used, 6971392k free, 18600k buffers
Swap: 1048784k total, 0k used, 1048784k free, 44708k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15425 root 16 0 0 0 0 Z 99.8 0.0 172:17.98 aio-stress
<defunct>
15803 root 16 0 4048 1116 820 R 0.3 0.0 0:00.51 top



Thanks,
Badari


On Wed, 2005-03-09 at 09:35, Badari Pulavarty wrote:
> Your patch seems to have helped. I don't see the Oops anymore - my
> tests are still running (past 1 hour - it used to panic in 10 min).
>
> Thanks,
> Badari
>
> On Wed, 2005-03-09 at 04:12, David Howells wrote:
> > The attached patch makes read/write semaphores use interrupt disabling
> > spinlocks, thus rendering the up functions and trylock functions available for
> > use in interrupt context.
> >
> > I've assumed that the normal down functions must be called with interrupts
> > enabled (since they might schedule), and used the irq-disabling spinlock
> > variants that don't save the flags.
> >
> > Signed-Off-By: David Howells <[email protected]>
> > ---
> > warthog>diffstat -p1 rwsem-irqspin-2611mm2.diff
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [email protected]
>

2005-03-09 19:58:50

by Andrew Morton

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks

Badari Pulavarty <[email protected]> wrote:
>
> I am not sure if this is related to your patch. But I ran into
> BUG() with sysrq-t with your patch.
>
> Thanks,
> Badari
>
> BUG: soft lockup detected on CPU#1!
>
> Modules linked in: joydev sg st floppy usbserial parport_pc lp parport
> ipv6 ohci_hcd i2c_amd756 i2c_core evdev usbcore raid0 dm_mod nls_utf8
> Pid: 15433, comm: bash Not tainted 2.6.11-mm1n
> RIP: 0010:[<ffffffff8013d094>] <ffffffff8013d094>{__do_softirq+84}
> RSP: 0018:ffff8101dff83f68 EFLAGS: 00000206
> RAX: ffffffff80651880 RBX: 0000000000000002 RCX: 0000000000000004
> RDX: 0000000000000002 RSI: 0000000000000103 RDI: ffff8101d7c77680
> RBP: ffff810177ffbe48 R08: 0000000000000002 R09: 0000000000000100
> R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
> R13: 00002aaaaaafb000 R14: 000000000000000a R15: 0000000000000001
> FS: 00002aaaab2890a0(0000) GS:ffffffff80651880(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaafb000 CR3: 00000001bb2a0000 CR4: 00000000000006e0
>
> Call Trace:<IRQ> <ffffffff8013d165>{do_softirq+53}
> <ffffffff8010ef59>{apic_timer_interrupt+133}
> <EOI> <ffffffff80403055>{_spin_unlock_irqrestore+5}
> <ffffffff801bd357>{write_sysrq_trigger+55}
> <ffffffff80183579>{vfs_write+233}
> <ffffffff80183713>{sys_write+83}
> <ffffffff8010e5ce>{system_call+126}

That's probably just a false positive in Ingo's soft-lockup detector. Long
streams of irq-context serial console output will do that.

Ingo, we already have a touch_nmi_watchdog() in the sysrq code. It might be
worth adding a touch_softlockup_watchdog() wherever we have a
touch_nmi_watchdog().

2005-03-09 21:23:53

by Arjan van de Ven

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks


> Ingo, we already have a touch_nmi_watchdog() in the sysrq code. It might be
> worth adding a touch_softlockup_watchdog() wherever we have a
> touch_nmi_watchdog().

....or add touch_softlockup_watchdog to touch_nmi_watchdog() instead
and rename it tickle_watchdog() over time.

2005-03-09 22:54:43

by David Howells

Subject: Re: aio stress panic on 2.6.11-mm1

Arjan van de Ven <[email protected]> wrote:

> On Wed, 2005-03-09 at 16:34 +0530, Suparna Bhattacharya wrote:
> > Any sense of how costly it is to use spin_lock_irq's vs spin_lock
> > (across different architectures) ? Isn't rwsem used very widely ?
>
> oh also rwsems aren't used all that much simply because they are quite
> more expensive than regular semaphores, so that you need a HUGE bias in
> reader/writer ratio to make it even worth it...

I can put some numbers to that. I've attached the module that I use for
destruction-testing rwsems. I've extended it to splat ordinary semaphores too.

You run it by insmod'ing it. It will hammer the box for a few seconds, print a
summary to the console log and then return -ENOANO to cause insmod to abort
module addition.

You can give it a number of parameters:

mx=N Number of mutex splatting threads to run
rd=N Number of rwsem read-splatting threads to run
wr=N Number of rwsem write-splatting threads to run
dg=N Number of rwsem downgrader threads to run
elapse=N Number of seconds to run for (default 5)
do_sched=1 Schedule occasionally
load=N Number of microseconds of load
verbose=0 Print only a summary (as in the table below)
verbose=1 Print more information

So, some statistics for a dual 200MHz Pentium Pro box, running the test for
the default time:

MODULE PARAM RESULTS
===================== =======================================
mx rd wr dg S ld mutexes reads writes downgrade
=== === === === = === ========= ========= ========= =========

With no load and without scheduling:

1 0 0 0 - 0 7331475 0 0 0
1 0 0 0 - 0 7465404 0 0 0
1 0 0 0 - 0 7319429 0 0 0
0 1 0 0 - 0 0 7743129 0 0
0 1 0 0 - 0 0 7698473 0 0
0 1 0 0 - 0 0 7614090 0 0
0 0 1 0 - 0 0 0 7051591 0
0 0 1 0 - 0 0 0 7027214 0
0 0 1 0 - 0 0 0 7054375 0
0 1 1 0 - 0 0 119838 106730 0
0 1 1 0 - 0 0 637862 96867 0
0 1 1 0 - 0 0 520168 89630 0

10 0 0 0 - 0 1068401 0 0 0
10 0 0 0 - 0 1035501 0 0 0
10 0 0 0 - 0 1170587 0 0 0
0 10 0 0 - 0 0 2865253 0 0
0 10 0 0 - 0 0 2983333 0 0
0 10 0 0 - 0 0 2969689 0 0
0 0 10 0 - 0 0 0 503357 0
0 0 10 0 - 0 0 0 657964 0
0 0 10 0 - 0 0 0 758048 0
0 10 10 0 - 0 0 382710 117488 0
0 10 10 0 - 0 0 519159 121845 0
0 10 10 0 - 0 0 639660 103995 0
0 10 1 0 - 0 0 2876112 0 0
0 10 1 0 - 0 0 2954678 0 0
0 10 1 0 - 0 0 1438340 37437 0

With no load and with occasional scheduling:

0 1 1 0 s 0 0 130326 110929 0
0 1 1 0 s 0 0 135551 99816 0
0 1 1 0 s 0 0 136236 117179 0

10 0 0 0 s 0 283945 0 0 0
10 0 0 0 s 0 253822 0 0 0
10 0 0 0 s 0 275887 0 0 0
0 10 0 0 s 0 0 2398587 0 0
0 10 0 0 s 0 0 2329326 0 0
0 10 0 0 s 0 0 2326537 0 0
0 0 10 0 s 0 0 0 305368 0
0 0 10 0 s 0 0 0 277164 0
0 0 10 0 s 0 0 0 310533 0
0 10 10 0 s 0 0 155986 156444 0
0 10 10 0 s 0 0 141763 172333 0
0 10 10 0 s 0 0 157835 130404 0
0 10 1 0 s 0 0 2120867 4928 0
0 10 1 0 s 0 0 2076650 11902 0
0 10 1 0 s 0 0 2243057 16289 0

With a load of 2uS:

1 0 0 0 - 2 1607186 0 0 0
0 1 0 0 - 2 0 1627802 0 0
0 0 1 0 - 2 0 0 1595146 0
0 1 1 0 - 2 0 92531 87686 0

1 0 0 0 s 2 988549 0 0 0
0 1 0 0 s 2 0 987335 0 0
0 0 1 0 s 2 0 0 969585 0
0 1 1 0 s 2 0 101914 93392 0

10 0 0 0 - 2 188636 0 0 0
0 10 0 0 - 2 0 2139068 0 0
0 0 10 0 - 2 0 0 273576 0
0 10 10 0 - 2 0 178319 113583 0
0 10 1 0 - 2 0 2132616 0 0

10 0 0 0 s 2 175498 0 0 0
0 10 0 0 s 2 0 1303410 0 0
0 0 10 0 s 2 0 0 239851 0
0 10 10 0 s 2 0 116714 125319 0
0 10 1 0 s 2 0 1063463 7678 0

2 0 0 0 s 2 190007 0 0 0
0 2 0 0 s 2 0 1309071 0 0
0 0 2 0 s 2 0 0 187197 0

3 0 0 0 s 2 180995 0 0 0
0 3 0 0 s 2 0 1282154 0 0
0 0 3 0 s 2 0 0 219858 0

It's interesting to note that using only write-locks on an rwsem can actually
be somewhat _more_ efficient than using an ordinary semaphore configured as a
mutex, particularly as the number of competing threads increases.

And, of course, you have to take these results with a small pinch of salt; the
box I've been running them on has most of the usual clutch of userspace
services you'd expect to find on an FC3 or newer installation. Ideally, these
tests should be run from a shell started by the kernel in place of /sbin/init.

David
---
/* rwsem-any.c: run some threads to do R/W semaphore tests
*
* Copyright (C) 2005 Red Hat, Inc. All Rights Reserved.
* Written by David Howells ([email protected])
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*
* run as: insmod rwsem-any.ko rd=1 wr=1 dg=0 mx=0
*/

#include <linux/config.h>
#include <linux/module.h>
#include <linux/poll.h>
#include <linux/moduleparam.h>
#include <linux/stat.h>
#include <linux/init.h>
#include <asm/atomic.h>
#include <linux/personality.h>
#include <linux/smp_lock.h>
#include <linux/delay.h>
#include <linux/timer.h>
#include <linux/completion.h>

#define SCHED
#define LOAD_TEST

static int nummx = 0, numrd = 1, numwr = 1, numdg = 1, elapse = 5;
static int verbose = 0;

MODULE_AUTHOR("David Howells");
MODULE_DESCRIPTION("R/W semaphore test demo");
MODULE_LICENSE("GPL");

module_param_named(v, verbose, int, 0);
MODULE_PARM_DESC(verbose, "Verbosity");

module_param_named(mx, nummx, int, 0);
MODULE_PARM_DESC(nummx, "Number of mutex threads");

module_param_named(rd, numrd, int, 0);
MODULE_PARM_DESC(numrd, "Number of reader threads");

module_param_named(wr, numwr, int, 0);
MODULE_PARM_DESC(numwr, "Number of writer threads");

module_param_named(dg, numdg, int, 0);
MODULE_PARM_DESC(numdg, "Number of downgrader threads");

module_param(elapse, int, 0);
MODULE_PARM_DESC(elapse, "Number of seconds to run for");

#ifdef LOAD_TEST
static int load = 0;
module_param(load, int, 0);
MODULE_PARM_DESC(load, "Length of load in uS");
#endif

#ifdef SCHED
static int do_sched = 0;
module_param(do_sched, int, 0);
MODULE_PARM_DESC(do_sched, "True if each thread should schedule regularly");
#endif

/* the semaphores under test */
static struct semaphore mutex_sem;
static struct rw_semaphore rwsem_sem;


#ifdef DEBUG_RWSEM
extern struct rw_semaphore *rwsem_to_monitor;
#endif

static atomic_t mutexes = ATOMIC_INIT(0);
static atomic_t readers = ATOMIC_INIT(0);
static atomic_t writers = ATOMIC_INIT(0);
static atomic_t do_stuff = ATOMIC_INIT(0);
static atomic_t mutexes_taken = ATOMIC_INIT(0);
static atomic_t reads_taken = ATOMIC_INIT(0);
static atomic_t writes_taken = ATOMIC_INIT(0);
static atomic_t downgrades_taken = ATOMIC_INIT(0);

static struct completion mx_comp[20], rd_comp[20], wr_comp[20], dg_comp[20];

static struct timer_list timer;

#define CHECK(expr) \
do { \
if (!(expr)) \
printk("check " #expr " failed in %s\n", __func__); \
} while (0)

#define CHECKA(var, val) \
do { \
int x = atomic_read(&(var)); \
if (x != (val)) \
printk("check [%s != %d, == %d] failed in %s\n", #var, (val), x, __func__); \
} while (0)

static inline void d(void)
{
down(&mutex_sem);
atomic_inc(&mutexes);
atomic_inc(&mutexes_taken);
CHECKA(mutexes, 1);
}

static inline void u(void)
{
CHECKA(mutexes, 1);
atomic_dec(&mutexes);
up(&mutex_sem);
}

static inline void dr(void)
{
down_read(&rwsem_sem);
atomic_inc(&readers);
atomic_inc(&reads_taken);
CHECKA(writers, 0);
}

static inline void ur(void)
{
CHECKA(writers, 0);
atomic_dec(&readers);
up_read(&rwsem_sem);
}

static inline void dw(void)
{
down_write(&rwsem_sem);
atomic_inc(&writers);
atomic_inc(&writes_taken);
CHECKA(writers, 1);
CHECKA(readers, 0);
}

static inline void uw(void)
{
CHECKA(writers, 1);
CHECKA(readers, 0);
atomic_dec(&writers);
up_write(&rwsem_sem);
}

static inline void dgw(void)
{
CHECKA(writers, 1);
CHECKA(readers, 0);
atomic_dec(&writers);
atomic_inc(&readers);
downgrade_write(&rwsem_sem);
atomic_inc(&downgrades_taken);
}

static inline void sched(void)
{
#ifdef SCHED
if (do_sched)
schedule();
#endif
}

int mutexer(void *arg)
{
unsigned int N = (unsigned long) arg;

daemonize("Mutex%u", N);

while (atomic_read(&do_stuff)) {
d();
#ifdef LOAD_TEST
if (load)
udelay(load);
#endif
u();
sched();
}

if (verbose)
printk("%s: done\n", current->comm);
complete_and_exit(&mx_comp[N], 0);
}

int reader(void *arg)
{
unsigned int N = (unsigned long) arg;

daemonize("Read%u", N);

while (atomic_read(&do_stuff)) {
dr();
#ifdef LOAD_TEST
if (load)
udelay(load);
#endif
ur();
sched();
}

if (verbose)
printk("%s: done\n", current->comm);
complete_and_exit(&rd_comp[N], 0);
}

int writer(void *arg)
{
unsigned int N = (unsigned long) arg;

daemonize("Write%u", N);

while (atomic_read(&do_stuff)) {
dw();
#ifdef LOAD_TEST
if (load)
udelay(load);
#endif
uw();
sched();
}

if (verbose)
printk("%s: done\n", current->comm);
complete_and_exit(&wr_comp[N], 0);
}

int downgrader(void *arg)
{
unsigned int N = (unsigned long) arg;

daemonize("Down%u", N);

while (atomic_read(&do_stuff)) {
dw();
#ifdef LOAD_TEST
if (load)
udelay(load);
#endif
dgw();
#ifdef LOAD_TEST
if (load)
udelay(load);
#endif
ur();
sched();
}

if (verbose)
printk("%s: done\n", current->comm);
complete_and_exit(&dg_comp[N], 0);
}

static void stop_test(unsigned long dummy)
{
atomic_set(&do_stuff, 0);
}

/*****************************************************************************/
/*
*
*/
static int __init rwsem_init_module(void)
{
unsigned long loop;

if (verbose)
printk("\nrwsem_any starting tests...\n");

init_MUTEX(&mutex_sem);
init_rwsem(&rwsem_sem);
atomic_set(&do_stuff, 1);

#ifdef DEBUG_RWSEM
rwsem_to_monitor = &rwsem_sem;
#endif

/* kick off all the children */
for (loop = 0; loop < 20; loop++) {
if (loop < nummx) {
init_completion(&mx_comp[loop]);
kernel_thread(mutexer, (void *) loop, 0);
}

if (loop < numrd) {
init_completion(&rd_comp[loop]);
kernel_thread(reader, (void *) loop, 0);
}

if (loop < numwr) {
init_completion(&wr_comp[loop]);
kernel_thread(writer, (void *) loop, 0);
}

if (loop < numdg) {
init_completion(&dg_comp[loop]);
kernel_thread(downgrader, (void *) loop, 0);
}
}

/* set a stop timer */
init_timer(&timer);
timer.function = stop_test;
timer.expires = jiffies + elapse * HZ;
add_timer(&timer);

/* now wait until it's all done */
for (loop = 0; loop < nummx; loop++)
wait_for_completion(&mx_comp[loop]);

for (loop = 0; loop < numrd; loop++)
wait_for_completion(&rd_comp[loop]);

for (loop = 0; loop < numwr; loop++)
wait_for_completion(&wr_comp[loop]);

for (loop = 0; loop < numdg; loop++)
wait_for_completion(&dg_comp[loop]);

atomic_set(&do_stuff, 0);
del_timer(&timer);

/* print the results */
if (verbose) {
#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
printk("rwsem counter = %ld\n", rwsem_sem.count);
#else
printk("rwsem counter = %d\n", rwsem_sem.activity);
#endif

printk("mutexes taken: %d\n", atomic_read(&mutexes_taken));
printk("reads taken: %d\n", atomic_read(&reads_taken));
printk("writes taken: %d\n", atomic_read(&writes_taken));
printk("downgrades taken: %d\n", atomic_read(&downgrades_taken));
}
else {
printk("%3d %3d %3d %3d %c %3d %9d %9d %9d %9d\n",
nummx, numrd, numwr, numdg,
do_sched ? 's' : '-',
#ifdef LOAD_TEST
load,
#else
0,
#endif
atomic_read(&mutexes_taken),
atomic_read(&reads_taken),
atomic_read(&writes_taken),
atomic_read(&downgrades_taken));
}

#ifdef DEBUG_RWSEM
rwsem_to_monitor = NULL;
#endif

/* tell insmod to discard the module */
return -ENOANO;
} /* end rwsem_init_module() */

module_init(rwsem_init_module);

2005-03-11 09:41:18

by Ingo Molnar

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks


* Arjan van de Ven <[email protected]> wrote:

> > Ingo, we already have a touch_nmi_watchdog() in the sysrq code. It might be
> > worth adding a touch_softlockup_watchdog() wherever we have a
> > touch_nmi_watchdog().
>
> ....or add touch_softlockup_watchdog to touch_nmi_watchdog() instead
> and rename it tickle_watchdog() over time.

you mean like:

+extern void touch_softlockup_watchdog(void);

in:

http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11/2.6.11-mm2/broken-out/detect-soft-lockups.patch

?

Ingo

2005-03-11 10:09:00

by Andrew Morton

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks

Ingo Molnar <[email protected]> wrote:
>
>
> * Arjan van de Ven <[email protected]> wrote:
>
> > > Ingo, we already have a touch_nmi_watchdog() in the sysrq code. It might be
> > > worth adding a touch_softlockup_watchdog() wherever we have a
> > > touch_nmi_watchdog().
> >
> > ....or add touch_softlockup_watchdog to touch_nmi_watchdog() instead
> > and rename it tickle_watchdog() over time.
>
> you mean like:
>
> +extern void touch_softlockup_watchdog(void);
>
> in:
>
> http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11/2.6.11-mm2/broken-out/detect-soft-lockups.patch
>
> ?
>

Nope.

This particular lockup happened because a huge stream of stuff was sent to
the serial console.

We already have a touch_nmi_watchdog() in that code.

We should arrange for touch_softlockup_watchdog() to be called whenever
touch_nmi_watchdog() is called.

2005-03-11 10:26:45

by Ingo Molnar

Subject: Re: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks


* Andrew Morton <[email protected]> wrote:

> We should arrange for touch_softlockup_watchdog() to be called
> whenever touch_nmi_watchdog() is called.

the patch below adds a touch_softlockup_watchdog() call to every
touch_nmi_watchdog() call.

[A future consolidation patch should introduce a touch_watchdogs() call
that will do both a touch_nmi_watchdog() [if available on the platform]
and a touch_softlockup_watchdog() call.]

Signed-off-by: Ingo Molnar <[email protected]>

--- linux/arch/x86_64/kernel/nmi.c.orig
+++ linux/arch/x86_64/kernel/nmi.c
@@ -378,6 +378,11 @@ void touch_nmi_watchdog (void)
*/
for (i = 0; i < NR_CPUS; i++)
alert_counter[i] = 0;
+
+ /*
+ * Tickle the softlockup detector too:
+ */
+ touch_softlockup_watchdog();
}

void nmi_watchdog_tick (struct pt_regs * regs, unsigned reason)
--- linux/arch/i386/kernel/nmi.c.orig
+++ linux/arch/i386/kernel/nmi.c
@@ -469,6 +469,11 @@ void touch_nmi_watchdog (void)
*/
for (i = 0; i < NR_CPUS; i++)
alert_counter[i] = 0;
+
+ /*
+ * Tickle the softlockup detector too:
+ */
+ touch_softlockup_watchdog();
}

extern void die_nmi(struct pt_regs *, const char *msg);