2006-03-07 12:34:14

by Magnus Damm

Subject: SMP and 101% cpu max?

Hi guys,

I'm doing some memory related hacking, and part of that is testing the
current behaviour of the Linux VM. This testing involves running some
simple tests (building the linux kernel is one of the tests) while
varying the amount of RAM available to the kernel.

I've tested memory configurations of 64MB, 128MB and 256MB on a Dual
PIII machine. The tested kernel is 2.6.16-rc5, and user space is based
on debian-3.1. I run 5 tests per memory configuration, and the machine
is rebooted between each test.
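
For reference, the per-run numbers below could be collected with something
like this minimal C harness (illustrative only - not my actual test script):
it forks the build command, reaps it with wait4(), and prints elapsed, user
and system time plus a CPU percentage in the shape of the "time: ... cpu: ..."
lines below (the ram/swap columns just come from /proc/meminfo).

/* Illustrative timing harness (not the actual test script):
 * runs a command, reports real/user/sys seconds and CPU percentage. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	struct timeval t0, t1;
	struct rusage ru;
	double real, user, sys;
	int status;
	pid_t pid;

	if (argc < 2)
		return 1;

	gettimeofday(&t0, NULL);
	pid = fork();
	if (pid == 0) {
		execvp(argv[1], &argv[1]);
		_exit(127);
	}
	wait4(pid, &status, 0, &ru);	/* child's rusage includes its reaped children */
	gettimeofday(&t1, NULL);

	real = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
	sys = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

	printf("time: %.3f %.2f %.2f, cpu: %.0f\n",
	       real, user, sys, real > 0 ? 100.0 * (user + sys) / real : 0.0);
	return 0;
}

Invoked e.g. as "./timer make -j 2 bzImage".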

Problem:

With 128MB and 256MB configurations, a majority of the tests never
make it over 101% CPU usage when I run "make -j 2 bzImage", building an
allnoconfig kernel. With 64MB memory, everything seems to be working
as expected. Also, running "make bzImage" works as expected too.

Results for "make bzImage":

# time: real user sys, cpu: percentage, ram: KB, swap: KB

time: 229.261 211.6 18.18, cpu: 100, ram: 256716, swap: 125944
time: 229.621 211.45 17.51, cpu: 100, ram: 256716, swap: 125944
time: 229.698 212.11 17.68, cpu: 100, ram: 256716, swap: 125944
time: 230.711 211.89 17.86, cpu: 100, ram: 256716, swap: 125944
time: 232.219 210.55 18.5, cpu: 99, ram: 256716, swap: 125944
time: 233.203 213.34 17.79, cpu: 99, ram: 126876, swap: 125944
time: 233.371 213.82 17.15, cpu: 99, ram: 126876, swap: 125944
time: 234.18 213.43 17.92, cpu: 99, ram: 126876, swap: 125944
time: 234.315 213.11 17.92, cpu: 99, ram: 126876, swap: 125944
time: 235.334 215.06 17.22, cpu: 99, ram: 126876, swap: 125944
time: 241.159 222.82 15.38, cpu: 99, ram: 61956, swap: 125944
time: 241.299 222.39 15.34, cpu: 99, ram: 61956, swap: 125944
time: 241.475 223.16 15.38, cpu: 99, ram: 61956, swap: 125944
time: 241.955 223.24 15.66, cpu: 99, ram: 61956, swap: 125944
time: 242.099 222.92 15.98, cpu: 99, ram: 61956, swap: 125944

Results for "make -j 2 bzImage":

# time: real user sys, cpu: percentage, ram: KB, swap: KB

time: 124.336 220.03 18.98, cpu: 192, ram: 126876, swap: 125944
time: 124.547 217.57 19.15, cpu: 190, ram: 256716, swap: 125944
time: 125.162 218.35 19.51, cpu: 190, ram: 256716, swap: 125944
time: 132.488 228.71 17.1, cpu: 186, ram: 61956, swap: 125944
time: 132.502 230.93 16.26, cpu: 187, ram: 61956, swap: 125944
time: 132.634 230.55 16.42, cpu: 186, ram: 61956, swap: 125944
time: 132.643 229.89 17.63, cpu: 187, ram: 61956, swap: 125944
time: 133.2 230.09 16.28, cpu: 185, ram: 61956, swap: 125944
time: 227.371 211.45 17.36, cpu: 101, ram: 256716, swap: 125944
time: 227.898 211.93 17.3, cpu: 101, ram: 256716, swap: 125944
time: 228.071 212.21 17.15, cpu: 101, ram: 256716, swap: 125944
time: 228.788 212.46 17.88, cpu: 101, ram: 126876, swap: 125944
time: 229.223 214.14 16.46, cpu: 101, ram: 126876, swap: 125944
time: 229.255 213.56 17.19, cpu: 101, ram: 126876, swap: 125944
time: 230.296 214.25 17.44, cpu: 101, ram: 126876, swap: 125944

Any ideas what is causing this? Is it a memory-subsystem issue, or CPU
scheduling?

Thanks,

/ magnus


2006-03-07 13:07:16

by Jesper Juhl

Subject: Re: SMP and 101% cpu max?

On 3/7/06, Magnus Damm <[email protected]> wrote:
> Hi guys,
>
> I'm doing some memory related hacking, and part of that is testing the
> current behaviour of the Linux VM. This testing involves running some
> simple tests (building the linux kernel is one of the tests) while
> varying the amount of RAM available to the kernel.
>
> I've tested memory configurations of 64MB, 128MB and 256MB on a Dual
> PIII machine. The tested kernel is 2.6.16-rc5, and user space is based
> on debian-3.1. I run 5 tests per memory configuration, and the machine
> is rebooted between each test.
>
> Problem:
>
> With 128MB and 256MB configurations, a majority of the tests never
> make it over 101% CPU usage when I run "make -j 2 bzImage", building an
> allnoconfig kernel. With 64MB memory, everything seems to be working
> as expected. Also, running "make bzImage" works as expected too.
>
Hmm, I wonder if it's related to the problem I reported here :
http://lkml.org/lkml/2006/2/28/219
Where I need to run make -j 5 or higher to load both cores of my Athlon X2.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-03-07 13:21:03

by Bryan Holty

Subject: Re: [PATCH] IRQ: prevent enabling of previously disabled interrupt

On Tuesday 07 March 2006 05:55, Andrew Morton wrote:
> "[email protected]" <[email protected]> wrote:
>>
>> Hi,
>> This fix prevents re-disabling and enabling of a previously disabled
>> interrupt in 2.6.16-rc5. On an SMP system with irq balancing enabled:
>> if an interrupt is disabled from within its own interrupt context with
>> disable_irq_nosync and is also earmarked for processor migration,
>> the interrupt is blindly moved to the other processor and enabled
>> without regard for its current "enabled" state. If there is an
>> interrupt
>> pending, it will unexpectedly invoke the irq handler on the new irq
>> owning processor (even though the irq was previously disabled)
>>
>> The more intuitive fix would be to invoke disable_irq_nosync and
>> enable_irq, but since we already have the desc->lock from __do_IRQ, we
>> cannot call them directly. Instead we can use the same logic to
>> disable and enable found in disable_irq_nosync and enable_irq, with
>> regards to the desc->depth.
>>
>> This now prevents a disabled interrupt from being re-disabled,
>> and more importantly prevents a disabled interrupt from being
>> incorrectly enabled on a different processor.
>>
>> Signed-off-by: Bryan Holty <[email protected]>
>>
>> --- 2.6.16-rc5/include/linux/irq.h
>> +++ b/include/linux/irq.h
>> @@ -155,9 +155,13 @@
>>  	 * Being paranoid i guess!
>>  	 */
>>  	if (unlikely(!cpus_empty(tmp))) {
>> -		desc->handler->disable(irq);
>> +		if (likely(!desc->depth++))
>> +			desc->handler->disable(irq);
>> +
>>  		desc->handler->set_affinity(irq,tmp);
>> -		desc->handler->enable(irq);
>> +
>> +		if (likely(!--desc->depth))
>> +			desc->handler->enable(irq);
>>  	}
>>  	cpus_clear(pending_irq_cpumask[irq]);
>>  }
>
> But desc->lock isn't held here. We need that for the update to ->depth (at
> least).
>
> And we can't take it here because one of the two ->end callers in __do_IRQ
> already holds that lock. Possibly we should require that ->end callers
> hold the lock, but that would incur considerable cost for cpu-local
> interrupts.
>
> So we'd need to require that ->end gets called outside the lock for
> non-CPU-local interrupts. I'm not sure what the implications of that would
> be - the ->end handlers don't need to be threaded at present and perhaps we
> could put hardware into a bad state?
>
> Or we add a new ->local_end, just for the CPU-local IRQs.

Another option is to check for the disabled flag explicitly. The check prior
to the disable could arguably be removed, but the check prior to the
enable is necessary. If the interrupt has been explicitly disabled, as with
the IRQ_DISABLED flag, then it will take an explicit effort to re-enable it.


--- 2.6.16-rc5/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -155,9 +155,13 @@
 	 * Being paranoid i guess!
 	 */
 	if (unlikely(!cpus_empty(tmp))) {
-		desc->handler->disable(irq);
+		if (likely(!(desc->status & IRQ_DISABLED)))
+			desc->handler->disable(irq);
+
 		desc->handler->set_affinity(irq,tmp);
-		desc->handler->enable(irq);
+
+		if (likely(!(desc->status & IRQ_DISABLED)))
+			desc->handler->enable(irq);
 	}
 	cpus_clear(pending_irq_cpumask[irq]);
 }

--
Bryan Holty

2006-03-07 23:56:36

by Andrew Morton

Subject: Re: [PATCH] IRQ: prevent enabling of previously disabled interrupt

Bryan Holty <[email protected]> wrote:
>
> On Tuesday 07 March 2006 05:55, Andrew Morton wrote:
> > "[email protected]" <[email protected]> wrote:
> >>
> >> Hi,
> >> This fix prevents re-disabling and enabling of a previously disabled
> >> interrupt in 2.6.16-rc5. On an SMP system with irq balancing enabled:
> >> if an interrupt is disabled from within its own interrupt context with
> >> disable_irq_nosync and is also earmarked for processor migration,
> >> the interrupt is blindly moved to the other processor and enabled
> >> without regard for its current "enabled" state. If there is an
> >> interrupt
> >> pending, it will unexpectedly invoke the irq handler on the new irq
> >> owning processor (even though the irq was previously disabled)
> >>
> >> The more intuitive fix would be to invoke disable_irq_nosync and
> >> enable_irq, but since we already have the desc->lock from __do_IRQ, we
> >> cannot call them directly. Instead we can use the same logic to
> >> disable and enable found in disable_irq_nosync and enable_irq, with
> >> regards to the desc->depth.
> >>
> >> This now prevents a disabled interrupt from being re-disabled,
> >> and more importantly prevents a disabled interrupt from being
> >> incorrectly enabled on a different processor.
> >>
> >> Signed-off-by: Bryan Holty <[email protected]>
> >>
> >> --- 2.6.16-rc5/include/linux/irq.h
> >> +++ b/include/linux/irq.h
> >> @@ -155,9 +155,13 @@
> >>  	 * Being paranoid i guess!
> >>  	 */
> >>  	if (unlikely(!cpus_empty(tmp))) {
> >> -		desc->handler->disable(irq);
> >> +		if (likely(!desc->depth++))
> >> +			desc->handler->disable(irq);
> >> +
> >>  		desc->handler->set_affinity(irq,tmp);
> >> -		desc->handler->enable(irq);
> >> +
> >> +		if (likely(!--desc->depth))
> >> +			desc->handler->enable(irq);
> >>  	}
> >>  	cpus_clear(pending_irq_cpumask[irq]);
> >>  }
> >
> > But desc->lock isn't held here. We need that for the update to ->depth (at
> > least).
> >
> > And we can't take it here because one of the two ->end callers in __do_IRQ
> > already holds that lock. Possibly we should require that ->end callers
> > hold the lock, but that would incur considerable cost for cpu-local
> > interrupts.
> >
> > So we'd need to require that ->end gets called outside the lock for
> > non-CPU-local interrupts. I'm not sure what the implications of that would
> > be - the ->end handlers don't need to be threaded at present and perhaps we
> > could put hardware into a bad state?
> >
> > Or we add a new ->local_end, just for the CPU-local IRQs.
>
> ...
>
> Another option is to check for the disabled flag explicitly. The check prior
> to the disable could arguably be removed, but the check prior to the
> enable is necessary. If the interrupt has been explicitly disabled, as with
> the IRQ_DISABLED flag, then it will take an explicit effort to re-enable it.
>
>
> --- 2.6.16-rc5/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -155,9 +155,13 @@
>  	 * Being paranoid i guess!
>  	 */
>  	if (unlikely(!cpus_empty(tmp))) {
> -		desc->handler->disable(irq);
> +		if (likely(!(desc->status & IRQ_DISABLED)))
> +			desc->handler->disable(irq);
> +
>  		desc->handler->set_affinity(irq,tmp);
> -		desc->handler->enable(irq);
> +
> +		if (likely(!(desc->status & IRQ_DISABLED)))
> +			desc->handler->enable(irq);
>  	}
>  	cpus_clear(pending_irq_cpumask[irq]);
>  }

Yes, but we're still racy against (say) disable_irq() and enable_irq(). We
can end up getting the wrong value of desc->status, or we can get
desc->status and irq-enabledness out of sync, or we can end up running
->enable() or ->disable() on two CPUs at the same time.

And I think it's a bug we _already_ have: if one CPU runs disable_irq()
against a different CPU's cpu-local interrupt, both CPUs could end up
talking to the IRQ hardware concurrently. Probably there's nowhere where
that happens, but the APIs permit it.
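
To illustrate the kind of check-then-act race described above, here is a
userspace sketch (not kernel code - the thread names, the local IRQ_DISABLED
define and the fake "hardware" flag are all invented for illustration): one
thread plays disable_irq(), setting the status flag and then masking the
"hardware", while the other plays the migration path that tests the flag
before re-enabling. Without a common lock around both steps, the flag and the
hardware state can end up disagreeing.

/* Userspace sketch of the flag/hardware race (illustrative only).
 * Build with: gcc -pthread race.c */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define IRQ_DISABLED 1

static volatile int status;		/* stands in for desc->status */
static volatile int hw_enabled;		/* stands in for the real PIC state */

static void *disabler(void *arg)	/* plays disable_irq() */
{
	status |= IRQ_DISABLED;		/* mark it disabled... */
	sched_yield();			/* widen the race window */
	hw_enabled = 0;			/* ...then mask it in "hardware" */
	return NULL;
}

static void *migrator(void *arg)	/* plays the migration path */
{
	if (!(status & IRQ_DISABLED)) {	/* the proposed flag check */
		sched_yield();
		hw_enabled = 1;		/* ->enable() after set_affinity() */
	}
	return NULL;
}

int main(void)
{
	int i, races = 0;

	for (i = 0; i < 100000; i++) {
		pthread_t a, b;

		status = 0;
		hw_enabled = 1;
		pthread_create(&a, NULL, migrator, NULL);
		pthread_create(&b, NULL, disabler, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		/* flag says disabled, but the "hardware" got re-enabled */
		if ((status & IRQ_DISABLED) && hw_enabled)
			races++;
	}
	printf("flag/hardware out of sync in %d of 100000 runs\n", races);
	return 0;
}

Holding one lock around both the flag test and the hardware poke - which is
what desc->lock provides in the real code - closes that window.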


I guess what we could do is to add, in move_native_irq():

	if (CHECK_IRQ_PER_CPU(desc->status))
		return;

because there's no reason to consider migrating a cpu-local interrupt.
Surely that effect is already happening in there by some means, but being
explicit about it won't hurt.

Once we've done that, we know that your patch is safe - the caller holds
the lock and we can stick an assert_spin_locked() in there.

In fact there's a comment in there asserting that the caller always holds
the lock.

Like this? (Note that I uninlined move_native_irq(). Nearly fell out of
my chair when I saw that thing).


--- 25/kernel/irq/migration.c~irq-prevent-enabling-of-previously-disabled-interrupt	Tue Mar  7 15:54:25 2006
+++ 25-akpm/kernel/irq/migration.c	Tue Mar  7 15:58:19 2006
@@ -18,9 +18,17 @@ void move_native_irq(int irq)
 	cpumask_t tmp;
 	irq_desc_t *desc = irq_descp(irq);
 
-	if (likely (!desc->move_irq))
+	if (likely(!desc->move_irq))
 		return;
 
+	/*
+	 * Paranoia: cpu-local interrupts shouldn't be calling in here anyway.
+	 */
+	if (CHECK_IRQ_PER_CPU(desc->status)) {
+		WARN_ON(1);
+		return;
+	}
+
 	desc->move_irq = 0;
 
 	if (likely(cpus_empty(pending_irq_cpumask[irq])))
@@ -29,7 +37,8 @@ void move_native_irq(int irq)
 	if (!desc->handler->set_affinity)
 		return;
 
-	/* note - we hold the desc->lock */
+	assert_spin_locked(&desc->lock);
+
 	cpus_and(tmp, pending_irq_cpumask[irq], cpu_online_map);
 
 	/*
@@ -42,9 +51,13 @@ void move_native_irq(int irq)
 	 * Being paranoid i guess!
 	 */
 	if (unlikely(!cpus_empty(tmp))) {
-		desc->handler->disable(irq);
+		if (likely(!(desc->status & IRQ_DISABLED)))
+			desc->handler->disable(irq);
+
 		desc->handler->set_affinity(irq,tmp);
-		desc->handler->enable(irq);
+
+		if (likely(!(desc->status & IRQ_DISABLED)))
+			desc->handler->enable(irq);
 	}
 	cpus_clear(pending_irq_cpumask[irq]);
 }
_

2006-03-08 01:53:06

by Magnus Damm

Subject: Re: SMP and 101% cpu max?

On 3/7/06, Jesper Juhl <[email protected]> wrote:
> On 3/7/06, Magnus Damm <[email protected]> wrote:
> > Hi guys,
> >
> > I'm doing some memory related hacking, and part of that is testing the
> > current behaviour of the Linux VM. This testing involves running some
> > simple tests (building the linux kernel is one of the tests) while
> > varying the amount of RAM available to the kernel.
> >
> > I've tested memory configurations of 64MB, 128MB and 256MB on a Dual
> > PIII machine. The tested kernel is 2.6.16-rc5, and user space is based
> > on debian-3.1. I run 5 tests per memory configuration, and the machine
> > is rebooted between each test.
> >
> > Problem:
> >
> > With 128MB and 256MB configurations, a majority of the tests never
> > make it over 101% CPU usage when I run "make -j 2 bzImage", building an
> > allnoconfig kernel. With 64MB memory, everything seems to be working
> > as expected. Also, running "make bzImage" works as expected too.
> >
> Hmm, I wonder if it's related to the problem I reported here :
> http://lkml.org/lkml/2006/2/28/219
> Where I need to run make -j 5 or higher to load both cores of my Athlon X2.

Yeah, the problem looks kind of related. But I doubt it's only the CPU
scheduler playing tricks on us, because in my 64MB case things seem to
work OK at all times. But maybe it's just random.

I've spent this morning testing 2.6.15, and that version seems to work
OK. And just to double-check, I rebuilt and retested 2.6.16-rc5. This
round gives me similar results - now it fails to go above 101% CPU
load in one case.

Results for "make -j 2 bzImage", 2.6.15:

time: 123.527 220.11 20.27, cpu: 195, ram: 126896, swap: 125944
time: 123.823 219.48 20.57, cpu: 194, ram: 126896, swap: 125944
time: 123.867 221.1 19.35, cpu: 194, ram: 126896, swap: 125944
time: 124.6 220.49 20.25, cpu: 193, ram: 126896, swap: 125944
time: 124.96 222.14 20.18, cpu: 194, ram: 126896, swap: 125944

Results for "make -j 2 bzImage", 2.6.16-rc5:

time: 125.045 220.02 18.71, cpu: 191, ram: 126876, swap: 125944
time: 125.052 221.93 18.35, cpu: 192, ram: 126876, swap: 125944
time: 125.598 220.16 19.34, cpu: 191, ram: 126876, swap: 125944
time: 125.7 220.07 19.32, cpu: 190, ram: 126876, swap: 125944
time: 229.316 212.86 17.72, cpu: 101, ram: 126876, swap: 125944

/ magnus

2006-03-08 02:50:41

by L A Walsh

Subject: FWIW: Re: SMP and 101% cpu max?

FYI, running/compiling 2.6.15.5 on a 2x 1GHz P-III with 1GB RAM;
times for "make -jn bzImage" (no modules):
(using export TIMEFORMAT="%2Rsec %2Uusr %2Ssys (%P%% cpu)")

-j1: 815.80sec 745.64usr 78.74sys (100.00% cpu)
-j2: 445.17sec 778.68usr 86.22sys (100.00% cpu)
-j3: 444.89sec 781.66usr 87.84sys (100.00% cpu)
-j4: 443.08sec 781.81usr 87.97sys (100.00% cpu)
-j5: 445.98sec 782.53usr 87.51sys (100.00% cpu)
-----

I am not seeing the symptom you are describing. The load
increases proportionately to the 'job limit', but it doesn't
radically change the overall CPU time required. As I have
only 2 CPUs, I can't expect much benefit beyond 2x, with
the actual speedup closer to 1.8x.

-l


Jesper Juhl wrote:
> On 3/7/06, Magnus Damm <[email protected]> wrote:
>
>> With 128MB and 256MB configurations, a majority of the tests never
>> make it over 101% CPU usage when I run "make -j 2 bzImage", building an
>> allnoconfig kernel. With 64MB memory, everything seems to be working
>> as expected. Also, running "make bzImage" works as expected too.
>>
> Hmm, I wonder if it's related to the problem I reported here :
> http://lkml.org/lkml/2006/2/28/219
> Where I need to run make -j 5 or higher to load both cores of my Athlon X2.
>

2006-03-08 03:04:30

by Magnus Damm

Subject: Re: FWIW: Re: SMP and 101% cpu max?

On 3/8/06, Linda Walsh <[email protected]> wrote:
> FYI, running/compiling 2.6.15.5 on a 2x 1GHz P-III with 1GB RAM;
> times for "make -jn bzImage" (no modules):
> (using export TIMEFORMAT="%2Rsec %2Uusr %2Ssys (%P%% cpu)")
>
> -j1: 815.80sec 745.64usr 78.74sys (100.00% cpu)
> -j2: 445.17sec 778.68usr 86.22sys (100.00% cpu)
> -j3: 444.89sec 781.66usr 87.84sys (100.00% cpu)
> -j4: 443.08sec 781.81usr 87.97sys (100.00% cpu)
> -j5: 445.98sec 782.53usr 87.51sys (100.00% cpu)
> -----
>
> I am not seeing the symptom you are describing. The load
> increases proportionately to the 'job limit', but it doesn't
> radically change the overall CPU time required. As I have
> only 2 CPUs, I can't expect much benefit beyond 2x, with
> the actual speedup closer to 1.8x.

Yes, that's what I'm expecting to see. And I do see similar results
for 2.6.15. But it looks like 2.6.16-rc5 is misbehaving somehow.

Thanks,

/ magnus

> -l
>
>
> Jesper Juhl wrote:
> > On 3/7/06, Magnus Damm <[email protected]> wrote:
> >
> >> With 128MB and 256MB configurations, a majority of the tests never
> >> make it over 101% CPU usage when I run "make -j 2 bzImage", building an
> >> allnoconfig kernel. With 64MB memory, everything seems to be working
> >> as expected. Also, running "make bzImage" works as expected too.
> >>
> > Hmm, I wonder if it's related to the problem I reported here :
> > http://lkml.org/lkml/2006/2/28/219
> > Where I need to run make -j 5 or higher to load both cores of my Athlon X2.
> >
>