2004-03-13 00:56:13

by Nick Piggin

Subject: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1

These are some benchmarks on a 16-way (4x4) NUMAQ. They basically
measure the scheduler patches using a couple of meaningless
but very scheduler-intensive benchmarks.

hackbench:
The number in () is a projection of the time a run of 1000 would
take, assuming linear scaling. It is probably better shown on a
graph, but you can see a non-linear element in 2.6.4 that is
basically absent in 2.6.4-mm1.

         2.6.4           2.6.4-mm1
 50       19.4 (388)      15.5 (310)
100       39.0 (390)      34.5 (345)
150       59.0 (393)      48.3 (322)
200       82.9 (414)      68.9 (344)
250      114.8 (459)      90.2 (360)
300      145.4 (484)     106.3 (354)
350      178.1 (508)     122.1 (348)
400      218.8 (547)     135.0 (337)
450      237.8 (528)     163.9 (364)
500      262.0 (524)     181.7 (363)
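
The projection in () can be reproduced with a few lines of Python. This is only a sketch: the function name `project` is made up for illustration, the rows are copied from the hackbench table, and the table appears to truncate the result to a whole number where this prints one decimal.

```python
def project(count, seconds, target=1000):
    """Scale a measured time linearly from `count` up to `target`."""
    return seconds * target / count


# (first column, 2.6.4 time, 2.6.4-mm1 time), copied from the hackbench table.
rows = [
    (50, 19.4, 15.5),
    (250, 114.8, 90.2),
    (500, 262.0, 181.7),
]

for count, base, mm1 in rows:
    # Under perfectly linear scaling each column would stay constant.
    print(f"{count:4d}  {project(count, base):6.1f}  {project(count, mm1):6.1f}")
```

If scaling were perfectly linear, every row would project to the same value, so the drift down a column is the non-linear element.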

volanomark (MPS):
This one starts getting huge mmap_sem contention at 150+ coming
from futexes. Don't know what is taking the mmap_sem for writing.
Maybe just brk or mmap.

         2.6.4    2.6.4-mm1
 15      5850     6221
 30      5682     5852
 45      4736     5700
 60      2857     5622
 75      1024     4840
 90      1832     5191
105       491     5036
120      1591     4228
135       393     4986
150      1056     1586


2004-03-19 09:49:56

by Ingo Molnar

Subject: Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1


* Nick Piggin wrote:

> volanomark (MPS):
> This one starts getting huge mmap_sem contention at 150+ coming
> from futexes. Don't know what is taking the mmap_sem for writing.
> Maybe just brk or mmap.

are you sure it's down_write() contention? down_read() can create
contention just as much, simply due to the fact that hundreds of threads
and a dozen CPUs are pounding in on the same poor lock.

i do think there should be a rw-semaphore variant that is per-cpu for
the read path. (This would also fix the 4:4 threading overhead.)

Ingo

2004-03-19 09:58:54

by Nick Piggin

Subject: Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1



Ingo Molnar wrote:

>* Nick Piggin wrote:
>
>
>>volanomark (MPS):
>>This one starts getting huge mmap_sem contention at 150+ coming
>>from futexes. Don't know what is taking the mmap_sem for writing.
>>Maybe just brk or mmap.
>>
>
>are you sure it's down_write() contention? down_read() can create
>contention just as much, simply due to the fact that hundreds of threads
>and a dozen CPUs are pounding in on the same poor lock.
>
>

No, I'm not sure actually; it could be just read lock
contention. IIRC it was all coming from the semaphore's
spinlock, in up_read...

>i do think there should be a rw-semaphore variant that is per-cpu for
>the read path. (This would also fix the 4:4 threading overhead.)
>
>

That would be interesting, yes. I have (somewhere) a patch
that wakes up the semaphore's waiters outside its spinlock.
I think that only gave about 5% or so improvement though.

2004-03-21 04:04:15

by Nick Piggin

Subject: Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1



Nick Piggin wrote:

>
> That would be interesting, yes. I have (somewhere) a patch
> that wakes up the semaphore's waiters outside its spinlock.
> I think that only gave about 5% or so improvement though.
>
>

Here is a cleaned up patch for comments. It is untested at the
moment because I don't have access to the 16-way NUMAQ now. It
moves waking of the waiters outside the spinlock.

I think it gave about 5-10% improvement when the rwsem gets
really contended. Not as much as I had hoped, but every bit
helps.

The rwsem-spinlock.c code could use the same optimisation too.
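
The idea of waking waiters outside the spinlock can be sketched in user space. This is an illustration in Python, not the attached patch: `WaitQueue`, `enqueue` and `wake_all_outside` are made-up names, a `threading.Lock` stands in for the semaphore's spinlock, and each waiter sleeps on its own `threading.Event`.

```python
import threading
from collections import deque


class WaitQueue:
    """Toy wait queue: a lock protects the list of sleeping waiters."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the semaphore's spinlock
        self._waiters = deque()

    def enqueue(self):
        """Register a waiter; the caller then blocks with .wait() on the result."""
        ev = threading.Event()
        with self._lock:
            self._waiters.append(ev)
        return ev

    def wake_all_inside(self):
        """Baseline: every wakeup is issued while the lock is still held."""
        with self._lock:
            woken = len(self._waiters)
            while self._waiters:
                self._waiters.popleft().set()
        return woken

    def wake_all_outside(self):
        """Detach the waiters under the lock, wake them after dropping it."""
        with self._lock:
            to_wake, self._waiters = self._waiters, deque()
        for ev in to_wake:  # lock is free here, so others can take it
            ev.set()
        return len(to_wake)


demo = WaitQueue()
demo_events = [demo.enqueue() for _ in range(4)]
demo_threads = [threading.Thread(target=ev.wait) for ev in demo_events]
for t in demo_threads:
    t.start()
print("woken:", demo.wake_all_outside())
for t in demo_threads:
    t.join()
```

The lock hold time in `wake_all_outside` no longer grows with the cost of each wakeup, only with the cost of detaching the list, which is the effect the patch is after.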


Attachments:
rwsem-scale.patch (3.97 kB)

2004-03-21 07:30:30

by Ingo Molnar

Subject: Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1


your patch looks interesting.

wrt. making a fully scalable MM read side:

perhaps RCU could be used to make lookup access to the vma tree and
lookup of the pagetables lockless. This would make futexes (and
pagefaults) fundamentally scalable.

another option would be to introduce a rwsem which is read-scalable, but
this would pessimise writes quite as badly as brlocks did. I'm not sure
how acceptable that is.
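
The read/write trade-off of such a lock can be shown with a small brlock-style sketch. This is an assumption-laden illustration in Python, not kernel code: `PerSlotRWLock` is a made-up name, a "slot" stands in for a CPU, and readers that share a slot exclude each other here, which a real rwsem would not do.

```python
import threading


class PerSlotRWLock:
    """One lock per slot: readers touch one lock, writers must take them all."""

    def __init__(self, nslots):
        self._slots = [threading.Lock() for _ in range(nslots)]

    def read_lock(self, slot):
        # Cheap read side: only this slot's lock is touched.
        self._slots[slot].acquire()

    def read_unlock(self, slot):
        self._slots[slot].release()

    def write_lock(self):
        # Expensive write side: take every slot, always in the same order
        # so that two writers cannot deadlock against each other.
        for lk in self._slots:
            lk.acquire()
        return len(self._slots)  # number of locks the writer had to take

    def write_unlock(self):
        for lk in reversed(self._slots):
            lk.release()

    def held(self):
        """Report which slot locks are currently held."""
        return [lk.locked() for lk in self._slots]


rw = PerSlotRWLock(16)
rw.read_lock(3)
print("slots held by one reader:", sum(rw.held()))
rw.read_unlock(3)
print("locks taken by one writer:", rw.write_lock())
rw.write_unlock()
```

Readers on different slots never touch the same lock, while the cost of a write grows with the number of slots.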

Ingo

* Nick Piggin wrote:

>
>
> Nick Piggin wrote:
>
> >
> >That would be interesting, yes. I have (somewhere) a patch
> >that wakes up the semaphore's waiters outside its spinlock.
> >I think that only gave about 5% or so improvement though.
> >
> >
>
> Here is a cleaned up patch for comments. It is untested at the
> moment because I don't have access to the 16-way NUMAQ now. It
> moves waking of the waiters outside the spinlock.
>
> I think it gave about 5-10% improvement when the rwsem gets
> really contended. Not as much as I had hoped, but every bit
> helps.
>
> The rwsem-spinlock.c code could use the same optimisation too.
>


2004-03-21 08:09:14

by Nick Piggin

Subject: Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1



Ingo Molnar wrote:

>your patch looks interesting.
>
>

I'll see if I can get some numbers for it soon.

>wrt. making a fully scalable MM read side:
>
>perhaps RCU could be used to make lookup access to the vma tree and
>lookup of the pagetables lockless. This would make futexes (and
>pagefaults) fundamentally scalable.
>
>another option would be to introduce a rwsem which is read-scalable, but
>this would pessimise writes quite as badly as brlocks did. I'm not sure
>how acceptable that is.
>
>

It is a pretty silly benchmark. But I guess one day someone
is going to complain about mm scalability.