2012-10-05 23:14:52

by Andi Kleen

Subject: Re: [PATCH 00/33] AutoNUMA27

Andrew Morton <[email protected]> writes:

> On Thu, 4 Oct 2012 01:50:42 +0200
> Andrea Arcangeli <[email protected]> wrote:
>
>> This is a new AutoNUMA27 release for Linux v3.6.
>
> Peter's numa/sched patches have been in -next for a week.

Did they pass review? I have some doubts.

The last time I looked it also broke numactl.

> Guys, what's the plan here?

Since they are both performance features, their ultimate benefit
is how much faster they make things (and how seldom they make things
slower).

IMHO this needs a performance shoot-out: run both on the same 10
workloads and see who wins. Just a lot of work. Any volunteers?

For a change like this, I think less regression is actually more
important than the highest peak numbers.

-Andi

--
[email protected] -- Speaking for myself only


2012-10-05 23:57:13

by Tim Chen

Subject: Re: [PATCH 00/33] AutoNUMA27

On Fri, 2012-10-05 at 16:14 -0700, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
>
> > On Thu, 4 Oct 2012 01:50:42 +0200
> > Andrea Arcangeli <[email protected]> wrote:
> >
> >> This is a new AutoNUMA27 release for Linux v3.6.
> >
> > Peter's numa/sched patches have been in -next for a week.
>
> Did they pass review? I have some doubts.
>
> The last time I looked it also broke numactl.
>
> > Guys, what's the plan here?
>
> Since they are both performance features, their ultimate benefit
> is how much faster they make things (and how seldom they make things
> slower).
>
> IMHO this needs a performance shoot-out: run both on the same 10
> workloads and see who wins. Just a lot of work. Any volunteers?
>
> For a change like this, I think less regression is actually more
> important than the highest peak numbers.
>
> -Andi
>

I remember that when Alex tested the numa/sched patches about three
months ago, there was a 20% regression on SPECjbb2005 due to the NUMA
balancer. Those issues may have been fixed, but we probably need to
rerun this benchmark against the latest code. For most of the other
kernel performance workloads we ran, we didn't see much change.

Maurico has a different config for this benchmark, and it would be
nice if he could also check whether there are any performance changes
on his side.

Tim

2012-10-06 00:11:36

by Andi Kleen

Subject: Re: [PATCH 00/33] AutoNUMA27

Tim Chen <[email protected]> writes:
>>
>
> I remember that when Alex tested the numa/sched patches about three
> months ago, there was a 20% regression on SPECjbb2005 due to the NUMA
> balancer.

20% on anything sounds like a show stopper to me.

-Andi

--
[email protected] -- Speaking for myself only

2012-10-08 13:56:20

by Don Morris

Subject: Re: [PATCH 00/33] AutoNUMA27

On 10/05/2012 05:11 PM, Andi Kleen wrote:
> Tim Chen <[email protected]> writes:
>>>
>>
>> I remember that when Alex tested the numa/sched patches about three
>> months ago, there was a 20% regression on SPECjbb2005 due to the NUMA
>> balancer.
>
> 20% on anything sounds like a show stopper to me.
>
> -Andi
>

It's much worse than that on an 8-way machine for a multi-node,
multi-threaded process, from what I can tell (Andrea's AutoNUMA
microbenchmark is a simple version of that). The contention on the
page table lock (&(&mm->page_table_lock)->rlock) goes through the
roof, with threads constantly fighting to invalidate translations and
re-fault them.

This is on a DL980 with Xeon E7-2870s @ 2.4 GHz, btw.
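
For anyone who wants to reproduce the contention numbers: with a
lockstat-enabled kernel (CONFIG_LOCK_STAT=y) the per-lock counters
show up in /proc/lock_stat, and a trivial filter like the sketch below
is enough to pull out the page_table_lock lines. The program is purely
illustrative; it is not part of any of the patches under discussion.

/*
 * Illustration only: print the /proc/lock_stat entries that mention
 * page_table_lock. Requires a kernel built with CONFIG_LOCK_STAT=y;
 * otherwise the file simply does not exist.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/lock_stat", "r");
	char line[1024];

	if (!f) {
		perror("/proc/lock_stat (CONFIG_LOCK_STAT needed)");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "page_table_lock"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}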

Running linux-next with no tweaks other than
kernel.sched_migration_cost_ns = 500000 gives:
numa01:           8325.78
numa01_HARD_BIND:  488.98

(The HARD_BIND case pre-binds the threads to the node set holding
their memory, so it should be a fairly "best case" for comparison;
a rough libnuma sketch of that kind of binding follows.)
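
Each worker in the hard-bound case does roughly what the sketch below
shows before touching its memory. This is only an illustration, not
Andrea's actual harness; the choice of node 0 is arbitrary, and
numa_bind() restricts both where the calling thread runs and where its
allocations come from. Link with -lnuma.

/*
 * Illustration only: a libnuma "hard bind" of the calling thread's
 * CPU placement and memory allocations to a single node (node 0 here,
 * chosen arbitrarily for the example).
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	struct bitmask *nodes;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	nodes = numa_parse_nodestring("0");
	if (!nodes) {
		fprintf(stderr, "could not parse node string\n");
		return 1;
	}
	numa_bind(nodes);	/* CPUs and memory both pinned to node 0 */
	numa_bitmask_free(nodes);

	/* ... allocate and hammer on memory here, as numa01 does ... */

	return 0;
}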

If the SchedNUMA scanning period is upped to 25000 ms (to keep repeated
invalidations from being triggered while the contention for the first
invalidation pass is still being fought over):
numa01:           4272.93
numa01_HARD_BIND:  498.98

Since this is a "big" process in the current SchedNUMA code and hence
much more likely to trip invalidations, forcing task_numa_big() to
always return false in order to avoid the frequent invalidations gives:
numa01:            429.07
numa01_HARD_BIND:  466.67

Finally, with SchedNUMA entirely disabled but the rest of linux-next
left intact:
numa01:           1075.31
numa01_HARD_BIND:  484.20

I didn't write down the lock contention numbers for comparison, but
yes - the contention drops roughly in line with the run times.

There are other microbenchmarks, but those suffice to show the
regression pattern. I mentioned this to the Red Hat folks last
week, so I expect this is already being worked on. It seemed pertinent
to bring up given the discussion about the current state of linux-next,
though, just so folks know. From where I'm sitting, it looks to
me like the scan period is way too aggressive and there's too much
work potentially attempted during a "scan" (by which I mean the
hard-tick-driven choice to invalidate translations in order to set up
potential migration faults). The current code walks/invalidates the
entire virtual address space, skipping few vmas. For a very large
64-bit process, that's going to be a *lot* of translations (or even
vmas, if the address space is fragmented) to walk. That's a seriously
long path coming from the timer code. I would think capping the number
of translations processed per visit would help; a rough sketch of the
shape I have in mind follows.
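
This is plain user-space C with made-up names (SCAN_CAP, scan_cursor,
invalidate_entry), purely to illustrate the idea of a bounded batch
per tick plus a resume cursor; it is not a patch against the actual
sched/numa code.

/*
 * Illustration only: a capped, resumable scan. In the real code the
 * "entries" would be page-table entries walked under the mm locks;
 * here a plain array stands in for them.
 */
#include <stddef.h>
#include <stdio.h>

#define NR_ENTRIES	(1 << 20)	/* stand-in for a big address space */
#define SCAN_CAP	256		/* max entries touched per scan tick */

static int entries[NR_ENTRIES];
static size_t scan_cursor;		/* where the next tick resumes */

static void invalidate_entry(size_t i)
{
	entries[i] = 0;			/* stand-in for a PTE invalidation */
}

/*
 * Called once per "scan tick": do a bounded amount of work and
 * remember where to pick up next time, wrapping at the end.
 */
static void scan_some(void)
{
	size_t done;

	for (done = 0; done < SCAN_CAP; done++) {
		invalidate_entry(scan_cursor);
		scan_cursor = (scan_cursor + 1) % NR_ENTRIES;
	}
}

int main(void)
{
	int tick;

	for (tick = 0; tick < 8; tick++)	/* pretend timer ticks */
		scan_some();

	printf("cursor after 8 ticks: %zu\n", scan_cursor);
	return 0;
}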

Hope this helps the discussion,
Don Morris

2012-10-08 20:35:36

by Rik van Riel

Subject: Re: [PATCH 00/33] AutoNUMA27

On Fri, 05 Oct 2012 16:14:44 -0700
Andi Kleen <[email protected]> wrote:

> IMHO this needs a performance shoot-out: run both on the same 10
> workloads and see who wins. Just a lot of work. Any volunteers?

Here are some preliminary results from simple benchmarks on a
4-node, 32 CPU core (4x8 core) Dell PowerEdge R910 system.

For the simple linpack streams benchmark, both sched/numa and
autonuma are within the margin of error compared to manual
tuning of task affinity. This is a big win, since the current
upstream scheduler has regressions of 10-20% when the system
runs 4 through 16 streams processes.

For specjbb, the story is more complicated. After Larry, Peter and I
fixed the obvious bugs in sched/numa and got some basic
cpu-follows-memory code going (not yet in -tip AFAIK), the averaged
results look like this:

baseline: 246019
manual pinning: 285481 (+16%)
autonuma: 266626 (+8%)
sched/numa: 226540 (-8%)

This is with newer sched/numa code than what is in -tip right now.
Once Peter pushes the fixes by Larry and me into -tip, as well as
his cpu-follows-memory code, others should be able to run tests
like this as well.

Now for some other workloads, and tests on 8-node systems, etc...


Full results for the specjbb run below:

BASELINE - disabling auto numa (matches RHEL6 within 1%)

[root@perf74 SPECjbb]# cat r7_36_auto27_specjbb4_noauto.txt
spec1.txt: throughput = 243639.70 SPECjbb2005 bops
spec2.txt: throughput = 249186.20 SPECjbb2005 bops
spec3.txt: throughput = 247216.72 SPECjbb2005 bops
spec4.txt: throughput = 244035.60 SPECjbb2005 bops

Manual NUMACTL results are:

[root@perf74 SPECjbb]# more r7_36_numactl_specjbb4.txt
spec1.txt: throughput = 291430.22 SPECjbb2005 bops
spec2.txt: throughput = 283550.85 SPECjbb2005 bops
spec3.txt: throughput = 284028.71 SPECjbb2005 bops
spec4.txt: throughput = 282919.37 SPECjbb2005 bops

AUTONUMA27 - 3.6.0-0.24.autonuma27.test.x86_64
[root@perf74 SPECjbb]# more r7_36_auto27_specjbb4.txt
spec1.txt: throughput = 261835.01 SPECjbb2005 bops
spec2.txt: throughput = 269053.06 SPECjbb2005 bops
spec3.txt: throughput = 261230.50 SPECjbb2005 bops
spec4.txt: throughput = 274386.81 SPECjbb2005 bops

Tuned SCHED_NUMA from Friday 10/4/2012 with fixes from Peter, Rik and
Larry:

[root@perf74 SPECjbb]# more r7_36_schednuma_specjbb4.txt
spec1.txt: throughput = 222349.74 SPECjbb2005 bops
spec2.txt: throughput = 232988.59 SPECjbb2005 bops
spec3.txt: throughput = 223386.03 SPECjbb2005 bops
spec4.txt: throughput = 227438.11 SPECjbb2005 bops

--
All rights reversed.