2002-03-09 07:14:39

by Martin J. Bligh

Subject: 23 second kernel compile (aka which patches help scalability on NUMA)

"time make -j32 bzImage" is now down to 23 seconds.
(16 way NUMA-Q, 700MHz P3's, 4Gb RAM).

Below is a description of which patches helped get there.

Start (2.4.18)
47s
{make NUMA local memory allocation work}
memalloc-15setup (Pat Gaughen)
memalloc-16discont (Pat Gaughen)
pageallocnull fix + force CONFIG_NUMA (Martin Bligh)
27s
{O(1) scheduler}
sched-O1-2.4.18-pre8-K3.patch (Ingo Molnar)
25s
{NUMA scheduler}
numaK3.patch (Mike Kravetz)
24s
{dcache cacheline bouncing fixes}
dcache/fast_walkA2-2.4.18.patch (Hanna Linder)
23s

Applying Ingo's patch alone took the time from 47s to 30s.
The benefits on top of the local memory patches aren't quite
as stunning, but are still good.
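
As a rough illustration of what "local memory allocation" means here,
a hedged userspace sketch using libnuma (a later tool; the patches
above do the equivalent placement inside the kernel's page allocator):

/* Allocate memory physically on a chosen node, so that node's CPUs
 * never pay the remote-access latency. Build with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t len = 1 << 20;
    /* place 1MB of memory on node 0 */
    char *buf = numa_alloc_onnode(len, 0);
    if (!buf)
        return 1;
    buf[0] = 1;      /* touch it; the backing pages live on node 0 */
    numa_free(buf, len);
    return 0;
}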

Top 10 profile hitters left (readprofile output: ticks, function, ticks per byte):

21439 total 0.0228
9112 default_idle 175.2308
3364 _text_lock_swap 62.2963
790 lru_cache_add 8.5870
750 _text_lock_namei 0.7184
587 do_anonymous_page 1.7681
572 lru_cache_del 26.0000
569 do_generic_file_read 0.5117
510 __free_pages_ok 0.9733
421 _text_lock_dec_and_lock 17.5417
318 _text_lock_read_write 2.6949

Big locks left (lockstat output; hold times as mean(max), then wait times as mean(max)):

pagemap_lru_lock
20.2% 57.1% 5.4us( 86us) 111us( 16ms)(14.7%) 1014988 42.9% 57.1% 0%

pagecache_lock
17.5% 31.3% 7.5us( 99us) 52us(4023us)( 2.4%) 631988 68.7% 31.3% 0%

others:
dcache_lock (much improved, but still work to be done)
BKL (isn't it always ;-)

Planned work next:

1. Try John Stultz's mcslocks
(note high max wait vs low max hold currently)
2. Try rmap + pagemap_lru_breakup from Arjan
3. Try radix tree pagecache.
4. Try grafting NUMA-Q page local alloc onto -aa tree
5. Try SGI NUMA zone ordering stuff.
6. [HARD] Break up ZONE_NORMAL between nodes
(all currently on node 0).

Any other suggestions are welcome. I'd also be interested
to know if 23s is fast for make bzImage, or if other big
iron machines can kick this around the room.

Thanks,

Martin.


2002-03-09 16:43:33

by Erik Andersen

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Fri Mar 08, 2002 at 09:47:04PM -0800, Martin J. Bligh wrote:
> "time make -j32 bzImage" is now down to 23 seconds.
> (16 way NUMA-Q, 700MHz P3's, 4Gb RAM).
[-----------snip---------]
> Any other suggestions are welcome. I'd also be interested

I suggest that you should give me your computer. ;-)

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2002-03-09 17:52:41

by Martin J. Bligh

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

--On Saturday, March 09, 2002 9:43 AM -0700 Erik Andersen <[email protected]> wrote:
> On Fri Mar 08, 2002 at 09:47:04PM -0800, Martin J. Bligh wrote:
>> "time make -j32 bzImage" is now down to 23 seconds.
>> (16 way NUMA-Q, 700MHz P3's, 4Gb RAM).
> [-----------snip---------]
>> Any other suggestions are welcome. I'd also be interested
>
> I suggest that you should give me your computer. ;-)

There's a very similar machine that's publicly available
in the OSDL (http://www.osdlab.org). I don't think they'll
let you take it home, but access is half way there ;-)

M.

2002-03-09 18:38:12

by Fabio Massimo Di Nitto

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Fri Mar 08, 2002 at 09:47:04PM -0800, Martin J. Bligh wrote:

>"time make -j32 bzImage" is now down to 23 seconds.
>(16 way NUMA-Q, 700MHz P3's, 4Gb RAM).

Hmmm, strange... the last time I compiled a kernel on my m68k, 23 seconds
in it was still trying to perform a CR after I pressed enter :)))
Are you sure you're not running too fast??? ;)

2002-03-09 19:45:12

by Dieter Nützel

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Saturday, 9 March 2002 05:47:04, Martin J. Bligh wrote:
> "time make -j32 bzImage" is now down to 23 seconds.
> (16 way NUMA-Q, 700MHz P3's, 4Gb RAM).

I want such a beast, too:-)))

[-]
Planned work next:

1. Try John Stultz's mcslocks
(note high max wait vs low max hold currently)
2. Try rmap + pagemap_lru_breakup from Arjan
3. Try radix tree pagecache.
4. Try grafting NUMA-Q page local alloc onto -aa tree
5. Try SGI NUMA zone ordering stuff.
6. [HARD] Break up ZONE_NORMAL between nodes
(all currently on node 0).
[-]

No flamewar intended, but shouldn't you start with 4. and 5.?
-aa is the way to go for the 2.4.18+ tree. -rmap later for 2.5.x.

Have you tried the OOM case?
vm_29 and before fixed it for me.
Throughput is much improved with -aa.

Have you checked latency?
I found weird behavior of the latest O(1)-K3 with latencytest0.42-png and higher
latency than with clean 2.4.18.
Do you have some former O(1) versions around? Ingo removed them from his
archive.

Preemption?

Running 2.4.19-pre2-dn1 :-)
Taken from -jam3:
00-vm-29
01-vm-io-3
10-x86-fast-pte-1
11-spinlock-cacheline-3
12-clone-flags
20-sched-O1-K3
21-sched-balance
22-sched-aa-fixes
23-lowlatency-mini
24-read-latency-2
30-aic7xxx-6.2.5
31-ide-20020215
all latest ReiserFS stuff 2.4.18.pending
preempt-kernel-rml-2.4.18-rc1-ingo-K3-1.patch
lock-break-rml-2.4.18-1

Regards,
Dieter

BTW, anyone out there have a copy of the mem "test" prog handy?
I've accidentally removed one of my development folders...

Would be nice to see some "Hammer" systems from IBM next winter;-)

--
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: [email protected]

2002-03-09 20:19:00

by Martin J. Bligh

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

> [-]
> Planned work next:
>
> 1. Try John Stultz's mcslocks
> (note high max wait vs low max hold currently)
> 2. Try rmap + pagemap_lru_breakup from Arjan
> 3. Try radix tree pagecache.
> 4. Try grafting NUMA-Q page local alloc onto -aa tree
> 5. Try SGI NUMA zone ordering stuff.
> 6. [HARD] Break up ZONE_NORMAL between nodes
> (all currently on node 0).
> [-]
>
> No flamewar intended, but shouldn't you start with 4. and 5.?
> -aa is the way to go for the 2.4.18+ tree.

The ordering reflects both the difficulty of doing it, and
the expected payoff. For instance, I expect the mcslocks to
be dead easy to install, and give a reasonable payoff.
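
For reference, the idea behind the mcslocks: a queue lock where each
waiter spins on its own cache line instead of the shared lock word, so
a contended lock stops bouncing one line across all 16 CPUs and hands
off in FIFO order. A hedged userspace sketch with C11 atomics
(illustrative only, not John's actual patch):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

typedef _Atomic(struct mcs_node *) mcs_lock_t;   /* tail of waiter queue */

void mcs_lock(mcs_lock_t *lock, struct mcs_node *self)
{
    atomic_store(&self->next, NULL);
    atomic_store(&self->locked, true);
    /* atomically make ourselves the new tail of the queue */
    struct mcs_node *prev = atomic_exchange(lock, self);
    if (prev) {
        atomic_store(&prev->next, self);
        while (atomic_load(&self->locked))
            ;   /* spin on our OWN cache line, not the lock word */
    }
}

void mcs_unlock(mcs_lock_t *lock, struct mcs_node *self)
{
    struct mcs_node *next = atomic_load(&self->next);
    if (!next) {
        struct mcs_node *expected = self;
        /* nobody queued behind us: reset the tail and we're done */
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;
        /* a waiter is mid-enqueue; wait for it to link itself in */
        while (!(next = atomic_load(&self->next)))
            ;
    }
    atomic_store(&next->locked, false);   /* FIFO handoff to successor */
}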

I tried (2) this morning; it deadlocks during boot. I'll look
at fixing it, but it'll move down my list because it's now
harder ;-)

(6) would be a good thing to do - at the moment the page
structs for all nodes sit on node 0. The interconnect has
caches on it, so this isn't as bad as it sounds. But I
expect changing the assumption that ZONE_NORMAL == phys < 896Mb
to cause some pain.
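
Roughly where that 896Mb comes from, as a sketch of the 2.4-era i386
constants (numbers from memory, for illustration):

#include <stdio.h>

#define KERNEL_SPACE    (1UL << 30)      /* 3G/1G user/kernel split */
#define VMALLOC_RESERVE (128UL << 20)    /* top 128Mb kept for vmalloc */
#define MAXMEM          (KERNEL_SPACE - VMALLOC_RESERVE)

int main(void)
{
    /* prints 896: ZONE_NORMAL is the directly-mapped phys < 896Mb */
    printf("%lu Mb\n", MAXMEM >> 20);
    return 0;
}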

> -rmap later for 2.5.x.

rmap has the huge advantage that it's much easier to split
up the pagemap_lru_lock per zone, do per node kswapd without
much remote referencing, etc. Remember, this is NUMA with a
remote:local mem latency of 10:1 to 20:1. Non-local access
hurts. If we can fix some of the scaling problems with rmap,
I expect that to be the real way to fix some of the harder
"global stuff is bad" problems.

> Have you tried the OOM case?
> vm_29 and before fixed it for me.
> Throughput is much improved with -aa.

I've not tried OOM really. The problem with porting to the
-aa tree is that it changes a whole pile of stuff at once, in the
same area as Pat's discontigmem support stuff. It also changes
the way zone fallbacks for NUMA are done - I had to spend a
day fixing that for the main tree already ... I'd like to try
some other stuff as well. The -aa tree also seems to be
incompatible (or rather, not trivially fixable) with the O(1)
scheduler.

> Have you checked latency?
> I found weird behavior of the latest O(1)-K3 with latencytest0.42-png and higher
> latency than with clean 2.4.18.

I'm not sure latency is as high up the list as locking for a
large backend server. At least we're doing *something* at the
time rather than spinning. From my own personal perception,
akpm's low latency stuff is preferable to preempt. I'd be
interested in arguments against this ...

> Do you have some former O(1) versions around? Ingo removed them from his
> archive.

I have J6 somewhere. Have you isolated which change he made
that caused latency problems?

> Preemption?

see above.

> Running 2.4.19-pre2-dn1 :-)

All sounds interesting apart from aic7xxx and ide, which I don't have.

> BTW, anyone out there have a copy of the mem "test" prog handy?
> I've accidentally removed one of my development folders...
>
> Would be nice to see some "Hammer" systems from IBM next winter;-)

Not sure whether we're doing Hammer yet or not (IBM is huge,
and I'm in a different division), but I'd love to see a large
Hammer system too. This is the "old" Sequent hardware, and
tops out at a 900MHz P3 (I think). I should be able to build
up to a 64 proc machine w/ 64Gb out of this stuff (if I can
scrounge up the parts ;-) )

M.

2002-03-10 09:27:28

by Samuel Ortiz

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Sat, 9 Mar 2002, Martin J. Bligh wrote:

> > -rmap later for 2.5.x.
>
> rmap has the huge advantage that it's much easier to split
> up the pagemap_lru_lock per zone, do per node kswapd without
> much remote referencing, etc. Remeber this is NUMA with a
> remote:local mem latency of 10:1 to 20:1. Non-local access
> hurts. If we can fix some of the scaling problems with rmap,
> I expect that to be the real way to fix some of the harder
> "global stuff is bad" problems.
Martin, I wrote a patch in order to have a kswap daemon per node. Each
daemon swaps pages out only from its node. It might be of some interest
for your scalability problem, so let me know if you're interested in it (I
can't paste it here because it also has some other stuff in it, and I
have to split the patch into several parts. I also need to port it to -rmap).
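
A hedged skeleton of the idea (pthreads standing in for kernel
threads; names are illustrative, not the actual patch):

#include <pthread.h>
#include <stdio.h>

#define NR_NODES 4

struct node {
    int id;
    /* in the kernel: this node's zones, LRU lists and free-page
     * watermarks would live here */
};

/* one kswapd per node, reclaiming only from its own node's memory,
 * so no daemon ever takes the remote-access latency hit */
static void *kswapd_node(void *arg)
{
    struct node *n = arg;
    printf("kswapd%d: would reclaim only from node %d\n", n->id, n->id);
    return NULL;
}

int main(void)
{
    static struct node nodes[NR_NODES];
    pthread_t tid[NR_NODES];

    for (int i = 0; i < NR_NODES; i++) {
        nodes[i].id = i;
        pthread_create(&tid[i], NULL, kswapd_node, &nodes[i]);
    }
    for (int i = 0; i < NR_NODES; i++)
        pthread_join(tid[i], NULL);
    return 0;
}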

Cheers,
Samuel.

2002-03-10 17:12:06

by Andrea Arcangeli

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Sat, Mar 09, 2002 at 12:19:13PM -0800, Martin J. Bligh wrote:
> some other stuff as well. The -aa tree also seems to be
> incompatible (or rather, not trivially fixable) with the O(1)
> scheduler.

To apply the O(1) scheduler you only need to back out the dyn-sched and
numa-sched patches first (dyn-sched will definitely be obsoleted by the
O(1) scheduler; numa-sched should be changed as Mike described a few
weeks ago, but O(1) will probably just work better than my current
numa-sched). There are no other changes to the scheduler: child-first
is an optimization and parent-timeslice is an important bugfix.

Andrea

2002-03-10 17:13:45

by Martin J. Bligh

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

> Martin, I wrote a patch in order to have a kswap daemon per node. Each
> daemon swaps pages out only from its node. It might be of some interest
> for your scalability problem, so let me know if you're interested in it (I
> can't paste it here because it also has some other stuff in it, and I

How does this interact with the virtual scanning stuff? I
was under the impression that we scanned for suitable pages
on a per-process basis ... so I'm confused as to how you'd
have a per-node kswapd without rmap's physical scanning
(unless you assume all processes on a node have all their
mem on that node). Could you explain?

> have to split the patch into several parts. I also need to port it to -rmap).

I'd certainly be interested to see / try it - I think Bill Irwin
had an implementation of multiple kswapd's for rmap - you might
want to look at that before you port.

Thanks,

M

2002-03-11 02:13:55

by Andrea Arcangeli

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Fri, Mar 08, 2002 at 09:47:04PM -0800, Martin J. Bligh wrote:
> Big locks left:
>
> pagemap_lru_lock
> 20.2% 57.1% 5.4us( 86us) 111us( 16ms)(14.7%) 1014988 42.9% 57.1% 0%

I think this is only due to the lru_cache_add executed by the anonymous
pagefaults. Pagecache should stay in the lru constantly if you're
running with a hot pagecache, as I guess you are. For a workload like
this one, keeping anon pages out of the lru would be an obvious win. The
only reason we put anon pages into the lru before they are converted to
swapcache is to get nicer swapout behaviour, but you're certainly not
swapping out anything. It's a tradeoff. Just like the additional
memory/cpu and locking overhead that rmap requires will slow down page
faults even more than what you see now, with the only objective of
getting nicer pageout behaviour (modulo the ram-binding "migration"
stuff, where rmap is mandatory to do it instantly and not over time). If
we didn't care about getting nice swapout behaviour, workloads like a
kernel compile could be sped up and scaled much better, but for general
purpose use we don't want to slow to a crawl when swapping activity
becomes necessary.
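
A runnable toy model of the serialization described above (the names
echo the kernel's, but everything here is a userspace stand-in):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct page { struct page *lru_next; };

/* one global lock and one global LRU, as in 2.4 */
static pthread_mutex_t pagemap_lru_lock = PTHREAD_MUTEX_INITIALIZER;
static struct page *lru_head;
static long lru_pages;

static void lru_cache_add(struct page *page)
{
    pthread_mutex_lock(&pagemap_lru_lock);   /* the contention point */
    page->lru_next = lru_head;
    lru_head = page;
    lru_pages++;
    pthread_mutex_unlock(&pagemap_lru_lock);
}

/* toy anonymous-fault path: allocate a page and put it on the LRU,
 * which is exactly the lru_cache_add cost pointed at above */
static void *fault_storm(void *unused)
{
    (void)unused;
    for (int i = 0; i < 100000; i++) {
        struct page *page = calloc(1, sizeof(*page));
        lru_cache_add(page);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[16];
    for (int i = 0; i < 16; i++)
        pthread_create(&t[i], NULL, fault_storm, NULL);
    for (int i = 0; i < 16; i++)
        pthread_join(t[i], NULL);
    printf("%ld anon pages funneled through one lock\n", lru_pages);
    return 0;
}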

> Any other suggestions are welcome. I'd also be interested
> to know if 23s is fast for make bzImage, or if other big
> iron machines can kick this around the room.

It's also a big function of the .config, the compiler and the kernel
source (and whether you include make dep too or not).

For my part, my record kernel compile has been at LANL on a 32-cpu
wildfire with a 2.4.3-aa kernel IIRC (it just had the basic numa
scheduler optimizations); it took 37 seconds IIRC (with a quite generic
.config that could be used on most alphas except for the
CONFIG_WILDFIRE that was required at that time, but of course not with
all the possible drivers out there included). With Ingo's scheduler and
the other enhancements that happened during 2.4, it probably won't go
down to the teens, but it should get into the low twenties, I believe.
Also, 32-way scalability on a kernel compile cannot be exploited
completely unless the .config is very full, like the ones used by
distributions. Last but not least, the output was scrolling so fast on
the VGA console that I guess redirecting the output to >/dev/null may
save a few percent too :). And of course that was with an alpha target,
not an x86 target, so it's not comparable, also because of that last
variable. I think your 23 seconds figure looks very nice.

Andrea

2002-03-11 02:24:25

by Rik van Riel

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Mon, 11 Mar 2002, Andrea Arcangeli wrote:
> On Fri, Mar 08, 2002 at 09:47:04PM -0800, Martin J. Bligh wrote:
> > Big locks left:
> >
> > pagemap_lru_lock
> > 20.2% 57.1% 5.4us( 86us) 111us( 16ms)(14.7%) 1014988 42.9% 57.1% 0%
>
> I think this is only due to the lru_cache_add executed by the anonymous
> pagefaults. Pagecache should stay in the lru constantly if you're
> running with a hot pagecache, as I guess you are. For a workload like
> this one, keeping anon pages out of the lru would be an obvious win.

... but only if you're really dealing with anonymous pages.
I suspect that people will use NUMA machines more for workloads
where most pages belong to mappings, because if a scientific
calculation can be split out to a cluster you don't need the
cost of NUMA hardware.

Not sure if my guess is right though ;)

> It's a tradeoff. Just like the additional memory/cpu and locking
> overhead that rmap requires will slow down page faults even more than
> what you see now, with the only objective of getting nicer pageout
> behaviour (modulo the ram-binding "migration" stuff, where rmap is
> mandatory to do it instantly and not over time).

Rmap will also make it possible to have the lru lock per
zone (or per node), which should give rather nice behaviour
for large SMP and NUMA systems ... even if the workload
isn't made up of anonymous pages ;)

Btw, what is the "ram binding migration stuff" you are
talking about and why would rmap not be able to do it in
a nice way?

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-11 04:12:00

by Andrea Arcangeli

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On Sun, Mar 10, 2002 at 11:23:47PM -0300, Rik van Riel wrote:
> On Mon, 11 Mar 2002, Andrea Arcangeli wrote:
> > On Fri, Mar 08, 2002 at 09:47:04PM -0800, Martin J. Bligh wrote:
> > > Big locks left:
> > >
> > > pagemap_lru_lock
> > > 20.2% 57.1% 5.4us( 86us) 111us( 16ms)(14.7%) 1014988 42.9% 57.1% 0%
> >
> > I think this is only due to the lru_cache_add executed by the anonymous
> > pagefaults. Pagecache should stay in the lru constantly if you're
> > running with a hot pagecache, as I guess you are. For a workload like
> > this one, keeping anon pages out of the lru would be an obvious win.
>
> ... but only if you're really dealing with anonymous pages.

That's what the workload does, yes. The rest will stay persistent in
pagecache because there's enough ram. My comments only meant to
explain _where_ the collisions happen and why we are adding anon pages
to the lru even before converting them to swapcache; I'm not saying we
need to change that part.

> > It's a tradeoff. Just like the additional memory/cpu and locking
> > overhead that rmap requires will slow down page faults even more than
> > what you see now, with the only objective of getting nicer pageout
> > behaviour (modulo the ram-binding "migration" stuff, where rmap is
> > mandatory to do it instantly and not over time).
>
> Rmap will also make it possible to have the lru lock per
> zone (or per node), which should give rather nice behaviour
> for large SMP and NUMA systems ... even if the workload
> isn't made up of anonymous pages ;)

I don't see the relation with rmap. rmap only makes the immediate
migration with strong numa memory bindings possible, and it decreases
the complexity of the pageout load, but it has nothing to do with the
per-node lru lock; that's an orthogonal problem.

> Btw, what is the "ram binding migration stuff" you are
> talking about and why would rmap not be able to do it in
> a nice way?

Actually I was trying to say you _need_ rmap to do the ram binding
migration stuff on numa :), not the other way around. Without full rmap
we can only trivially provide weak bindings, but not strong-migration
bindings. We already have the rmap information for all the file
mappings; only anonymous memory and shm aren't covered by the rmap
information in 2.[45] mainline, so we could as well break the COW while
migrating the pages instead of collecting the rmap stuff for anon pages
too (and shm could be associated with an internal file).

But I think the main point is how much more efficient the direct chain
to the pte is, rather than having to walk the pgd/pmd every time (even
if that would be an O(1) operation too for every mapping in the list).
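
A hedged sketch of that data structure (stand-in types, not the
actual -rmap patch):

#include <stddef.h>

typedef unsigned long pte_t;          /* stand-in for the kernel type */

/* each physical page keeps a chain of the PTEs that map it */
struct pte_chain {
    struct pte_chain *next;
    pte_t *ptep;
};

struct page {
    struct pte_chain *pte_chain;      /* head of the reverse-map chain */
};

/* unmapping becomes a walk of the page's own chain: one step per
 * mapping, with no pgd/pmd traversal at all */
static void try_to_unmap(struct page *page)
{
    for (struct pte_chain *pc = page->pte_chain; pc; pc = pc->next)
        *pc->ptep = 0;                /* toy "clear the PTE" */
}

int main(void)
{
    pte_t pte1 = 42, pte2 = 42;
    struct pte_chain c2 = { NULL, &pte2 };
    struct pte_chain c1 = { &c2,  &pte1 };
    struct page page = { &c1 };
    try_to_unmap(&page);
    return (int)(pte1 | pte2);        /* 0: both mappings cleared */
}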

Andrea

2002-03-11 06:49:12

by Denis Vlasenko

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

On 9 March 2002 03:47, Martin J. Bligh wrote:
> "time make -j32 bzImage" is now down to 23 seconds.
> (16 way NUMA-Q, 700MHz P3's, 4Gb RAM).
...
> Any other suggestions are welcome. I'd also be interested
> to know if 23s is fast for make bzImage, or if other big
> iron machines can kick this around the room.

I'm curious how long "time make -j32 bzImage" takes on your setup
when:
1) only one node is enabled,
2) only one CPU is enabled?

This will give you a clue how close you are to 'perfect' scalability
(with perfect scaling, the single-CPU time would be about 16 x 23s = 368s).
--
vda

2002-03-11 18:25:16

by Timothy D. Witham

Subject: Re: 23 second kernel compile (aka which patches help scalability on NUMA)

Ours is only a 16 way 500MHz machine but I do have 16 GB of memory
and we could stripe stuff across 80 disk drives. :-)

Tim

On Sat, 2002-03-09 at 09:53, Martin J. Bligh wrote:
> --On Saturday, March 09, 2002 9:43 AM -0700 Erik Andersen <[email protected]> wrote:
> > On Fri Mar 08, 2002 at 09:47:04PM -0800, Martin J. Bligh wrote:
> >> "time make -j32 bzImage" is now down to 23 seconds.
> >> (16 way NUMA-Q, 700MHz P3's, 4Gb RAM).
> > [-----------snip---------]
> >> Any other suggestions are welcome. I'd also be interested
> >
> > I suggest that you should give me your computer. ;-)
>
> There's a very similar machine that's publicly available
> in the OSDL (http://www.osdlab.org). I don't think they'll
> let you take it home, but access is half way there ;-)
>
> M.
>
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)