2002-10-24 02:20:08

by Rob Landley

Subject: Crunch time -- the musical. (2.5 merge candidate list 1.5)

Kernel hooks is back with new links. Also new versions of Linux Trace Toolkit
and sys_epoll. And new stuff from the 2.5 status list, and new stuff is STILL
showing up on linux-kernel. (Still no 2.5 patch for Alan's 32 bit dev_t,
though.)

Richard J. Moore has stepped up to defend "VM Large Page support",
which has become "hugetlb update". I don't know if this counts as
a new feature or a bugfix, but it's back...

Due to numerous complaints (okay, one, but technically that's a number),
I tried to reformat a bit to have a slightly less eye-searingly hideous
layout. And reorganized the -mm stuff to be together in one clump.

And so:

----------

Linus returns from the Linux Lunacy Cruise after Sunday, October 27th.
(See "http://www.geekcruises.com/itinerary/ll2_itinerary.html". He's
off to Jamaica, mon.)

The following features aim to be ready for submission to Linus by Monday,
October 28th, to be considered for inclusion (in 2.5.45) before the feature
freeze on Thursday, October 31 (Halloween). (L minus four days, and
counting...)

Note: if you want to submit a new entry to this list, PLEASE provide a URL
to where the patch can be found, and any descriptive announcement you think
useful (user space tools, etc). This doesn't have to be a web page devoted
to the patch; if the patch has been posted to linux-kernel, a URL to the post
on any linux-kernel archive site is fine.

If you don't know of one, a good site for looking at the threaded archive is:
http://lists.insecure.org/lists/linux-kernel/

A more searchable archive is available at:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&group=mlist.linux.kernel

This archive seems less likely to mangle your patch for cut and pasting
(especially if you click "raw download" at the top of the message),
although it's a real pain to actually try to read:
http://marc.theaimsgroup.com/?l=linux-kernel

This list is just pending features trying to get in before feature freeze.
It's primarily for features that need more testing, or might otherwise get
forgotten in the rush. If you want to know what's already gone in, or what's
being worked on for the next development cycle, check out
"http://kernelnewbies.org/status".

You can get Andrew Morton's MM tree here, including a broken-out patches
directory and a description file:

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.44

Alan Cox's -ac tree comes from here:

http://www.kernel.org/pub/linux/kernel/people/alan/

Thanks to Rusty Russell and Guillaume Boissiere, whose respective 2.5 merge
candidate lists have been ruthlessly strip-mined in the process of
assembling this. And to everybody who's emailed stuff.

And now, in no particular order:

============================ Pending features: =============================

1) New kernel configuration system (Roman Zippel)

Announcement:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6898.html

Code:
http://www.xs4all.nl/~zippel/lc/

Linus has actually looked fairly favorably on this one so far:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3250.html

----------------------------------------------------------------------------

2) ext2/ext3 extended attributes and access control lists (Ted Tso) (in -mm)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6787.html

Code:
bk://extfs.bkbits.net/extfs-2.5-update
http://thunk.org/tytso/linux/extfs-2.5
(Or just grab it from the -mm tree.)

(Considering that EA/ACL infrastructure is already in, and supported by XFS
and JFS, this one's pretty close to a shoo-in.)

----------------------------------------------------------------------------

3) Page table sharing (Daniel Phillips, Dave McCracken) (in -mm)

Announce:
http://www.geocrawler.com/mail/msg.php3?msg_id=7855063&list=35

Patch from the -mm tree:
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/broken-out/shpte-ng.patch

Ed Tomlinson seems to have a show-stopper bug for this one
(although he tells me in email he'd like to see it go in anyway):

http://lists.insecure.org/lists/linux-kernel/2002/Oct/7147.html

----------------------------------------------------------------------------

4) Improved Hugetlb support (Richard J. Moore) (in -mm tree)

(Dunno if this is exactly a feature, but giving it the benefit of the doubt...)

Description:
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/description

Patches (everything starting with "htlb" or "hugetlb"):
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.44/2.5.44-mm3/broken-out/

----------------------------------------------------------------------------

5) Generic Nonlinear Mappings (Ingo Molnar) (in -mm)

It's new, very close to deadline, needs testing and discussion. I'm still a
touch vague on what it actually does, but there's a thread.

Announcement, patch, and start of thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103530883511032&w=2

----------------------------------------------------------------------------

6) Linux Trace Toolkit (LTT) (Karim Yaghmour)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7016.html

Patch:
http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.44-vanilla-021022-2.2.bz2

User tools:
http://opersys.com/ftp/pub/LTT/TraceToolkit-0.9.6pre2.tgz

----------------------------------------------------------------------------

7) Device mapper for Logical Volume Manager (LVM2) (LVM2 team) (in -ac)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536883428443&w=2

Download:
http://people.sistina.com/~thornber/patches/2.5-stable/

Home page:
http://www.sistina.com/products_lvm.htm

----------------------------------------------------------------------------

8) EVMS (Enterprise Volume Management System) (EVMS team)

Home page:
http://sourceforge.net/projects/evms

----------------------------------------------------------------------------

9) Kernel Probes (IBM, contact: Vamsi Krishna S)

Kprobes announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528410215211&w=2

Base Kprobes Patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528425615302&w=2

KProbes->DProbes patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454215523&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454015520&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528485415813&w=2

Official IBM download site for most recent versions (gzipped
tarballs):
http://www-124.ibm.com/linux/patches/?project_id=141

See also the DProbes Home Page:
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes

A good explanation of the difference between kprobes, dprobes,
and kernel hooks is here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103532874900445&w=2

And a clarification: just kprobes is being submitted for
2.5.45, not the whole of dprobes:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103536827928012&w=2

----------------------------------------------------------------------------

10) High resolution timers (George Anzinger, etc.)

Home page:
http://high-res-timers.sourceforge.net/

Patch via evil sourceforge download auto-mirror thing:
http://prdownloads.sourceforge.net/high-res-timers/hrtimers-support-2.5.36-1.0.patch?download

Linus has unresolved concerns with this one, by the way:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3463.html

Note: The Google posix timer patch forwarded by Jim Houston is being
merged into this patch:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/8068.html

----------------------------------------------------------------------------

11) Linux Kernel Crash Dumps (Matt Robinson, LKCD team)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536576625905&w=2

Code:
http://lkcd.sourceforge.net/download/latest/

----------------------------------------------------------------------------

12) Rewrite of the console layer (James Simmons)

Home page:
http://linuxconsole.sourceforge.net/

Patch (unknown version; the home page only has a random CVS-du-jour link):
http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz

Bitkeeper tree:
http://linuxconsole.bkbits.net


----------------------------------------------------------------------------

13) Kexec, launch new Linux kernel from Linux (Eric W. Biederman)

Announcement with links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6584.html

And this thread is just too brazen not to include:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7952.html

----------------------------------------------------------------------------

14) USAGI IPv6 (Yoshifuji Hideyaki)

README:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/README.IPSEC

Patch:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/ipsec-2.5.43-ALL-03.patch.gz

----------------------------------------------------------------------------

15) MMU-less processor support (Greg Ungerer)

Announcement with lots of links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7027.html

----------------------------------------------------------------------------

16) sys_epoll (i.e. /dev/poll) (Davide Libenzi)

Home page:
http://www.xmailserver.org/linux-patches/nio-improve.html

Patch:
http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-0.7.diff

Linus participated repeatedly in a thread on this one too, expressing
concerns which (hopefully) have been addressed. See:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6428.html

----------------------------------------------------------------------------

17) CD Recording/sgio patches (Jens Axboe)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/8060.html

Patch:
http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.5/2.5.44/sgio-14b.diff.bz2

----------------------------------------------------------------------------

18) In-kernel module loader (Rusty Russell)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6214.html

Patch:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/module-x86-18-10-2002.2.5.43.diff.gz

----------------------------------------------------------------------------

19) Unified Boot/Module parameter support (Rusty Russell)

Note: depends on in-kernel module loader.

Huge disorganized heap 'o patches with no explanation:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/

----------------------------------------------------------------------------

20) Hotplug CPU Removal (Rusty Russell)

Even bigger, more disorganized Heap 'o patches:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Hotplug/

----------------------------------------------------------------------------

21) Unlimited groups patch (Tim Hockin)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761319825&w=2

Patch set:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524717119443&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761819834&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761619831&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761519829&w=2

----------------------------------------------------------------------------

22) Initramfs (Al Viro)

Way back when, Al said:
http://www.cs.helsinki.fi/linux/linux-kernel/2001-30/0110.html

I THINK this is the most recent patch:
ftp://ftp.math.psu.edu/pub/viro/N0-initramfs-C40

And Linus recently made happy noises about the idea:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/1110.html

----------------------------------------------------------------------------

23) Kernel Hooks (IBM contact: Vamsi Krishna S.)

Website:
http://www-124.ibm.com/linux/projects/kernelhooks/

Download site:
http://www-124.ibm.com/linux/patches/?patch_id=595

Posted patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103364774926440&w=2

----------------------------------------------------------------------------

24) NMI request/release interface (Corey Minyard)

He says:
> Add a request/release mechanism to the kernel (x86 only for now) for NMIs.
...
> I have modified the nmi watchdog to use this interface, and it
> seems to work ok. Keith Owens is copied to see if he would be
> interested in converting kdb to use this, if it gets put into the kernel.

The latest patch so far:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540434409894&w=2

----------------------------------------------------------------------------

25) Digital Video Broadcasting Layer (LinuxTV team)

Home page:
http://www.linuxtv.org:81/dvb/

Download:
http://www.linuxtv.org:81/download/dvb/

----------------------------------------------------------------------------

26) NUMA aware scheduler extensions (Erich Focht, Michael Hohnbaum)

Home page:
http://home.arcor.de/efocht/sched/

Patch:
http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

----------------------------------------------------------------------------

27) DriverFS Topology (Matthew Dobson)

Announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103523702710396&w=2

Patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540707113401&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757613962&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540758013984&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757513957&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757813966&w=2

----------------------------------------------------------------------------

28) Advanced TCA Disk Hotswap (Steven Dake)

At the last minute, Steven Dake submitted (and if he'd cc'd the list, I could
have linked to this message as the announcement, hint hint...):

> Please add to your 2.5.45 list:
>
> "Advanced TCA Disk Hotswap".
>
> This is a generic feature that provides good hotswap support for SCSI
> and FibreChannel disk devices. The entire SCSI layer has been properly
> analyzed to provide correct locking and a complete RAMFS filesystem is
> available to control the kernel disk hotswap operations.
>
> Both Alan Cox and Greg KH have looked at the patch for 2.4 and suggested
> if I ported to 2.5 and made some changes (as I have in the latest port)
> this feature would be a good candidate for the 2.5 kernel.
>
> The sourceforge site for the latest patches is:
> https://sourceforge.net/projects/atca-hotswap/
>
> The lkml announcement for this latest port is:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103541572622729&w=2
>
> A thread discussing Advanced TCA hotswap (of which this partch is one
> part of) can be found at:
> http://marc.theaimsgroup.com/?t=103462115700001&r=1&w=2
>
> Thanks!
> -steve


======================== Unresolved issues: =========================

1) hyperthread-aware scheduler
2) connection tracking optimizations.

No URLs to patch. Anybody want to come out in favor of these
with an announcement and pointer to a version being suggested
for inclusion?

3) IPSEC (David Miller, Alexey)
4) New CryptoAPI (James Morris)

David S. Miller said:

> No URLs, being coded as I type this :-)
>
> Some of the ipv4 infrastructure is in 2.5.44

Note: this may conflict with Yoshifuji Hideyaki's ipv6 ipsec stuff. (If not,
I'd like to collate or clarify the entries.) USAGI ipv6 is in the first
section and this isn't because I have a URL to an existing patch for
USAGI, and don't for this one. I have no idea how much overlap there is
between these projects, or whether they're considered parts of the
same project or submitted individually...

5) ReiserFS 4

Hans Reiser said:

> We will send Reiser4 out soon, probably around the 27th.
>
> Hans

See also http://www.namesys.com/v4/fast_reiser4.html

Hans and Jens Axboe are arguing about whether or not Reiser4 is a
potential post-freeze addition. That thread starts here:

http://lists.insecure.org/lists/linux-kernel/2002/Oct/7140.html

6) 32bit dev_t

Alan Cox said:

> The big one missing is 32bit dev_t. Thats the killer item we have left.

But he did not provide a URL to a patch. Presumably it's in his tree and
can be extracted from it, so I guess it's already in good hands? (I dunno,
ask him.)

He also mentioned:

> Oh other one I missed - DVB layer - digital tv etc. Pretty much
> essential now for europe, but again its basically all driver layer

But it's not clear whether this is an item that must go in before the
feature freeze or not at all, which is what this list tries to focus on.

Then Dan Kegel pointed out:

> One possible page to quote for 32 bit dev_t:
> http://lwn.net/Articles/11583/

7) Online EXT3 resize support:

A thread over whether or not this is self-contained enough and low
enough impact to go in after the feature freeze starts here:

http://lists.insecure.org/lists/linux-kernel/2002/Oct/7680.html

I mention it just in case it isn't. (We've had offline EXT3 resize for
a while; this is apparently twiddling a mounted partition without
unplugging it first, or even wearing rubber boots.)

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?


2002-10-24 16:12:49

by Michael Hohnbaum

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

On Wed, 2002-10-23 at 14:26, Rob Landley wrote:

> 26) NUMA aware scheduler extenstions (Erich Focht, Michael Hohnbaum)
>
> Home page:
> http://home.arcor.de/efocht/sched/
>
> Patch:
> http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

The simple NUMA scheduler patch, which is ready for inclusion, is a
separate project from Erich's NUMA scheduler extensions. Information
on the simple NUMA scheduler is contained in these lkml postings:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103351680614980&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103480772901235&w=2

The most recent version has been split into two patches for 2.5.44:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103539626130709&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540481010560&w=2

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-10-24 18:56:14

by Michael Hohnbaum

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

On Thu, 2002-10-24 at 05:50, Rob Landley wrote:
> On Thursday 24 October 2002 11:17, Michael Hohnbaum wrote:
> > On Wed, 2002-10-23 at 14:26, Rob Landley wrote:
> > > 26) NUMA aware scheduler extenstions (Erich Focht, Michael Hohnbaum)
> > >
> > > Home page:
> > > http://home.arcor.de/efocht/sched/
> > >
> > > Patch:
> > > http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch
> >
> > The simple NUMA scheduler patch, which is ready for inclusion is a
> > separate project from Erich's NUMA scheduler extensions. Information
> > on the simple NUMA scheduler is contained in this lkml posting:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103351680614980&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103480772901235&w=2
> >
> > The most recent version has been split into two patches for 2.5.44:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103539626130709&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103540481010560&w=2
>
> Any relation to http://lse.sourceforge.net/numa/ which the 2.5 status list
> says is "Alpha" state, two steps down from "Ready"?
>
> Rob

Yes and no. At one point I was working with Erich moving his NUMA
scheduler to 2.5 and testing it on our NUMA hardware. However, it
was not looking like his NUMA scheduler was going to be ready for
2.5, so I went off on a separate effort to produce a much smaller,
simpler patch to provide rudimentary NUMA support within the scheduler.
This patch does not have all the functionality of Erich's, but does
provide definite performance improvements on NUMA machines with no
degradation on non-NUMA SMP. It is much smaller and less intrusive,
and has been tested on multiple NUMA architectures (including by
Erich on the NEC IA64 NUMA box).

The 2.5 status list has not been updated to reflect this separate
effort, and I believe incorrectly lists this entry as "ready". There
really are now two NUMA scheduler projects:

* Simple NUMA scheduler (Michael Hohnbaum) - ready for inclusion
* Node affine NUMA scheduler (Erich Focht) - Alpha (Beta?)

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-10-24 21:45:57

by Erich Focht

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

Hi Rob and Michael,

I need to correct some inaccuracies and, of course, advertise my approach
:-)

On Thursday 24 October 2002 21:01, Michael Hohnbaum wrote:
> > > > 26) NUMA aware scheduler extenstions (Erich Focht, Michael Hohnbaum)
> > > >
> > > > Home page:
> > > > http://home.arcor.de/efocht/sched/
> > > >
> > > > Patch:
> > > > http://home.arcor.de/efocht/sched/Nod20_numa_sched-2.5.31.patch

These are old. I posted the newer patches (split up in order to clearly
separate the functionality additions) to LKML:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103459387719030&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459387519026&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441119407&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441319411&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103459441419416&w=2
They should work for any NUMA platform by just adding a call to
build_pools() in smp_cpus_done(). They work for non-NUMA platforms
the same way as the O(1) scheduler (though the code looks different).
A test overview is in: http://lwn.net/Articles/12546/
This suggests that taking only patches 01+02 already gives you a VERY
good NUMA scheduler. They deliver the infrastructure for later
developments (patches 03+05) which we can further research and tune or
give only to special customers.
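
(As a minimal sketch -- not code from these patches -- this is roughly how
a NUMA platform would wire that in, assuming Erich's build_pools() entry
point and the usual 2.5-era smp_cpus_done() arch hook:)

    /* Hypothetical example, not taken from the patches: enable the
     * per-node CPU pools once all CPUs have been brought up. */
    void __init smp_cpus_done(unsigned int max_cpus)
    {
            /* ... existing arch-specific SMP bringup finalization ... */

            build_pools();  /* from the NUMA scheduler patch */
    }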

> The 2.5 status list has not been updated to reflect this separate
> effort, and I believe incorrectly lists this entry as "ready". There
> really are now two NUMA scheduler projects:
>
> * Simple NUMA scheduler (Michael Hohnbaum) - ready for inclusion
> * Node affine NUMA scheduler (Erich Focht) - Alpha (Beta?)
This is not correct. We have had the node affine scheduler in production
for 6 months on top of 2.4 kernels and are happy with it. It is a lot
more than alpha or beta; it already makes customers happy.

The situation is really funny: everybody seems to agree that the design
ideas in my NUMA approach are sane and exactly what we want to have on
a NUMA platform in the end. But instead of concentrating on tuning the
parameters for the many different NUMA platforms and reshaping this
approach to make it acceptable, IBM concentrates on a very much stripped
down approach. I understand that this project has been started to make
the inclusion of some NUMA scheduler easier. But in the end, the simple
NUMA scheduler will have to develop into a much more complex thing and
in some form or another replicate the design ideas of my node affine
scheduler. On machines with a poor NUMA ratio like NUMA-Q the simple
NUMA change helps. For machines with a good NUMA ratio like the NEC
Azusa and NEC TX7 you need a little bit more. AMD Hammer SMP and ppc64
are certainly in the same class as the Azusa/TX7. And as soon as Hammer
SMP systems are around, the pressure for a full-featured NUMA scheduler
will be much higher.

A NUMA scheduler extension of the 2.6 kernel fits very well with the
development effort done for better scalability and enterprise-level
fitness of Linux. Check http://lwn.net/Articles/12546/ to see that it
makes a difference to have more than O(1) on NUMA machines! I'd
definitely prefer the inclusion of my 01+02 patches (I'd have to
maintain less code to keep the customers happy); on the other hand,
including Michael's patch would be better than not adding NUMA
scheduler support at all.

Best regards,
Erich


2002-10-24 22:34:45

by Martin J. Bligh

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

> The situation is really funny: Everybody seems to agree that the design
> ideas in my NUMA aproach are sane and exactly what we want to have on
> a NUMA platform in the end. But instead of concentrating on tuning the
> parameters for the many different NUMA platforms and reshaping this
> aproach to make it acceptable, IBM concentrates on a very much stripped
> down aproach.

From my point of view, the reason for focussing on this was that
your scheduler degraded the performance on my machine, rather than
boosting it. Half of that was the more complex stuff you added on
top ... it's a lot easier to start with something simple that works
and build on it, than fix something that's complex and doesn't work
well.

I still haven't been able to get your scheduler to boot for about
the last month without crashing the system. Andrew says he has it
booting somehow on 2.5.44-mm4, so I'll steal his kernel tomorrow and
see how it looks. If the numbers look good for doing boring things
like kernel compile, SDET, etc, I'm happy.

M.

2002-10-25 00:20:41

by Jim Houston

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

Hi Rob,

The Posix timers entry in your list is confused. I don't know how
my patch got the name Google.

I think Dan Kegel misunderstood George's answer to my previous
announcement. George might be picking up some of my changes, but there
will still be two patches for Linus to choose from. You included the URL
to George's answer which quoted my patch, rather than the URL I sent you.

Here is the URL for an archived copy of my latest patch:
Jim Houston's [PATCH] alternate Posix timer patch3
http://marc.theaimsgroup.com/?l=linux-kernel&m=103549000027416&w=2

I would be happy to see either version go into 2.5.

The URLs for George's patches are incomplete. I believe this is the
most recent (it's from Oct 18). The Sourceforge.net reference has the
user space library and test programs, but I did not see 2.5 kernel
patches.

[PATCH ] POSIX clocks & timers take 3 (NOT HIGH RES)
http://marc.theaimsgroup.com/?l=linux-kernel&m=103489669622397&w=2

Thanks
Jim Houston - Concurrent Computer Corp.

2002-10-25 08:09:39

by Erich Focht

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

On Friday 25 October 2002 00:38, Martin J. Bligh wrote:
> > The situation is really funny: Everybody seems to agree that the design
> > ideas in my NUMA aproach are sane and exactly what we want to have on
> > a NUMA platform in the end. But instead of concentrating on tuning the
> > parameters for the many different NUMA platforms and reshaping this
> > aproach to make it acceptable, IBM concentrates on a very much stripped
> > down aproach.
>
> From my point of view, the reason for focussing on this was that
> your scheduler degraded the performance on my machine, rather than
> boosting it. Half of that was the more complex stuff you added on
> top ... it's a lot easier to start with something simple that works
> and build on it, than fix something that's complex and doesn't work
> well.

You're talking about one of the first 2.5 versions of the patch. It has
changed a lot since then, thanks to your feedback, too.

> I still haven't been able to get your scheduler to boot for about
> the last month without crashing the system. Andrew says he has it
> booting somehow on 2.5.44-mm4, so I'll steal his kernel tommorow and
> see how it looks. If the numbers look good for doing boring things
> like kernel compile, SDET, etc, I'm happy.

I thought this problem was well understood! For reasons independent of
my patch you have to boot your machines with the "notsc" option. This
leaves the cache_decay_ticks variable initialized to zero, which my patch
doesn't like. I'm trying to deal with this inside the patch but there is
still a small window when the variable is zero. In my opinion this needs
to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
with cache_decay_ticks=0 is pure nonsense, as it switches off cache
affinity, which you absolutely need! So even if "notsc" is a legal option,
it should be fixed such that it doesn't leave your machine without cache
affinity. That would give you a falsified behavior of the O(1) scheduler
anyway.

Erich


2002-10-25 15:17:50

by Kevin Corry

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

On Wednesday 23 October 2002 16:26, Rob Landley wrote:
> Due to numerous complaints (okay, one, but technically that's a number)
> tried to reformat a bit to have a slightly less eye-searingly hideous
> layout. And reorganized the -mm stuff to be together in one clump.
>
> And so:

> ......

> ---------------------------------------------------------------------------
>
> 8) EVMS (Enterprise Volume Management System) (EVMS team)
>
> Home page:
> http://sourceforge.net/projects/evms
>
> ---------------------------------------------------------------------------

Rob,

Can you please add the following links for the EVMS project:

Home page:
http://evms.sourceforge.net

Download:
http://evms.sourceforge.net/patches/

Some related discussions:
http://marc.theaimsgroup.com/?t=103359686900003&r=1&w=2
http://marc.theaimsgroup.com/?t=103439913000001&r=1&w=2
http://marc.theaimsgroup.com/?w=2&r=1&s=%5Bpatch%5D+evms+core&q=t

Thanks!
--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2002-10-25 17:52:40

by George Anzinger

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

Jim Houston wrote:
>
> Hi Rob,
>
> The Posix timers entry in your list is confused. I don't know how
> my patch got the name Google.
>
> I think Dan Kegel misunderstood George's answer to my previous
> announcement. George might be picking up some of my changes, but there
> will still be two patches for Linus to choose from. You included the URL
> to George's answer which quoted my patch, rather than the URL I sent you.
>
> Here is the URL for an archived copy of my latest patch:
> Jim Houston's [PATCH] alternate Posix timer patch3
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103549000027416&w=2
>
> I would be happy to see either version go into 2.5.
>
> The URLs for George's patches are incomplete. I believe this is the
> most recent (it's from Oct 18). The Sourceforge.net reference has the
> user space library and test programs, but I did not see 2.5 kernel
> patches.
>
> [PATCH ] POSIX clocks & timers take 3 (NOT HIGH RES)
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103489669622397&w=2

I would be very careful picking up patches from the
digests. Some of them have message size limits that cause
truncated patches. I know mine was truncated on the marc
digest. I will post the latest HRT patches on the project
sourceforge site.

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-25 23:25:24

by Martin J. Bligh

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

> You're talking about one of the first 2.5 versions of the patch. It
> changed a lot since then, thanks to your feedback, too.

Right. But I've been struggling to boot anything later than that ;-)

> I thought this problem is well understood! For some reasons independent of
> my patch you have to boot your machines with the "notsc" option. This
> leaves the cache_decay_ticks variable initialized to zero which my patch
> doesn't like. I'm trying to deal with this inside the patch but there is
> still a small window when the variable is zero. In my opinion this needs
> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
> affinity which you absolutely need! So even if "notsc" is a legal option,
> it should be fixed such that it doesn't leave your machine without cache
> affinity. That would anyway give you a falsified behavior of the O(1)
> scheduler.

OK, well, we seem to have it working on one machine, but not on another.
Those should be identical; I suspect it's a timing thing. I'm playing around
with the differences. The first major thing I noticed is that the working box
has gcc 3.1, and the non-working one gcc 2.95.4 (Debian woody). I suspect it's
a subtle timing thing, or something equally horrible.

Changing the non-working box to gcc 3.1 instead (which I *really* don't
want to do long term unless we prove there's a bug in 2.95 ... gcc 3.x
is disgustingly slow) resulted in it getting a little further, but it then
got the following oops ... does this provide any clues?

CPU 7 IS NOW UP!
Starting migration thread for cpu 7
Bringing up 8
CPU 8 IS NOW UP!
Starting migration thread for cpu 8
divide error: 0000

CPU: 4
EIP: 0060:[<c011ac38>] Not tainted
EFLAGS: 00010002
EIP is at task_to_steal+0x118/0x260
eax: 00000001 ebx: f01c5040 ecx: 00000000 edx: 00000000
esi: 00000063 edi: f01c5020 ebp: f0197ee8 esp: f0197eac
ds: 0068 es: 0068 ss: 0068
Process swapper (pid: 0, threadinfo=f0196000 task=f01bf060)
Stack: 00000000 f01b4120 00000000 c02ec940 f0197ed4 00000004 00000000 c02ecd3c
c02ec93c 00000000 00000001 0000007d c02ec4a0 00000001 00000004 f0197f1c
c011829c c02ec4a0 00000004 00000004 00000001 00000000 c39376c0 00000000
Call Trace:
[<c011829c>] load_balance+0x8c/0x140
[<c0118588>] scheduler_tick+0x238/0x360
[<c0123347>] tasklet_hi_action+0x77/0xc0
[<c0105420>] default_idle+0x0/0x50
[<c0126bd5>] update_process_times+0x45/0x60
[<c0113faa>] smp_apic_timer_interrupt+0x11a/0x120
[<c0105420>] default_idle+0x0/0x50
[<c010815e>] apic_timer_interrupt+0x1a/0x20
[<c0105420>] default_idle+0x0/0x50
[<c0105420>] default_idle+0x0/0x50
[<c010544a>] default_idle+0x2a/0x50
[<c01054ea>] cpu_idle+0x3a/0x50
[<c011db20>] printk+0x140/0x180

Code: f7 75 cc 8b 55 c8 83 f8 64 0f 4c f0 39 4d ec 8d 46 64 0f 44

This is 2.5.44-mm4 + your patches 1,2,3,5, I think.

M.

2002-10-25 23:47:50

by Rob Landley

Subject: highres timers question...

I'm guessing that of the patches here:

http://sourceforge.net/projects/high-res-timers

The -posix one adds posix support on top of the base high-res timers patch?

(Did I guess right?)

Rob

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?

2002-10-25 23:44:25

by Martin J. Bligh

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

> divide error: 0000
>
> CPU: 4
> EIP: 0060:[<c011ac38>] Not tainted
> EFLAGS: 00010002
> EIP is at task_to_steal+0x118/0x260
> eax: 00000001 ebx: f01c5040 ecx: 00000000 edx: 00000000
> esi: 00000063 edi: f01c5020 ebp: f0197ee8 esp: f0197eac
> ds: 0068 es: 0068 ss: 0068
> Process swapper (pid: 0, threadinfo=f0196000 task=f01bf060)
> Stack: 00000000 f01b4120 00000000 c02ec940 f0197ed4 00000004 00000000 c02ecd3c
> c02ec93c 00000000 00000001 0000007d c02ec4a0 00000001 00000004 f0197f1c
> c011829c c02ec4a0 00000004 00000004 00000001 00000000 c39376c0 00000000
> Call Trace:
> [<c011829c>] load_balance+0x8c/0x140
> [<c0118588>] scheduler_tick+0x238/0x360
> [<c0123347>] tasklet_hi_action+0x77/0xc0
> [<c0105420>] default_idle+0x0/0x50
> [<c0126bd5>] update_process_times+0x45/0x60
> [<c0113faa>] smp_apic_timer_interrupt+0x11a/0x120
> [<c0105420>] default_idle+0x0/0x50
> [<c010815e>] apic_timer_interrupt+0x1a/0x20
> [<c0105420>] default_idle+0x0/0x50
> [<c0105420>] default_idle+0x0/0x50
> [<c010544a>] default_idle+0x2a/0x50
> [<c01054ea>] cpu_idle+0x3a/0x50
> [<c011db20>] printk+0x140/0x180
>
> Code: f7 75 cc 8b 55 c8 83 f8 64 0f 4c f0 39 4d ec 8d 46 64 0f 44


Dump of assembler code for function task_to_steal:
0xc011ab20 <task_to_steal>: push %ebp
0xc011ab21 <task_to_steal+1>: mov %esp,%ebp
0xc011ab23 <task_to_steal+3>: push %edi
0xc011ab24 <task_to_steal+4>: push %esi
0xc011ab25 <task_to_steal+5>: push %ebx
0xc011ab26 <task_to_steal+6>: sub $0x30,%esp
0xc011ab29 <task_to_steal+9>: movl $0x0,0xffffffdc(%ebp)
0xc011ab30 <task_to_steal+16>: mov 0xc(%ebp),%eax
0xc011ab33 <task_to_steal+19>: movl $0x0,0xffffffe8(%ebp)
0xc011ab3a <task_to_steal+26>: mov 0x8(%ebp),%edx
0xc011ab3d <task_to_steal+29>: mov 0xc034afe0(,%eax,4),%eax
0xc011ab44 <task_to_steal+36>: sar $0x4,%eax
0xc011ab47 <task_to_steal+39>: mov %eax,0xffffffec(%ebp)
0xc011ab4a <task_to_steal+42>: mov 0x20(%edx),%eax
0xc011ab4d <task_to_steal+45>: mov (%eax),%esi
0xc011ab4f <task_to_steal+47>: test %esi,%esi
0xc011ab51 <task_to_steal+49>: je 0xc011ad6a <task_to_steal+586>
0xc011ab57 <task_to_steal+55>: mov %eax,0xffffffe4(%ebp)
0xc011ab5a <task_to_steal+58>: movl $0x0,0xfffffff0(%ebp)
0xc011ab61 <task_to_steal+65>: mov 0xffffffe4(%ebp),%ebx
0xc011ab64 <task_to_steal+68>: add $0x4,%ebx
0xc011ab67 <task_to_steal+71>: mov %ebx,0xffffffd0(%ebp)
0xc011ab6a <task_to_steal+74>: lea 0x0(%esi),%esi
0xc011ab70 <task_to_steal+80>: mov 0xfffffff0(%ebp),%ebx
0xc011ab73 <task_to_steal+83>: test %ebx,%ebx
0xc011ab75 <task_to_steal+85>: jne 0xc011acec <task_to_steal+460>
0xc011ab7b <task_to_steal+91>: mov 0xffffffe4(%ebp),%edx
0xc011ab7e <task_to_steal+94>: mov 0x4(%edx),%eax
0xc011ab81 <task_to_steal+97>: test %eax,%eax
0xc011ab83 <task_to_steal+99>: jne 0xc011ace4 <task_to_steal+452>
0xc011ab89 <task_to_steal+105>: mov 0xffffffd0(%ebp),%ecx
0xc011ab8c <task_to_steal+108>: mov 0x4(%ecx),%eax
0xc011ab8f <task_to_steal+111>: test %eax,%eax
0xc011ab91 <task_to_steal+113>: jne 0xc011acd9 <task_to_steal+441>
0xc011ab97 <task_to_steal+119>: mov 0xffffffd0(%ebp),%ebx
0xc011ab9a <task_to_steal+122>: mov 0x8(%ebx),%eax
0xc011ab9d <task_to_steal+125>: test %eax,%eax
0xc011ab9f <task_to_steal+127>: jne 0xc011acce <task_to_steal+430>
0xc011aba5 <task_to_steal+133>: mov 0xffffffd0(%ebp),%edx
0xc011aba8 <task_to_steal+136>: mov 0xc(%edx),%eax
0xc011abab <task_to_steal+139>: test %eax,%eax
0xc011abad <task_to_steal+141>: je 0xc011acbf <task_to_steal+415>
0xc011abb3 <task_to_steal+147>: bsf %eax,%eax
0xc011abb6 <task_to_steal+150>: add $0x60,%eax
0xc011abb9 <task_to_steal+153>: mov %eax,0xfffffff0(%ebp)
0xc011abbc <task_to_steal+156>: cmpl $0x8c,0xfffffff0(%ebp)
0xc011abc3 <task_to_steal+163>: je 0xc011ac9e <task_to_steal+382>
0xc011abc9 <task_to_steal+169>: mov 0xfffffff0(%ebp),%ebx
0xc011abcc <task_to_steal+172>: mov 0xffffffe4(%ebp),%eax
0xc011abcf <task_to_steal+175>: mov 0xc034b4e0,%edx
0xc011abd5 <task_to_steal+181>: lea 0x18(%eax,%ebx,8),%ebx
0xc011abd9 <task_to_steal+185>: mov %ebx,0xffffffe0(%ebp)
0xc011abdc <task_to_steal+188>: mov 0x4(%ebx),%ebx
0xc011abdf <task_to_steal+191>: mov %edx,0xffffffcc(%ebp)
0xc011abe2 <task_to_steal+194>: lea 0x0(%esi,1),%esi
0xc011abe9 <task_to_steal+201>: lea 0x0(%edi,1),%edi
0xc011abf0 <task_to_steal+208>: lea 0xffffffe0(%ebx),%edi
0xc011abf3 <task_to_steal+211>: mov 0xc0348e68,%eax
0xc011abf8 <task_to_steal+216>: mov 0x30(%edi),%edx
0xc011abfb <task_to_steal+219>: sub %edx,%eax
0xc011abfd <task_to_steal+221>: cmp 0xffffffcc(%ebp),%eax
0xc011ac00 <task_to_steal+224>: jbe 0xc011ac70 <task_to_steal+336>
0xc011ac02 <task_to_steal+226>: mov 0x8(%ebp),%ecx
0xc011ac05 <task_to_steal+229>: mov 0x14(%ecx),%ecx
0xc011ac08 <task_to_steal+232>: cmp %ecx,%edi
0xc011ac0a <task_to_steal+234>: mov %ecx,0xffffffc8(%ebp)
0xc011ac0d <task_to_steal+237>: je 0xc011ac70 <task_to_steal+336>
0xc011ac0f <task_to_steal+239>: movzbl 0xc(%ebp),%ecx
0xc011ac13 <task_to_steal+243>: mov 0x38(%edi),%eax
0xc011ac16 <task_to_steal+246>: shr %cl,%eax
0xc011ac18 <task_to_steal+248>: and $0x1,%eax
0xc011ac1b <task_to_steal+251>: je 0xc011ac70 <task_to_steal+336>
0xc011ac1d <task_to_steal+253>: mov 0x48(%edi),%esi
0xc011ac20 <task_to_steal+256>: test %esi,%esi
0xc011ac22 <task_to_steal+258>: jne 0xc011ac83 <task_to_steal+355>
0xc011ac24 <task_to_steal+260>: mov 0xc0348e68,%eax
0xc011ac29 <task_to_steal+265>: xor %edx,%edx
0xc011ac2b <task_to_steal+267>: mov $0x63,%esi
0xc011ac30 <task_to_steal+272>: mov 0x30(%edi),%ecx
0xc011ac33 <task_to_steal+275>: sub %ecx,%eax
0xc011ac35 <task_to_steal+277>: mov 0x44(%edi),%ecx
0xc011ac38 <task_to_steal+280>: divl 0xffffffcc(%ebp)
0xc011ac3b <task_to_steal+283>: mov 0xffffffc8(%ebp),%edx
0xc011ac3e <task_to_steal+286>: cmp $0x64,%eax
0xc011ac41 <task_to_steal+289>: cmovl %eax,%esi
0xc011ac44 <task_to_steal+292>: cmp %ecx,0xffffffec(%ebp)
0xc011ac47 <task_to_steal+295>: lea 0x64(%esi),%eax
0xc011ac4a <task_to_steal+298>: cmove %eax,%esi
0xc011ac4d <task_to_steal+301>: mov 0x4(%edx),%eax
0xc011ac50 <task_to_steal+304>: lea 0xffffff9c(%esi),%edx
0xc011ac53 <task_to_steal+307>: mov 0xc(%eax),%eax
0xc011ac56 <task_to_steal+310>: mov 0xc034afe0(,%eax,4),%eax
0xc011ac5d <task_to_steal+317>: sar $0x4,%eax
0xc011ac60 <task_to_steal+320>: cmp %eax,%ecx
0xc011ac62 <task_to_steal+322>: cmove %edx,%esi
0xc011ac65 <task_to_steal+325>: cmp 0xffffffdc(%ebp),%esi
0xc011ac68 <task_to_steal+328>: jle 0xc011ac70 <task_to_steal+336>
0xc011ac6a <task_to_steal+330>: mov %esi,0xffffffdc(%ebp)
0xc011ac6d <task_to_steal+333>: mov %edi,0xffffffe8(%ebp)
0xc011ac70 <task_to_steal+336>: mov (%ebx),%ebx
0xc011ac72 <task_to_steal+338>: cmp 0xffffffe0(%ebp),%ebx
0xc011ac75 <task_to_steal+341>: jne 0xc011abf0 <task_to_steal+208>
0xc011ac7b <task_to_steal+347>: incl 0xfffffff0(%ebp)
0xc011ac7e <task_to_steal+350>: jmp 0xc011ab70 <task_to_steal+80>
0xc011ac83 <task_to_steal+355>: mov %edi,(%esp,1)
0xc011ac86 <task_to_steal+358>: call 0xc0118070 <upd_node_mem>
0xc011ac8b <task_to_steal+363>: mov 0x8(%ebp),%edx
0xc011ac8e <task_to_steal+366>: mov 0xc034b4e0,%eax
0xc011ac93 <task_to_steal+371>: mov %eax,0xffffffcc(%ebp)
0xc011ac96 <task_to_steal+374>: mov 0x14(%edx),%edx
0xc011ac99 <task_to_steal+377>: mov %edx,0xffffffc8(%ebp)
0xc011ac9c <task_to_steal+380>: jmp 0xc011ac24 <task_to_steal+260>
0xc011ac9e <task_to_steal+382>: mov 0x8(%ebp),%eax
0xc011aca1 <task_to_steal+385>: mov 0xffffffe4(%ebp),%edx
0xc011aca4 <task_to_steal+388>: cmp 0x20(%eax),%edx
0xc011aca7 <task_to_steal+391>: jne 0xc011acb4 <task_to_steal+404>
0xc011aca9 <task_to_steal+393>: mov 0x1c(%eax),%ecx
0xc011acac <task_to_steal+396>: mov %ecx,0xffffffe4(%ebp)
0xc011acaf <task_to_steal+399>: jmp 0xc011ab5a <task_to_steal+58>
0xc011acb4 <task_to_steal+404>: mov 0xffffffe8(%ebp),%eax
0xc011acb7 <task_to_steal+407>: add $0x30,%esp
0xc011acba <task_to_steal+410>: pop %ebx
0xc011acbb <task_to_steal+411>: pop %esi
0xc011acbc <task_to_steal+412>: pop %edi
0xc011acbd <task_to_steal+413>: pop %ebp
0xc011acbe <task_to_steal+414>: ret
0xc011acbf <task_to_steal+415>: mov 0xffffffd0(%ebp),%ecx
0xc011acc2 <task_to_steal+418>: bsf 0x10(%ecx),%eax
0xc011acc6 <task_to_steal+422>: sub $0xffffff80,%eax
0xc011acc9 <task_to_steal+425>: jmp 0xc011abb9 <task_to_steal+153>
0xc011acce <task_to_steal+430>: bsf %eax,%eax
0xc011acd1 <task_to_steal+433>: add $0x40,%eax
0xc011acd4 <task_to_steal+436>: jmp 0xc011abb9 <task_to_steal+153>
0xc011acd9 <task_to_steal+441>: bsf %eax,%eax
0xc011acdc <task_to_steal+444>: add $0x20,%eax
0xc011acdf <task_to_steal+447>: jmp 0xc011abb9 <task_to_steal+153>
0xc011ace4 <task_to_steal+452>: bsf %eax,%eax
0xc011ace7 <task_to_steal+455>: jmp 0xc011abb9 <task_to_steal+153>
0xc011acec <task_to_steal+460>: mov 0xfffffff0(%ebp),%eax
0xc011acef <task_to_steal+463>: xor %esi,%esi
0xc011acf1 <task_to_steal+465>: mov 0xfffffff0(%ebp),%ecx
0xc011acf4 <task_to_steal+468>: mov 0xffffffd0(%ebp),%ebx
0xc011acf7 <task_to_steal+471>: sar $0x5,%eax
0xc011acfa <task_to_steal+474>: and $0x1f,%ecx
0xc011acfd <task_to_steal+477>: lea (%ebx,%eax,4),%edi
0xc011ad00 <task_to_steal+480>: je 0xc011ad2b <task_to_steal+523>
0xc011ad02 <task_to_steal+482>: mov (%edi),%eax
0xc011ad04 <task_to_steal+484>: shr %cl,%eax
0xc011ad06 <task_to_steal+486>: bsf %eax,%esi
0xc011ad09 <task_to_steal+489>: jne 0xc011ad10 <task_to_steal+496>
0xc011ad0b <task_to_steal+491>: mov $0x20,%esi
0xc011ad10 <task_to_steal+496>: mov $0x20,%eax
0xc011ad15 <task_to_steal+501>: sub %ecx,%eax
0xc011ad17 <task_to_steal+503>: cmp %eax,%esi
0xc011ad19 <task_to_steal+505>: jge 0xc011ad26 <task_to_steal+518>
0xc011ad1b <task_to_steal+507>: mov 0xfffffff0(%ebp),%edx
0xc011ad1e <task_to_steal+510>: lea (%edx,%esi,1),%eax
0xc011ad21 <task_to_steal+513>: jmp 0xc011abb9 <task_to_steal+153>
0xc011ad26 <task_to_steal+518>: mov %eax,%esi
0xc011ad28 <task_to_steal+520>: add $0x4,%edi
0xc011ad2b <task_to_steal+523>: mov 0xffffffd0(%ebp),%ecx
0xc011ad2e <task_to_steal+526>: mov %edi,%eax
0xc011ad30 <task_to_steal+528>: mov $0x8c,%edx
0xc011ad35 <task_to_steal+533>: mov %edi,%ebx
0xc011ad37 <task_to_steal+535>: sub %ecx,%eax
0xc011ad39 <task_to_steal+537>: shl $0x3,%eax
0xc011ad3c <task_to_steal+540>: sub %eax,%edx
0xc011ad3e <task_to_steal+542>: add $0x1f,%edx
0xc011ad41 <task_to_steal+545>: shr $0x5,%edx
0xc011ad44 <task_to_steal+548>: mov %edx,0xffffffd4(%ebp)
0xc011ad47 <task_to_steal+551>: mov %edx,%ecx
0xc011ad49 <task_to_steal+553>: xor %eax,%eax
0xc011ad4b <task_to_steal+555>: repz scas %es:(%edi),%eax
0xc011ad4d <task_to_steal+557>: je 0xc011ad55 <task_to_steal+565>
0xc011ad4f <task_to_steal+559>: lea 0xfffffffc(%edi),%edi
0xc011ad52 <task_to_steal+562>: bsf (%edi),%eax
0xc011ad55 <task_to_steal+565>: sub %ebx,%edi
0xc011ad57 <task_to_steal+567>: shl $0x3,%edi
0xc011ad5a <task_to_steal+570>: add %edi,%eax
0xc011ad5c <task_to_steal+572>: mov %eax,%edx
0xc011ad5e <task_to_steal+574>: mov 0xfffffff0(%ebp),%eax
0xc011ad61 <task_to_steal+577>: add %esi,%eax
0xc011ad63 <task_to_steal+579>: add %edx,%eax
0xc011ad65 <task_to_steal+581>: jmp 0xc011abb9 <task_to_steal+153>
0xc011ad6a <task_to_steal+586>: mov 0x8(%ebp),%ecx
0xc011ad6d <task_to_steal+589>: mov 0x1c(%ecx),%ecx
0xc011ad70 <task_to_steal+592>: jmp 0xc011acac <task_to_steal+396>
0xc011ad75 <task_to_steal+597>: nop
0xc011ad76 <task_to_steal+598>: lea 0x0(%esi),%esi
0xc011ad79 <task_to_steal+601>: lea 0x0(%edi,1),%edi
End of assembler dump.

2002-10-26 00:01:04

by Martin J. Bligh

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

>> I thought this problem is well understood! For some reasons independent of
>> my patch you have to boot your machines with the "notsc" option. This
>> leaves the cache_decay_ticks variable initialized to zero which my patch
>> doesn't like. I'm trying to deal with this inside the patch but there is
>> still a small window when the variable is zero. In my opinion this needs
>> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
>> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
>> affinity which you absolutely need! So even if "notsc" is a legal option,
>> it should be fixed such that it doesn't leave your machine without cache
>> affinity. That would anyway give you a falsified behavior of the O(1)
>> scheduler.

> EIP is at task_to_steal+0x118/0x260

This turned out to be:

weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks;

So I guess that window is still biting you. I'll see if I can fix it properly.
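
(A minimal sketch -- not the actual fix -- of the kind of guard that would
avoid the trap while cache_decay_ticks is still zero during early boot:)

    /* Hypothetical guard only: fall back to one tick so the divide can
     * never trap before cache_decay_ticks gets its real value. */
    unsigned long decay = cache_decay_ticks ? cache_decay_ticks : 1;

    weight = (jiffies - tmp->sleep_timestamp) / decay;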

M.

2002-10-26 00:52:12

by Rob Landley

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

On Thursday 24 October 2002 19:25, Jim Houston wrote:
> Hi Rob,
>
> The Posix timers entry in your list is confused. I don't know how
> my patch got the name Google.

Sorry, misread "George's version" as "Google's version" at 5 am one morning.
Lot of late nights recently... :)

> I think Dan Kegel misunderstood George's answer to my previous
> announcement. George might be picking up some of my changes, but there
> will still be two patches for Linus to choose from. You included the URL to
> George's answer which quoted my patch, rather than the URL I sent you.

Had it in, then took it out. I'm trying to collate down the list wherever I
can.

> Here is the URL for an archived copy of my latest patch:
> Jim Houston's [PATCH] alternate Posix timer patch3
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103549000027416&w=2

It's back now.

> I would be happy to see either version go into 2.5.

So what exactly is the difference between them?

> The URLs for George's patches are incomplete. I believe this is the
> most recent (it's from Oct 18). The Sourceforge.net reference has the
> user space library and test programs, but I did not see 2.5 kernel
> patches.
>
> [PATCH ] POSIX clocks & timers take 3 (NOT HIGH RES)
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103489669622397&w=2

He's up to version 4 now.

> Thanks
> Jim Houston - Concurrent Computer Corp.

Rob

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?

2002-10-26 08:39:10

by George Anzinger

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

Rob Landley wrote:
>
> On Thursday 24 October 2002 19:25, Jim Houston wrote:
> > Hi Rob,
> >
> > The Posix timers entry in your list is confused. I don't know how
> > my patch got the name Google.
>
> Sorry, misread "George's version" as "Google's version" at 5 am one morning.
> Lot of late nights recently... :)
>
> > I think Dan Kegel misunderstood George's answer to my previous
> > announcement. George might be picking up some of my changes, but there
> > will still be two patches for Linus to choose from. You included the URL to
> > George's answer which quoted my patch, rather than the URL I sent you.
>
> Had it in, then took it out. I'm trying to collate down the list wherever I
> can.
>
> > Here is the URL for an archived copy of my latest patch:
> > Jim Houston's [PATCH] alternate Posix timer patch3
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103549000027416&w=2
>
> It's back now.
>
> > I would be happy to see either version go into 2.5.
>
> So what exactly is the difference between them?

First, to answer your question about the order of things in
my patches: the 4 patches should be applied in this order:

First, the posix patch. It introduces the POSIX clocks &
timers to the system. It is not high res and stands alone.
The rest of the patches are all about doing the high res
timers:

The 3 parts to the high res timers are:

  core     The core kernel (i.e. platform independent) changes
  i386     The high-res changes for the i386 (x86) platform
  posixhr  The changes to the POSIX clocks & timers patch to
           use high-res timers

This last is almost entirely contained to the one file
(.../kernel/posix_timers.c). The "almost" is because it
adds a member to the posix timers structure which is defined
in sched.h.

Now, as to the differences between my patches and Jim's.
Jim's patch is an alternate for the first or "posix" patch
only. Since I picked up a variation on his id allocator,
thus removing the configuration option for the maximum
number of timers, the principal difference is that Jim keeps
the posix timers in a separate list, whereas my patch puts
them in the same list (i.e. the add_timer list) as all other
timers. I assume (not having looked in detail at his latest
patch) that he uses the system's add_timers to drive the
timers in this list, and thus has a two-stage expiry
algorithm (a. the add_timer pop, which then b. causes a
check of this new list).

Jim has also attempted to address the clock_nanosleep()
interaction with signals problem. In short, the standard
says that signals that do not actually cause a handler in
the user code to run are NOT supposed to interrupt a sleep.
The straightforward way to do this is to interrupt the
sleep on the signal, call do_signal() to deliver the signal,
check the return to see if it invoked a user handler (it
returns 1 in this case, else 0), and either continue the
sleep or return. The problem is that do_signal() requires
&regs as a parameter, and this is passed to system calls in
different ways on the various platforms. ALL other system
calls that call do_signal() reside in platform dependent
code, most likely for this reason.
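
(For illustration only -- not part of either patch -- a minimal user-space
sketch of the rule just described, assuming a libc that provides
clock_nanosleep(): a signal that never runs a user handler leaves the
sleep intact, while one that does returns EINTR.)

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void handler(int sig) { (void)sig; }

    int main(void)
    {
        struct timespec req = { 2, 0 };  /* ask for a two second sleep */
        int r;

        signal(SIGALRM, SIG_IGN);        /* no user handler will run */
        alarm(1);
        r = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
        printf("ignored SIGALRM: %d (expect 0, sleep ran to the end)\n", r);

        signal(SIGALRM, handler);        /* a user handler does run */
        alarm(1);
        r = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
        printf("handled SIGALRM: %d (expect EINTR = %d)\n", r, EINTR);
        return 0;
    }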

My solution for this problem is to provide a couple of
macros in linux/signal.h and linux/asm-i386/signal.h to
define the entry sequence for clock_nanosleep (and nanosleep
as it is now just a call to clock_nanosleep). The macros in
linux/signal.h are general purpose and do NOT actually solve
the problem, but they do allow other platforms to work,
although without the standard-required signal handling.
These are only defined if the asm/signal.h does not supply
an alternative. This allows each platform to customize the
entry to clock_nanosleep() to pass in regs in whatever way
works for that platform. I fully admit that this is a VERY
messy bit of code, BUT at the same time, it works. I am
fully prepared to change to a cleaner solution should one
arise.

Jim has NOT provided high res timers as yet, and thus does
not have any code to replace the 3 high res patches. I
don't know if he is attempting to do this code. I suspect
he is not, but he did indicate that he wants to expand his
posix timers to be high res. If he does this, I suspect
that it would be his version of the "hrposix" patch.
>
> > The URLs for George's patches are incomplete. I believe this is the
> > most recent (it's from Oct 18). The Sourceforge.net reference has the
> > user space library and test programs, but I did not see 2.5 kernel
> > patches.
> >
> > [PATCH ] POSIX clocks & timers take 3 (NOT HIGH RES)
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103489669622397&w=2
>
> He's up to version 4 now.

As I said in another post, don't trust these archives; they
truncate long posts to less than what the lkml allows. In
particular, they have truncated my patches. The full set of
4 patches is available here:

http://sourceforge.net/projects/high-res-timers/

or, to save a few clicks:

http://sourceforge.net/project/showfiles.php?group_id=20460&release_id=118345

Please do read the notes; they tell about the order of
application, which is fixed, i.e.:

  hrtimers-posix    The POSIX clock/timers interface, low res.
  hrtimers-core     The core system high res patch.
  hrtimers-i386     The high res code for the i386 platform.
  hrtimers-hrposix  The patch to move the low res posix patch
                    to high res.


--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-26 09:01:55

by George Anzinger

Subject: Re: highres timers question...

Rob Landley wrote:
>
> I'm guessing that of the patches here:
>
> http://sourceforge.net/projects/high-res-timers
>
> The -posix one adds posix support on top of the base high-res timers patch?
>
> (Did I guess right?)

Uh, no. We made the command decision that even IF he does
not let in the high-res stuff we would like the POSIX API in
the kernel. Thus the patches are structured to require the
POSIX patch first. This can be changed if need be, but that
is the way it is now.
>
> Rob
>
> --
> http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
> CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-26 18:54:22

by Martin J. Bligh

Subject: Re: Crunch time -- the musical. (2.5 merge candidate list 1.5)

>> I still haven't been able to get your scheduler to boot for about
>> the last month without crashing the system. Andrew says he has it
>> booting somehow on 2.5.44-mm4, so I'll steal his kernel tommorow and
>> see how it looks. If the numbers look good for doing boring things
>> like kernel compile, SDET, etc, I'm happy.
>
> I thought this problem is well understood! For some reasons independent of
> my patch you have to boot your machines with the "notsc" option. This
> leaves the cache_decay_ticks variable initialized to zero which my patch
> doesn't like. I'm trying to deal with this inside the patch but there is
> still a small window when the variable is zero. In my opinion this needs
> to be fixed somewhere in arch/i386/kernel/smpboot.c. Booting a machine
> with cache_decay_ticks=0 is pure nonsense, as it switches off cache
> affinity which you absolutely need! So even if "notsc" is a legal option,
> it should be fixed such that it doesn't leave your machine without cache
> affinity. That would anyway give you a falsified behavior of the O(1)
> scheduler.

Oh, not sure if I ever replied to this or not. I don't *have* to boot
with notsc, I just usually do. And it crashed either way, so it's a
different problem (changing versions of gcc seems to perturb it too).
BUT ... your new patches 1 and 2 don't have this problem. See followup
email in a second.

M.

2002-10-26 19:11:16

by Martin J. Bligh

Subject: NUMA scheduler (was: 2.5 merge candidate list 1.5)

>> From my point of view, the reason for focussing on this was that
>> your scheduler degraded the performance on my machine, rather than
>> boosting it. Half of that was the more complex stuff you added on
>> top ... it's a lot easier to start with something simple that works
>> and build on it, than fix something that's complex and doesn't work
>> well.
>
> You're talking about one of the first 2.5 versions of the patch. It
> changed a lot since then, thanks to your feedback, too.

OK, I went to your latest patches (just 1 and 2). And they worked!
You've fixed the performance degradation problems for kernel compile
(now a 14% improvement in systime), that core set works without
further futzing about or crashing, with or without TSC, on either
version of gcc ... congrats!

It also produces the fastest system time for kernel compile I've ever
seen ... this core set seems to be good (I'm still less than convinced
about the further patches, but we can work on those one at a time now
that you've got it all broken out and modular). Michael posted slightly
different-looking results for virgin 44 yesterday - the main difference
between virgin 44 and 44-mm4 for this stuff is probably the per-cpu
hot & cold pages (Ingo, this is like your original per-cpu pages).

All results are for a 16-way NUMA-Q (P3 700MHz 2Mb cache) 16Gb RAM.

Kernbench:
Elapsed User System CPU
2.5.44-mm4 19.676s 192.794s 42.678s 1197.4%
2.5.44-mm4-hbaum 19.422s 189.828s 40.204s 1196.2%
2.5.44-mm4-focht12 19.316s 189.514s 36.704s 1146.8%

Schedbench 4:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 32.45 49.47 129.86 0.82
2.5.44-mm4-hbaum 31.31 43.85 125.29 0.84
2.5.44-mm4-focht12 38.50 45.34 154.05 1.07

Schedbench 8:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 39.90 61.48 319.26 2.79
2.5.44-mm4-hbaum 32.63 46.56 261.10 1.99
2.5.44-mm4-focht12 35.56 46.57 284.53 1.97

Schedbench 16:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 62.99 93.59 1008.01 5.11
2.5.44-mm4-hbaum 49.78 76.71 796.68 4.43
2.5.44-mm4-focht12 51.94 61.43 831.26 4.68

Schedbench 32:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 88.13 194.53 2820.54 11.52
2.5.44-mm4-hbaum 54.67 147.30 1749.77 7.91
2.5.44-mm4-focht12 55.43 119.49 1773.97 8.41

Schedbench 64:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 159.92 653.79 10235.93 25.16
2.5.44-mm4-hbaum 65.20 300.58 4173.26 16.82
2.5.44-mm4-focht12 56.49 235.78 3615.71 18.05

There's a small degradation at the low end of schedbench (Erich's
numa_test) in there ... would be nice to fix, but I'm less worried
about that (where the machine is lightly loaded) than the other
numbers. Kernbench is just gcc-2.95-4 compiling the 2.4.17 kernel
doing a "make -j24 bzImage".

diffprofile 2.5.44-mm4 2.5.44-mm4-hbaum
(for kernbench, + got worse by adding the patch, - got better)

184 vm_enough_memory
154 d_lookup
83 do_schedule
75 page_add_rmap
73 strnlen_user
58 find_get_page
52 flush_signal_handlers
...
-61 pte_alloc_one
-63 do_wp_page
-85 .text.lock.file_table
-96 __set_page_dirty_buffers
-112 clear_page_tables
-118 get_empty_filp
-134 free_hot_cold_page
-144 page_remove_rmap
-150 __copy_to_user
-213 zap_pte_range
-217 buffered_rmqueue
-875 __copy_from_user
-1015 do_anonymous_page

diffprofile 2.5.44-mm4 2.5.44-mm4-focht12
(for kernbench, + got worse by adding the patch, - got better)

<nothing significantly degraded>
....
-57 path_lookup
-69 do_page_fault
-73 vm_enough_memory
-77 filemap_nopage
-78 do_no_page
-83 __set_page_dirty_buffers
-83 __fput
-84 do_schedule
-97 find_get_page
-106 file_move
-115 free_hot_cold_page
-115 clear_page_tables
-130 d_lookup
-147 atomic_dec_and_lock
-157 page_add_rmap
-197 buffered_rmqueue
-236 zap_pte_range
-264 get_empty_filp
-271 __copy_to_user
-464 page_remove_rmap
-573 .text.lock.file_table
-618 __copy_from_user
-823 do_anonymous_page
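
[Editor's note: for readers unfamiliar with the diffprofile output above,
the comparison boils down to subtracting the per-symbol sample counts of
the baseline profile from the patched one, so a positive delta means the
symbol got more expensive with the patch. A tiny self-contained sketch,
with made-up symbol names and counts:]

#include <stdio.h>

struct sample { const char *symbol; long baseline; long patched; };

int main(void)
{
	/* made-up numbers, just to show the arithmetic */
	struct sample profile[] = {
		{ "do_anonymous_page", 5000, 3985 },
		{ "__copy_from_user",  4200, 3325 },
		{ "d_lookup",          1100, 1254 },
	};
	size_t i, n = sizeof(profile) / sizeof(profile[0]);

	for (i = 0; i < n; i++) {
		long delta = profile[i].patched - profile[i].baseline;
		/* positive: more samples (worse) with the patch applied */
		printf("%+6ld %s\n", delta, profile[i].symbol);
	}
	return 0;
}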


2002-10-27 18:13:24

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

> OK, I went to your latest patches (just 1 and 2). And they worked!
> You've fixed the performance degradation problems for kernel compile
> (now a 14% improvement in systime), that core set works without
> further futzing about or crashing, with or without TSC, on either
> version of gcc ... congrats!

So I have a slight correction to make to the above ;-) Your patches
do work just fine, no crashes any more. HOWEVER ... turns out I only
had the first patch installed, not both. Silly mistake, but turns out
to be very interesting.

So your second patch is the balance on exec stuff ... I've looked at
it, and think it's going to be very expensive to do in practice, at
least the simplistic "recalc everything on every exec" approach. It
does benefit the low end schedbench results, but not the high end ones,
and you can see the cost of your second patch in the system times of
the kernbench.

In summary, I think I like the first patch alone better than the
combination, but will have a play at making a cross between the two.
As I have very little context about the scheduler, would appreciate
any help anyone would like to volunteer ;-)

Corrected results are:

Kernbench:
Elapsed User System CPU
2.5.44-mm4 19.676s 192.794s 42.678s 1197.4%
2.5.44-mm4-hbaum 19.422s 189.828s 40.204s 1196.2%
2.5.44-mm4-focht-1 19.46s 189.838s 37.938s 1171%
2.5.44-mm4-focht-12 20.32s 190s 44.4s 1153.6%

Schedbench 4:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 32.45 49.47 129.86 0.82
2.5.44-mm4-hbaum 31.31 43.85 125.29 0.84
2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85

Schedbench 8:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 39.90 61.48 319.26 2.79
2.5.44-mm4-hbaum 32.63 46.56 261.10 1.99
2.5.44-mm4-focht-1 37.76 61.09 302.17 2.55
2.5.44-mm4-focht-12 28.40 34.43 227.25 2.09

Schedbench 16:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 62.99 93.59 1008.01 5.11
2.5.44-mm4-hbaum 49.78 76.71 796.68 4.43
2.5.44-mm4-focht-1 51.69 60.23 827.20 4.95
2.5.44-mm4-focht-12 51.24 60.86 820.08 4.23

Schedbench 32:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 88.13 194.53 2820.54 11.52
2.5.44-mm4-hbaum 54.67 147.30 1749.77 7.91
2.5.44-mm4-focht-1 56.71 123.62 1815.12 7.92
2.5.44-mm4-focht-12 55.69 118.85 1782.25 7.28

Schedbench 64:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 159.92 653.79 10235.93 25.16
2.5.44-mm4-hbaum 65.20 300.58 4173.26 16.82
2.5.44-mm4-focht-1 55.60 232.36 3558.98 17.61
2.5.44-mm4-focht-12 56.03 234.45 3586.46 15.76


2002-10-27 23:26:48

by Erich Focht

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Sunday 27 October 2002 19:16, Martin J. Bligh wrote:
> > OK, I went to your latest patches (just 1 and 2). And they worked!
> > You've fixed the performance degradation problems for kernel compile
> > (now a 14% improvement in systime), that core set works without
> > further futzing about or crashing, with or without TSC, on either
> > version of gcc ... congrats!
>
> So I have a slight correction to make to the above ;-) Your patches
> do work just fine, no crashes any more. HOWEVER ... turns out I only
> had the first patch installed, not both. Silly mistake, but turns out
> to be very interesting.
>
> So your second patch is the balance on exec stuff ... I've looked at
> it, and think it's going to be very expensive to do in practice, at
> least the simplistic "recalc everything on every exec" approach. It
> does benefit the low end schedbench results, but not the high end ones,
> and you can see the cost of your second patch in the system times of
> the kernbench.

This is interesting, indeed. As you might have seen from the tests I
posted on LKML, I could not see that effect on our IA64 NUMA machine.
Which raises the question: is it expensive to recalculate the load
when doing an exec (which I should also see) or is the strategy of
equally distributing the jobs across the nodes bad for certain
load+architecture combinations? As I'm not seeing the effect, maybe
you could do the following experiment:
In sched_best_node() keep only the "while" loop at the beginning. This
leads to a cheap selection of the next node, just a simple round robin.

Regarding the schedbench results: are they averages over multiple runs?
The numa_test needs to be repeated a few times to get statistically
meaningful results.

Thanks,
Erich

> In summary, I think I like the first patch alone better than the
> combination, but will have a play at making a cross between the two.
> As I have very little context about the scheduler, would appreciate
> any help anyone would like to volunteer ;-)
>
> Corrected results are:
>
> Kernbench:
> Elapsed User System CPU
> 2.5.44-mm4 19.676s 192.794s 42.678s 1197.4%
> 2.5.44-mm4-hbaum 19.422s 189.828s 40.204s 1196.2%
> 2.5.44-mm4-focht-1 19.46s 189.838s 37.938s 1171%
> 2.5.44-mm4-focht-12 20.32s 190s 44.4s 1153.6%
>
> Schedbench 4:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 32.45 49.47 129.86 0.82
> 2.5.44-mm4-hbaum 31.31 43.85 125.29 0.84
> 2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
> 2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
>
> Schedbench 8:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 39.90 61.48 319.26 2.79
> 2.5.44-mm4-hbaum 32.63 46.56 261.10 1.99
> 2.5.44-mm4-focht-1 37.76 61.09 302.17 2.55
> 2.5.44-mm4-focht-12 28.40 34.43 227.25 2.09
>
> Schedbench 16:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 62.99 93.59 1008.01 5.11
> 2.5.44-mm4-hbaum 49.78 76.71 796.68 4.43
> 2.5.44-mm4-focht-1 51.69 60.23 827.20 4.95
> 2.5.44-mm4-focht-12 51.24 60.86 820.08 4.23
>
> Schedbench 32:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 88.13 194.53 2820.54 11.52
> 2.5.44-mm4-hbaum 54.67 147.30 1749.77 7.91
> 2.5.44-mm4-focht-1 56.71 123.62 1815.12 7.92
> 2.5.44-mm4-focht-12 55.69 118.85 1782.25 7.28
>
> Schedbench 64:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 159.92 653.79 10235.93 25.16
> 2.5.44-mm4-hbaum 65.20 300.58 4173.26 16.82
> 2.5.44-mm4-focht-1 55.60 232.36 3558.98 17.61
> 2.5.44-mm4-focht-12 56.03 234.45 3586.46 15.76

2002-10-27 23:49:28

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML, I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see) or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations?

I suspect the former. Bouncing a whole pile of cachelines every time
would be much more expensive for me than it would for you, and
kernbench will be heavy on exec.

> As I'm not seeing the effect, maybe
> you could do the following experiment:
> In sched_best_node() keep only the "while" loop at the beginning. This
> leads to a cheap selection of the next node, just a simple round robin.

Maybe I could just send you the profiles instead ;-)
If I have more time, I'll try your suggestion.
I'm trying Michael's balance_exec on top of your patch 1 at the
moment, but I'm somewhat confused by his code for sched_best_cpu.

+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, best_cpu, cur_cpu, node;
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
+	if (++node >= numnodes)
+		node = 0;
+
+	cur_cpu = __node_to_first_cpu(node);
+	minload = cpu_rq(best_cpu)->nr_running;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(cur_cpu))
+			continue;
+
+		if (minload > cpu_rq(cur_cpu)->nr_running) {
+			minload = cpu_rq(cur_cpu)->nr_running;
+			best_cpu = cur_cpu;
+		}
+		if (++cur_cpu >= NR_CPUS)
+			cur_cpu = 0;
+	}
+	__get_cpu_var(last_exec_cpu) = best_cpu;
+	return best_cpu;
+}

Michael, the way I read the NR_CPUS loop, you walk every cpu
in the system, and take the best from all of them. In which case
what's the point of the last_exec_cpu stuff? On the other hand,
I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
and it got worse. So perhaps I'm just misreading your code ...
and it does seem significantly cheaper to execute than Erich's.

Erich, on the other hand, your code does this:

+void sched_balance_exec(void)
+{
+	int new_cpu, new_node=0;
+
+	while (pooldata_is_locked())
+		cpu_relax();
+	if (numpools > 1) {
+		new_node = sched_best_node(current);
+	}
+	new_cpu = sched_best_cpu(current, new_node);
+	if (new_cpu != smp_processor_id())
+		sched_migrate_task(current, new_cpu);
+}

which seems to me to walk every runqueue in the system (in
sched_best_node), then walk one node's worth all over again
in sched_best_cpu .... doesn't it? Again, I may be misreading
this ... haven't looked at the scheduler much. But I can't
help feeling some sort of lazy evaluation is in order ....

And what's this doing?

+	do {
+		/* atomic_inc_return is not implemented on all archs [EF] */
+		atomic_inc(&sched_node);
+		best_node = atomic_read(&sched_node) % numpools;
+	} while (!(pool_mask[best_node] & mask));

I really don't think putting a global atomic in there is going to
be cheap ....

> Regarding the schedbench results: are they averages over multiple runs?
> The numa_test needs to be repeated a few times to get statistically
> meaningful results.

No. But I don't have 2 hours to run each set of tests either. I did
a couple of runs, and didn't see huge variances. Seems stable enough.

M.
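
[Editor's note: a small user-space sketch of the core of the scan discussed
above: start the walk at the node after the last exec'd CPU's node, then
take the CPU with the shortest run queue, wrapping around all CPUs. The
cpu_online() test and the nr_running <= 2 early return are omitted, and the
topology, load numbers, and function name are invented for illustration;
this is not the kernel code itself.]

#include <stdio.h>

#define NR_CPUS       16
#define CPUS_PER_NODE 4

/* invented instantaneous run-queue lengths, indexed by CPU */
static int nr_running[NR_CPUS] = { 3, 2, 4, 1,  5, 0, 2, 3,
                                   1, 1, 2, 2,  4, 3, 1, 2 };
static int last_exec_cpu;	/* per-CPU in the real patch, one global here */

static int best_cpu_scan(int current_cpu)
{
	int node, cur_cpu, i;
	int best_cpu = current_cpu;
	int minload = nr_running[best_cpu];

	/* start the scan in the node after the one we last exec'd on */
	node = (last_exec_cpu / CPUS_PER_NODE + 1) % (NR_CPUS / CPUS_PER_NODE);
	cur_cpu = node * CPUS_PER_NODE;

	/* walk every CPU once, remembering the least loaded one */
	for (i = 0; i < NR_CPUS; i++) {
		if (nr_running[cur_cpu] < minload) {
			minload = nr_running[cur_cpu];
			best_cpu = cur_cpu;
		}
		cur_cpu = (cur_cpu + 1) % NR_CPUS;
	}
	last_exec_cpu = best_cpu;
	return best_cpu;
}

int main(void)
{
	printf("exec on cpu 0 lands on cpu %d\n", best_cpu_scan(0));
	return 0;
}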

2002-10-28 00:28:18

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

OK, so I'm trying to read your patch 1, fairly unsuccessfully
(seems to be a lot more complex than Michael's).

Can you explain pool_lock? It does actually seem to work, but
it's rather confusing ....

build_pools() has a comment above it saying:

+/*
+ * Call pooldata_lock() before calling this function and
+ * pooldata_unlock() after!
+ */

But then you promptly call pooldata_lock inside build_pools
anyway ... looks like it's just a naff comment, but doesn't
help much.

Leaving aside the acknowledged mind-boggling ugliness of
pooldata_lock(), what exactly is this lock protecting, and when?
The only thing that actually calls pooldata_lock is build_pools,
right? And the only other thing that looks at it is sched_balance_exec
via pooldata_is_locked ... can that happen before build_pools
(seems like you're in deep trouble if it does anyway, as it'll
just block). If you really still need to do this, RCU is now
in the kernel ;-) If not, can we just chuck all that stuff?

M.

2002-10-28 00:42:52

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

OK, so I tried Michael's without the balance_exec code as well,
then Erich's main patch with Michael's balance_exec (which seems
to be cheaper to calculate). Turns out I was actually running an
older version of Michael's patch .... with his latest stuff it
actually seems to perform better pretty much across the board
(comparing 2.5.44-mm4-focht-12 and 2.5.44-mm4-hbaum-12). And it's
also a lot simpler.

Erich, what does all the pool stuff actually buy us over what
Michael is doing? Seems to be rather more complex, but maybe
it's useful for something we're just not measuring here?

2.5.44-mm4 Virgin
2.5.44-mm4-focht-1 Focht main
2.5.44-mm4-hbaum-1 Hbaum main
2.5.44-mm4-focht-12 Focht main + Focht balance_exec
2.5.44-mm4-hbaum-12 Hbaum main + Hbaum balance_exec
2.5.44-mm4-f1-h2 Focht main + Hbaum balance_exec

Kernbench:
Elapsed User System CPU
2.5.44-mm4 19.676s 192.794s 42.678s 1197.4%
2.5.44-mm4-focht-1 19.46s 189.838s 37.938s 1171%
2.5.44-mm4-hbaum-1 19.746s 189.232s 38.354s 1152.2%
2.5.44-mm4-focht-12 20.32s 190s 44.4s 1153.6%
2.5.44-mm4-hbaum-12 19.322s 190.176s 40.354s 1192.6%
2.5.44-mm4-f1-h2 19.398s 190.118s 40.06s 1186%

Schedbench 4:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 32.45 49.47 129.86 0.82
2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
2.5.44-mm4-hbaum-1 37.81 46.44 151.26 0.78
2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
2.5.44-mm4-hbaum-12 22.26 34.70 89.09 0.70
2.5.44-mm4-f1-h2 21.39 35.97 85.57 0.81

Schedbench 8:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 39.90 61.48 319.26 2.79
2.5.44-mm4-focht-1 37.76 61.09 302.17 2.55
2.5.44-mm4-hbaum-1 43.18 56.74 345.54 1.71
2.5.44-mm4-focht-12 28.40 34.43 227.25 2.09
2.5.44-mm4-hbaum-12 30.71 45.87 245.75 1.43
2.5.44-mm4-f1-h2 36.11 45.18 288.98 2.10

Schedbench 16:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 62.99 93.59 1008.01 5.11
2.5.44-mm4-focht-1 51.69 60.23 827.20 4.95
2.5.44-mm4-hbaum-1 52.57 61.54 841.38 3.93
2.5.44-mm4-focht-12 51.24 60.86 820.08 4.23
2.5.44-mm4-hbaum-12 52.33 62.23 837.46 3.84
2.5.44-mm4-f1-h2 51.76 60.15 828.33 5.67

Schedbench 32:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 88.13 194.53 2820.54 11.52
2.5.44-mm4-focht-1 56.71 123.62 1815.12 7.92
2.5.44-mm4-hbaum-1 54.57 153.56 1746.45 9.20
2.5.44-mm4-focht-12 55.69 118.85 1782.25 7.28
2.5.44-mm4-hbaum-12 54.36 135.30 1739.95 8.09
2.5.44-mm4-f1-h2 55.97 119.28 1791.39 7.20

Schedbench 64:
Elapsed TotalUser TotalSys AvgUser
2.5.44-mm4 159.92 653.79 10235.93 25.16
2.5.44-mm4-focht-1 55.60 232.36 3558.98 17.61
2.5.44-mm4-hbaum-1 71.48 361.77 4575.45 18.53
2.5.44-mm4-focht-12 56.03 234.45 3586.46 15.76
2.5.44-mm4-hbaum-12 56.91 240.89 3642.99 15.67
2.5.44-mm4-f1-h2 56.48 246.93 3615.32 16.97

2002-10-28 00:50:54

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)


> I'm trying Michael's balance_exec on top of your patch 1 at the
> moment, but I'm somewhat confused by his code for sched_best_cpu.
>
> +static int sched_best_cpu(struct task_struct *p)
> +{
> +	int i, minload, best_cpu, cur_cpu, node;
> +	best_cpu = task_cpu(p);
> +	if (cpu_rq(best_cpu)->nr_running <= 2)
> +		return best_cpu;
> +
> +	node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
> +	if (++node >= numnodes)
> +		node = 0;
> +
> +	cur_cpu = __node_to_first_cpu(node);
> +	minload = cpu_rq(best_cpu)->nr_running;
> +
> +	for (i = 0; i < NR_CPUS; i++) {
> +		if (!cpu_online(cur_cpu))
> +			continue;
> +
> +		if (minload > cpu_rq(cur_cpu)->nr_running) {
> +			minload = cpu_rq(cur_cpu)->nr_running;
> +			best_cpu = cur_cpu;
> +		}
> +		if (++cur_cpu >= NR_CPUS)
> +			cur_cpu = 0;
> +	}
> +	__get_cpu_var(last_exec_cpu) = best_cpu;
> +	return best_cpu;
> +}
>
> Michael, the way I read the NR_CPUS loop, you walk every cpu
> in the system, and take the best from all of them. In which case
> what's the point of the last_exec_cpu stuff? On the other hand,
> I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
> and it got worse. So perhaps I'm just misreading your code ...
> and it does seem significantly cheaper to execute than Erich's.
>
You are reading it correctly. The only thing that the last_exec_cpu
does is to help spread the load across nodes. Without that what was
happening is that node 0 would get completely loaded, then node 1,
etc. With it, in cases where one or more runqueues have the same
length, the one chosen tends to get spread out a bit. Not the
greatest solution, but it helps.
>
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-10-28 04:19:36

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

>> Michael, the way I read the NR_CPUS loop, you walk every cpu
>> in the system, and take the best from all of them. In which case
>> what's the point of the last_exec_cpu stuff? On the other hand,
>> I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
>> and it got worse. So perhaps I'm just misreading your code ...
>> and it does seem significantly cheaper to execute than Erich's.
>>
> You are reading it correctly. The only thing that the last_exec_cpu
> does is to help spread the load across nodes. Without that what was
> happening is that node 0 would get completely loaded, then node 1,
> etc. With it, in cases where one or more runqueues have the same
> length, the one chosen tends to get spread out a bit. Not the
> greatest solution, but it helps.

OK. I made a simple boring optimisation to your patch. Shaved almost
a second off system time for kernbench, and seems idiotproof to me,
shouldn't change anything apart from touching fewer runqueues: if
we find a runqueue with nr_running == 0, stop searching ... we ain't
going to find anything better ;-)

Kernbench:
Elapsed User System CPU
2.5.44-mm4 19.676s 192.794s 42.678s 1197.4%
2.5.44-mm4-hbaum-1 19.746s 189.232s 38.354s 1152.2%
2.5.44-mm4-hbaum-12 19.322s 190.176s 40.354s 1192.6%
2.5.44-mm4-hbaum-12-firstzero 19.292s 189.66s 39.428s 1187.4%

Patch is probably space-eaten, so just whack it in by hand.

--- 2.5.44-mm4-hbaum-12/kernel/sched.c 2002-10-27 19:54:25.000000000 -0800
+++ 2.5.44-mm4-hbaum-12-first_low/kernel/sched.c 2002-10-27 16:42:10.000000000 -0800
@@ -2206,6 +2206,8 @@
 		if (minload > cpu_rq(cur_cpu)->nr_running) {
 			minload = cpu_rq(cur_cpu)->nr_running;
 			best_cpu = cur_cpu;
+			if (minload == 0)
+				break;
 		}
 		if (++cur_cpu >= NR_CPUS)
 			cur_cpu = 0;

2002-10-28 07:13:19

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML, I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see) or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations? As I'm not seeing the effect, maybe
> you could do the following experiment:
> In sched_best_node() keep only the "while" loop at the beginning. This
> leads to a cheap selection of the next node, just a simple round robin.

I did this ... presume that's what you meant:

static int sched_best_node(struct task_struct *p)
{
	int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id();
	int cpu, pool, load;
	unsigned long mask = p->cpus_allowed & cpu_online_map;

	do {
		/* atomic_inc_return is not implemented on all archs [EF] */
		atomic_inc(&sched_node);
		best_node = atomic_read(&sched_node) % numpools;
	} while (!(pool_mask[best_node] & mask));

	return best_node;
}

Odd. Seems to make it even worse.

Kernbench:
Elapsed User System CPU
2.5.44-mm4-focht-12 20.32s 190s 44.4s 1153.6%
2.5.44-mm4-focht-12-lobo 21.362s 193.71s 48.672s 1134%

The diffprofiles below make it look like this just leads to bad decisions.
Very odd ... compare with what happened when I put Michael's balance_exec
on instead. I'm tired; maybe I did something silly.
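
[Editor's note: the round-robin experiment described above amounts to
something like the sketch below: bump a shared counter until it lands on a
node whose CPU mask intersects the task's allowed mask. The counter is a
plain int here where the real patch uses an atomic, and the masks and node
count are invented for illustration.]

#include <stdio.h>

#define NUMPOOLS 4

/* invented per-node CPU masks: 4 CPUs per node */
static unsigned long pool_mask[NUMPOOLS] = { 0x000f, 0x00f0, 0x0f00, 0xf000 };
static int sched_node;		/* atomic_t in the real patch */

static int round_robin_node(unsigned long cpus_allowed)
{
	int best_node;

	/* keep advancing until we hit a node the task may run on */
	do {
		sched_node++;
		best_node = sched_node % NUMPOOLS;
	} while (!(pool_mask[best_node] & cpus_allowed));

	return best_node;
}

int main(void)
{
	unsigned long allowed = 0xffff;	/* task may run anywhere */
	int i;

	for (i = 0; i < 6; i++)
		printf("exec %d -> node %d\n", i, round_robin_node(allowed));
	return 0;
}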

diffprofile 2.5.44-mm4-focht-1 2.5.44-mm4-focht-12

606 page_remove_rmap
566 do_schedule
488 page_add_rmap
475 .text.lock.file_table
370 __copy_to_user
306 strnlen_user
272 d_lookup
235 find_get_page
233 get_empty_filp
193 atomic_dec_and_lock
161 copy_process
159 sched_best_node
135 flush_signal_handlers
131 complete
116 filemap_nopage
109 __fput
105 path_lookup
103 follow_mount
95 zap_pte_range
92 file_move
91 do_no_page
87 release_task
80 do_page_fault
62 lru_cache_add
62 link_path_walk
62 do_generic_mapping_read
57 find_trylock_page
55 release_pages
50 dup_task_struct
...
-73 do_anonymous_page
-478 __copy_from_user

diffprofile 2.5.44-mm4-focht-12 2.5.44-mm4-focht-12-lobo

567 do_schedule
482 do_anonymous_page
383 page_remove_rmap
336 __copy_from_user
333 page_add_rmap
241 zap_pte_range
213 init_private_file
189 strnlen_user
186 buffered_rmqueue
172 find_get_page
124 complete
111 filemap_nopage
97 free_hot_cold_page
89 flush_signal_handlers
86 clear_page_tables
79 do_page_fault
79 copy_process
75 d_lookup
74 path_lookup
71 sched_best_cpu
68 do_no_page
58 release_pages
58 __set_page_dirty_buffers
52 wait_for_completion
51 release_task
51 handle_mm_fault
...
-53 lru_cache_add
-73 dentry_open
-100 sched_best_node
-108 file_ra_state_init
-402 .text.lock.file_table

2002-10-28 16:28:33

by Erich Focht

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Monday 28 October 2002 01:31, Martin J. Bligh wrote:
> OK, so I'm trying to read your patch 1, fairly unsuccessfully
> (seems to be a lot more complex than Michael's).
>
> Can you explain pool_lock? It does actually seem to work, but
> it's rather confusing ....

The pool data is needed to be able to loop over the CPUs of one node,
only. I'm convinced we'll need to do that sometime, no matter how simple
the core of the NUMA scheduler is.

The pool_lock is protecting that data while it is built. This can happen
in future more often, if somebody starts hotplugging CPUs.

> build_pools() has a comment above it saying:
>
> +/*
> + * Call pooldata_lock() before calling this function and
> + * pooldata_unlock() after!
> + */
>
> But then you promptly call pooldata_lock inside build_pools
> anyway ... looks like it's just a naff comment, but doesn't
> help much.

Sorry, the comment came from a former version...

> just block). If you really still need to do this, RCU is now
> in the kernel ;-) If not, can we just chuck all that stuff?

I'm preparing a core patch which doesn't need the pool_lock. I'll send it
out today.

Regards,
Erich

2002-10-28 16:56:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

> The pool data is needed to be able to loop over the CPUs of one node,
> only. I'm convinced we'll need to do that sometime, no matter how simple
> the core of the NUMA scheduler is.

Hmmm ... is using node_to_cpumask from the topology stuff, then looping
over that bitmask insufficient?

> The pool_lock is protecting that data while it is built. This can happen
> in future more often, if somebody starts hotplugging CPUs.

Heh .... when someone actually does that, we'll have a lot more problems
than just this to solve. Would be nice to keep this stuff simple for now, if
possible.

> Sorry, the comment came from a former version...

No problem, I suspected that was all it was.

>> just block). If you really still need to do this, RCU is now
>> in the kernel ;-) If not, can we just chuck all that stuff?
>
> I'm preparing a core patch which doesn't need the pool_lock. I'll send it
> out today.

Cool! Thanks,

M.
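
[Editor's note: the node_to_cpumask suggestion above amounts to a loop like
the one below: fetch a per-node CPU bitmask from the topology code and
iterate only the set bits. The mask values and the plain unsigned long
representation are assumptions for illustration; the real topology macros
vary per architecture.]

#include <stdio.h>

#define NR_CPUS 16

/* invented topology: node n owns CPUs 4n..4n+3 */
static unsigned long node_to_cpumask(int node)
{
	return 0xfUL << (node * 4);
}

int main(void)
{
	int node = 2, cpu;
	unsigned long mask = node_to_cpumask(node);

	/* walk only the CPUs belonging to this node */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(mask & (1UL << cpu)))
			continue;
		printf("node %d owns cpu %d\n", node, cpu);
	}
	return 0;
}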

2002-10-28 17:05:42

by Erich Focht

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Monday 28 October 2002 01:46, Martin J. Bligh wrote:
> Erich, what does all the pool stuff actually buy us over what
> Michael is doing? Seems to be rather more complex, but maybe
> it's useful for something we're just not measuring here?

The more complicated stuff is for achieving equal load between the
nodes. It delays steals more when the stealing node is at average load, and
less when it is unloaded. This is the place where we can make it cope
with more complex machines with multiple levels of memory hierarchy
(like our 32 CPU TX7). Equal load among the nodes is important if you
have memory bandwidth eaters, as the bandwidth in a node is limited.

When introducing node affinity (which shows good results for me!) you
also need a more careful ranking of the tasks which are candidates to
be stolen. The routine task_to_steal does this and is another source
of complexity. It is another point where the multilevel stuff comes in.
In the core part of the patch the rank of the steal candidates is computed
by only taking into account the time which a task has slept.

I attach the script for getting some statistics on the numa_test. I
consider this test more sensitive to NUMA effects, as it is a bandwidth
eater also needing good latency.
(BTW, Martin: in the numa_test script I've sent you the PROBLEMSIZE must
be set to 1000000!).

Regards,
Erich


>
> 2.5.44-mm4 Virgin
> 2.5.44-mm4-focht-1 Focht main
> 2.5.44-mm4-hbaum-1 Hbaum main
> 2.5.44-mm4-focht-12 Focht main + Focht balance_exec
> 2.5.44-mm4-hbaum-12 Hbaum main + Hbaum balance_exec
> 2.5.44-mm4-f1-h2 Focht main + Hbaum balance_exec
>
> Kernbench:
> Elapsed User System CPU
> 2.5.44-mm4 19.676s 192.794s 42.678s 1197.4%
> 2.5.44-mm4-focht-1 19.46s 189.838s 37.938s 1171%
> 2.5.44-mm4-hbaum-1 19.746s 189.232s 38.354s 1152.2%
> 2.5.44-mm4-focht-12 20.32s 190s 44.4s 1153.6%
> 2.5.44-mm4-hbaum-12 19.322s 190.176s 40.354s 1192.6%
> 2.5.44-mm4-f1-h2 19.398s 190.118s 40.06s 1186%
>
> Schedbench 4:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 32.45 49.47 129.86 0.82
> 2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
> 2.5.44-mm4-hbaum-1 37.81 46.44 151.26 0.78
> 2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
> 2.5.44-mm4-hbaum-12 22.26 34.70 89.09 0.70
> 2.5.44-mm4-f1-h2 21.39 35.97 85.57 0.81
>
> Schedbench 8:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 39.90 61.48 319.26 2.79
> 2.5.44-mm4-focht-1 37.76 61.09 302.17 2.55
> 2.5.44-mm4-hbaum-1 43.18 56.74 345.54 1.71
> 2.5.44-mm4-focht-12 28.40 34.43 227.25 2.09
> 2.5.44-mm4-hbaum-12 30.71 45.87 245.75 1.43
> 2.5.44-mm4-f1-h2 36.11 45.18 288.98 2.10
>
> Schedbench 16:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 62.99 93.59 1008.01 5.11
> 2.5.44-mm4-focht-1 51.69 60.23 827.20 4.95
> 2.5.44-mm4-hbaum-1 52.57 61.54 841.38 3.93
> 2.5.44-mm4-focht-12 51.24 60.86 820.08 4.23
> 2.5.44-mm4-hbaum-12 52.33 62.23 837.46 3.84
> 2.5.44-mm4-f1-h2 51.76 60.15 828.33 5.67
>
> Schedbench 32:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 88.13 194.53 2820.54 11.52
> 2.5.44-mm4-focht-1 56.71 123.62 1815.12 7.92
> 2.5.44-mm4-hbaum-1 54.57 153.56 1746.45 9.20
> 2.5.44-mm4-focht-12 55.69 118.85 1782.25 7.28
> 2.5.44-mm4-hbaum-12 54.36 135.30 1739.95 8.09
> 2.5.44-mm4-f1-h2 55.97 119.28 1791.39 7.20
>
> Schedbench 64:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 159.92 653.79 10235.93 25.16
> 2.5.44-mm4-focht-1 55.60 232.36 3558.98 17.61
> 2.5.44-mm4-hbaum-1 71.48 361.77 4575.45 18.53
> 2.5.44-mm4-focht-12 56.03 234.45 3586.46 15.76
> 2.5.44-mm4-hbaum-12 56.91 240.89 3642.99 15.67
> 2.5.44-mm4-f1-h2 56.48 246.93 3615.32 16.97


Attachments:
numabench (874.00 B)

2002-10-28 17:20:35

by Erich Focht

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Monday 28 October 2002 17:57, Martin J. Bligh wrote:
> > I'm preparing a core patch which doesn't need the pool_lock. I'll send it
> > out today.
>
> Cool! Thanks,

OK, here it comes. The core doesn't use the loop_over_nodes() macro any
more. There's one big loop over the CPUs for computing node loads and
the most loaded CPUs in find_busiest_queue. The call to build_cpus()
isn't critical any more. Functionality is the same as in the previous
patch (i.e. steal delays, ranking of task_to_steal, etc...).

I kept the loop_over_node() macro for compatibility reasons with the
additional patches. You might need to replace in the additional patches:
numpools -> numpools()
pool_nr_cpus[] -> pool_ncpus()

I'm puzzled about the initial load balancing impact and have to think
about the results I've seen from you so far... In the environments I am
used to, the frequency of exec syscalls is rather low, therefore I didn't
care too much about the sched_balance_exec performance and preferred to
try harder to achieve good distribution across the nodes.

Regards,
Erich


Attachments:
01-numa_sched_core-2.5.39-12b.patch (16.17 kB)
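
[Editor's note: the "one big loop over the CPUs" Erich describes can be
pictured with the sketch below: a single pass that accumulates each node's
total load and remembers the most loaded CPU per node. The CPU-to-node
mapping and load numbers are invented, and the real find_busiest_queue does
considerably more.]

#include <stdio.h>

#define NR_CPUS  8
#define NR_NODES 2

static int cpu_to_node(int cpu) { return cpu / 4; }	/* invented topology */
static int nr_running[NR_CPUS]  = { 2, 1, 3, 0, 5, 1, 1, 2 };

int main(void)
{
	int node_load[NR_NODES]   = { 0, 0 };
	int busiest_cpu[NR_NODES] = { -1, -1 };
	int cpu, node;

	/* one pass over all CPUs: per-node load totals plus the most
	   loaded CPU of each node */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		node = cpu_to_node(cpu);
		node_load[node] += nr_running[cpu];
		if (busiest_cpu[node] < 0 ||
		    nr_running[cpu] > nr_running[busiest_cpu[node]])
			busiest_cpu[node] = cpu;
	}

	for (node = 0; node < NR_NODES; node++)
		printf("node %d: load %d, most loaded cpu %d\n",
		       node, node_load[node], busiest_cpu[node]);
	return 0;
}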

2002-10-28 17:47:50

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

> I'm puzzled about the initial load balancing impact and have to think
> about the results I've seen from you so far... In the environments I am
> used to, the frequency of exec syscalls is rather low, therefore I didn't
> care too much about the sched_balance_exec performance and preferred to
> try harder to achieve good distribution across the nodes.

OK, but take a look at Michael's second patch. It still looks at
nr_running on every queue in the system (with some slightly strange
code to make a rotating choice on nodes on the case of equality),
so should still be able to make the best decision .... *but* it
seems to be much cheaper to execute. Not sure why at this point,
given the last results I sent you last night ;-)

M.

2002-10-28 17:47:49

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

>> Schedbench 4:
>> Elapsed TotalUser TotalSys AvgUser
>> 2.5.44-mm4 32.45 49.47 129.86 0.82
>> 2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
>> 2.5.44-mm4-hbaum-1 37.81 46.44 151.26 0.78
>> 2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
>> 2.5.44-mm4-hbaum-12 22.26 34.70 89.09 0.70
>> 2.5.44-mm4-f1-h2 21.39 35.97 85.57 0.81
>
> One more remark:
> You seem to have made the numa_test shorter. That reduces it to being
> simply a check for the initial load balancing as the hackbench running in
> the background (and aimed to disturb the initial load balancing) might
> start too late. You will most probably not see the impact of node affinity
> with such short running tests. But we weren't talking about node affinity,
> yet...

I didn't modify what you sent me at all ... perhaps my machine is
just faster than yours?

/me ducks & runs ;-)

M.

2002-10-28 17:47:51

by Erich Focht

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Monday 28 October 2002 01:46, Martin J. Bligh wrote:
> 2.5.44-mm4 Virgin
> 2.5.44-mm4-focht-1 Focht main
> 2.5.44-mm4-hbaum-1 Hbaum main
> 2.5.44-mm4-focht-12 Focht main + Focht balance_exec
> 2.5.44-mm4-hbaum-12 Hbaum main + Hbaum balance_exec
> 2.5.44-mm4-f1-h2 Focht main + Hbaum balance_exec
>
> Schedbench 4:
> Elapsed TotalUser TotalSys AvgUser
> 2.5.44-mm4 32.45 49.47 129.86 0.82
> 2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
> 2.5.44-mm4-hbaum-1 37.81 46.44 151.26 0.78
> 2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
> 2.5.44-mm4-hbaum-12 22.26 34.70 89.09 0.70
> 2.5.44-mm4-f1-h2 21.39 35.97 85.57 0.81

One more remark:
You seem to have made the numa_test shorter. That reduces it to being
simply a check for the initial load balancing as the hackbench running in
the background (and aimed to disturb the initial load balancing) might
start too late. You will most probably not see the impact of node affinity
with such short running tests. But we weren't talking about node affinity,
yet...

Erich

2002-10-28 18:32:40

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

>> Erich, what does all the pool stuff actually buy us over what
>> Michael is doing? Seems to be rather more complex, but maybe
>> it's useful for something we're just not measuring here?
>
> The more complicated stuff is for achieving equal load between the
> nodes. It delays steals more when the stealing node is at average load, and
> less when it is unloaded. This is the place where we can make it cope
> with more complex machines with multiple levels of memory hierarchy
> (like our 32 CPU TX7). Equal load among the nodes is important if you
> have memory bandwidth eaters, as the bandwidth in a node is limited.
>
> When introducing node affinity (which shows good results for me!) you
> also need a more careful ranking of the tasks which are candidates to
> be stolen. The routine task_to_steal does this and is another source
> of complexity. It is another point where the multilevel stuff comes in.
> In the core part of the patch the rank of the steal candidates is computed
> by only taking into account the time which a task has slept.

OK, it all sounds sane, just rather complicated ;-) I'm going to trawl
through your stuff with Michael, and see if we can simplify it a bit
somehow whilst not changing the functionality. Your first patch seems
to work just fine, it's just the complexity that bugs me a bit.

The combination of your first patch with Michael's balance_exec stuff
actually seems to work pretty well ... I'll poke at the new patch you
sent me + Michael's exec balance + the little perf tweak I made to it,
and see what happens ;-)

> I attach the script for getting some statistics on the numa_test. I
> consider this test more sensitive to NUMA effects, as it is a bandwidth
> eater also needing good latency.
> (BTW, Martin: in the numa_test script I've sent you the PROBLEMSIZE must
> be set to 1000000!).

It is ;-) I'm running 44-mm4, not virgin remember, so things like hot&cold
page lists may make it faster?

M.

2002-10-28 23:43:04

by Erich Focht

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
> >> Schedbench 4:
> >> Elapsed TotalUser TotalSys AvgUser
> >> 2.5.44-mm4 32.45 49.47 129.86 0.82
> >> 2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
> >> 2.5.44-mm4-hbaum-1 37.81 46.44 151.26 0.78
> >> 2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
> >> 2.5.44-mm4-hbaum-12 22.26 34.70 89.09 0.70
> >> 2.5.44-mm4-f1-h2 21.39 35.97 85.57 0.81
> >
> > One more remark:
> > You seem to have made the numa_test shorter. That reduces it to being
> > simply a check for the initial load balancing as the hackbench running in
> > the background (and aimed to disturb the initial load balancing) might
> > start too late. You will most probably not see the impact of node
> > affinity with such short running tests. But we weren't talking about node
> > affinity, yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

:-)))

I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get on a 2.8GHz
XEON something around 16s. On a 1.6GHz Athlon it's 22s. Both times running
./numa_test 2 on a dual CPU box. The usertime is pretty independent of the
OS (but the scheduling influences it a lot).

But: you have a node level cache! Maybe the whole memory is inside that
one and then things can go really fast. Hmmm, I guess I'll need some
cache detection in the future to enforce that the BM really runs in
memory... Increasing PROBLEMSIZE might help, but we can do that later,
when testing affinity (I'm not giving up on this idea... ;-)

Regards,
Erich

2002-10-29 00:00:17

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

>> I didn't modify what you sent me at all ... perhaps my machine is
>> just faster than yours?
>>
>> /me ducks & runs ;-)
>
> :-)))
>
> I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get on a 2.8GHz
> XEON something around 16s. On a 1.6GHz Athlon it's 22s. Both times running
> ./numa_test 2 on a dual CPU box. The usertime is pretty independent of the
> OS, (but the scheduling influences it a lot).

I have 700MHz P3 Xeons, but I have 2Mb L2 cache on them which is much
better than the newer chips. That might make a big difference.

> But: you have a node level cache! Maybe the whole memory is inside that
> one and then things can go really fast. Hmmm, I guess I'll need some
> cache detection in the future to enforce that the BM really runs in
> memory... Increasing PROBLEMSIZE might help, but we can do that later,
> when testing affinity (I'm not giving up on this idea... ;-)

Yup, 32Mb cache. Not sure if it's faster than local memory or not.

M.

2002-10-29 00:01:08

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Monday 28 October 2002 18:35, Martin J. Bligh wrote:
> > I'm puzzled about the initial load balancing impact and have to think
> > about the results I've seen from you so far... In the environments I am
> > used to, the frequency of exec syscalls is rather low, therefore I didn't
> care too much about the sched_balance_exec performance and preferred to
> > try harder to achieve good distribution across the nodes.
>
> OK, but take a look at Michael's second patch. It still looks at
> nr_running on every queue in the system (with some slightly strange
> code to make a rotating choice on nodes in the case of equality),
> so should still be able to make the best decision .... *but* it
> seems to be much cheaper to execute. Not sure why at this point,
> given the last results I sent you last night ;-)

Yes, I like it! I needed some time to understand that the per_cpu
variables can spread the exec'd tasks across the nodes as well as the
atomic sched_node. Sure, I'd like to select the least loaded node instead
of the least loaded CPU. It can well be that you have just created 10
threads on a node (by fork, therefore still on their original CPU), and have
an idle CPU in the same node (which hasn't yet stolen the newly created
tasks). Suppose your instant load looks like this:
node 0: cpu0: 1, cpu1: 1, cpu2: 1, cpu3: 1
node 1: cpu4: 10, cpu5: 0, cpu6: 1, cpu7: 1

If you exec on cpu0 before cpu5 managed to steal something from cpu4,
you'll aim for cpu5. This would just increase the node-imbalance and
force more of the threads on cpu4 to move to node0, which is maybe bad
for them. Just an example... If you start considering non-trivial
cpus_allowed masks, you might get more of these cases.

We could take this as a design target for the initial load balancer
and keep the fastest version we currently have for the benchmarks
we currently use (Michael's).

Regards,
Erich
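
[Editor's note: Erich's cpu4/cpu5 example can be checked with the toy
program below, which computes both the least loaded CPU and the least
loaded node for the instant load quoted above. Picking by CPU lands on
cpu5, inside the heavily loaded node 1, while picking by node load lands
on node 0, which is the point of the objection. Only the load numbers come
from the example; everything else is invented for illustration.]

#include <stdio.h>

#define NR_CPUS       8
#define CPUS_PER_NODE 4

/* the instant load from the example: node 0 = {1,1,1,1}, node 1 = {10,0,1,1} */
static int nr_running[NR_CPUS] = { 1, 1, 1, 1, 10, 0, 1, 1 };

int main(void)
{
	int cpu, best_cpu = 0, node_load[2] = { 0, 0 };

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (nr_running[cpu] < nr_running[best_cpu])
			best_cpu = cpu;
		node_load[cpu / CPUS_PER_NODE] += nr_running[cpu];
	}

	printf("least loaded cpu:  cpu%d (node %d)\n",
	       best_cpu, best_cpu / CPUS_PER_NODE);
	printf("least loaded node: node %d (loads: %d vs %d)\n",
	       node_load[0] <= node_load[1] ? 0 : 1, node_load[0], node_load[1]);
	return 0;
}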

2002-10-29 01:06:43

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

In message <737410000.1035849619@flay>, "Martin J. Bligh" writes:
>
> Yup, 32Mb cache. Not sure if it's faster than local memory or not.

Yes, NUMA-Q cache can be faster than local memory, but it *only* caches
remote memory. Some other architectures use the L3 cache to cache *all*
memory (local _and_ remote). Reasoning: why pollute the valuable
cache with things that are already close at hand?

gerrit

2002-10-29 22:33:48

by Erich Focht

[permalink] [raw]
Subject: Re: NUMA scheduler (was: 2.5 merge candidate list 1.5)

On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
> >> Schedbench 4:
> >> Elapsed TotalUser TotalSys AvgUser
> >> 2.5.44-mm4 32.45 49.47 129.86 0.82
> >> 2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
> >> 2.5.44-mm4-hbaum-1 37.81 46.44 151.26 0.78
> >> 2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
> >> 2.5.44-mm4-hbaum-12 22.26 34.70 89.09 0.70
> >> 2.5.44-mm4-f1-h2 21.39 35.97 85.57 0.81
> >
> > One more remark:
> > You seem to have made the numa_test shorter. That reduces it to being
> > simply a check for the initial load balancing as the hackbench running in
> > the background (and aimed to disturb the initial load balancing) might
> > start too late. You will most probably not see the impact of node
> > affinity with such short running tests. But we weren't talking about node
> > affinity, yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

Aaargh, now I understand!!! You just have the wrong labels in your table,
they are permuted! This makes more sense:

> >> AvgUser Elapsed TotalUser TotalSys
> >> 2.5.44-mm4 32.45 49.47 129.86 0.82
> >> 2.5.44-mm4-focht-1 38.61 45.15 154.48 1.06
> >> 2.5.44-mm4-hbaum-1 37.81 46.44 151.26 0.78
> >> 2.5.44-mm4-focht-12 23.23 38.87 92.99 0.85
> >> 2.5.44-mm4-hbaum-12 22.26 34.70 89.09 0.70
> >> 2.5.44-mm4-f1-h2 21.39 35.97 85.57 0.81

Regards,
Erich